SLO Dashboarding
Jerome Comptdaer
* The SLO graph uses the rolling window attribute to restrict the timeline. It would be great to decouple the two and see the history of the SLI over a different period of time (year, quarter, month, etc.).
* Is there a way to present an aggregated view of SLI/SLO compliance/budget metrics covering the SLOs of multiple services (an Application)?
Jean Sandberg
Thanks for the requests, Jerome! We will very likely implement your first request, though it's not on the near-term roadmap. Regarding aggregation, do you mean somehow combining a metric from multiple services into a single number? Or just displaying them together on one screen?
Jerome Comptdaer
Jean Sandberg: Great, thanks for your response, Jean. Even if it isn't visualised yet, do you already store the data, so that we'll be able to see the history once the functionality becomes available?
Yes, I'd ultimately like to visualise a single number reflecting SLO compliance across multiple services. As the number of services grows (>100), a single number makes it easier to get a high-level picture of the platform and then drill down into any group that is falling behind.
It could be, for instance, the X% of services that are SLO compliant. Ideally, though, I'd like something more accurate that also accounts for the traffic each service handles. For example, with two services, one serving 10k requests per hour at 100% SLI and the other serving 1 request per hour at 0% SLI, the group would come out at roughly 99.99% compliance instead of 50%.
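To make the arithmetic concrete, here is a minimal Python sketch of the difference between averaging per-service percentages and weighting by traffic. It assumes each service can report good and total request counts over the same window; the `ServiceWindow` type, its field names, and the 99.9% target are all hypothetical and not taken from the product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceWindow:
    """Hypothetical per-service request counts over one shared time window."""
    name: str
    good_requests: int
    total_requests: int


def naive_average_sli(services: List[ServiceWindow]) -> Optional[float]:
    """Unweighted mean of per-service SLIs; every service counts equally."""
    ratios = [s.good_requests / s.total_requests
              for s in services if s.total_requests > 0]
    return sum(ratios) / len(ratios) if ratios else None


def traffic_weighted_sli(services: List[ServiceWindow]) -> Optional[float]:
    """Group SLI as total good requests over total requests."""
    good = sum(s.good_requests for s in services)
    total = sum(s.total_requests for s in services)
    return good / total if total > 0 else None


def fraction_compliant(services: List[ServiceWindow], target: float) -> Optional[float]:
    """Share of services with traffic whose own SLI meets the (assumed) target."""
    active = [s for s in services if s.total_requests > 0]
    if not active:
        return None
    met = sum(1 for s in active if s.good_requests / s.total_requests >= target)
    return met / len(active)


# The two-service scenario described above.
group = [
    ServiceWindow("busy", good_requests=10_000, total_requests=10_000),
    ServiceWindow("quiet", good_requests=0, total_requests=1),
]
print(naive_average_sli(group))          # 0.5
print(traffic_weighted_sli(group))       # 10000 / 10001, about 0.9999
print(fraction_compliant(group, 0.999))  # 0.5
```

Summing the counts before dividing is what keeps the one-request service from dragging the group figure down to 50%.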
Jean Sandberg
Jerome Comptdaer: Thanks! Before we start storing more time series information, we first need to change how we're handling/storing metrics in the back end, and that work is starting very soon.
As for an aggregate success rate, I was going to point out what you already did - that you can't just average the percentages. It's probably possible for us to do it the right way for success rate, though I talked to Engineering and it will require some changes to how we're doing things today. An aggregate would not make sense for latency SLOs. This hasn't been on our list, but if we get enough demand for it, that could change.
Within the next week, the dashboard will at least show you how many services are out of SLO.
Thanks so much for the feedback - it does help us prioritize :-)