Mean time to resolve (MTTR) isn’t a viable metric for measuring the reliability or security of complex software systems and should be replaced by more trustworthy alternatives. That’s according to a new report from Verica, which argued that MTTR is not an appropriate gauge of software and network failures and outages, partly because of the distribution of incident duration data and because failures in such systems don’t arrive uniformly over time. Site reliability engineering (SRE) teams and others in similar roles should therefore retire MTTR as a key metric and instead look to other strategies, including service level objectives (SLOs) and post-incident data review, the report stated.

MTTR metric not descriptive of system reliability

MTTR originated in manufacturing organizations as a measure of the average time required to repair a failed physical component or device, according to the second annual Verica Open Incident Database (VOID) Report. Such devices, however, had simpler, more predictable operations and wear and tear that lent themselves to reasonably standard and consistent estimates of MTTR, it added. “Over time the use of MTTR has expanded to software systems; software companies view it as an indicator of system reliability and team agility/effectiveness.”

Verica researchers contended that MTTR is not an appropriate metric for complex software systems. “Each failure is inherently different, unlike issues with physical manufacturing devices. Operators of modern software systems regularly invest in improving the reliability of their systems, only to be caught off guard by unexpected and unusual failures.”

“MTTR is appealing because it appears to make clear, concrete sense of what are really messy, surprising situations that don’t lend themselves to simple summaries, but MTTR has too much variance in the underlying data to be a measure of system reliability,” Courtney Nash, lead researcher at Verica, tells CSO. “It also tells us little about what an incident is really like for the organization, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organizationally to fix it, and what the team learned as a result,” she adds. The same set of technological circumstances could conceivably play out in very different ways depending on the responders, what they know or don’t know, their risk appetite, and internal pressures, Nash says.

Using the incident data it collected for the report, Verica claimed it was able to show that MTTR is not descriptive of complex software system reliability, conducting two experiments to test MTTR’s reliability based on previous findings published by Štěpán Davidovič in Incident Metrics in SRE: Critically Evaluating MTTR and Friends. The results showed that reducing incident duration by 10% did not produce a reliable reduction in the calculated MTTR, regardless of sample size (i.e., total number of incidents), the report stated. “Our results [also] highlight how much the extreme variance in duration data can impact calculated changes in MTTR.”
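To see why a real improvement can fail to show up in MTTR, consider a minimal simulation in the spirit of that experiment (this is not the report’s or Davidovič’s actual code, and the lognormal parameters below are hypothetical, chosen only to illustrate heavy-tailed duration data):

```python
# Minimal sketch: does a genuine 10% cut in incident duration reliably
# show up as a lower calculated MTTR? Durations are assumed lognormal
# (heavy-tailed); the parameters are hypothetical, for illustration only.
import random
import statistics

random.seed(42)

def simulated_mttr(n_incidents: int, improvement: float = 0.0) -> float:
    """Mean duration of n_incidents heavy-tailed incidents, optionally
    shortened by a fixed fraction (e.g., 0.10 for a 10% improvement)."""
    durations = [random.lognormvariate(3.5, 1.2) * (1 - improvement)
                 for _ in range(n_incidents)]
    return statistics.mean(durations)

def fraction_showing_improvement(n_incidents: int, trials: int = 2000) -> float:
    """Fraction of trials where the 'improved' sample's MTTR actually
    comes out lower than an independent baseline sample's MTTR."""
    wins = 0
    for _ in range(trials):
        if simulated_mttr(n_incidents, improvement=0.10) < simulated_mttr(n_incidents):
            wins += 1
    return wins / trials

for n in (10, 50, 200, 1000):
    print(f"{n:>5} incidents: improved MTTR lower in "
          f"{fraction_showing_improvement(n):.0%} of trials")
```

With small incident counts and high variance, the “improved” MTTR comes out lower only somewhat more often than chance, which is the variance problem the report describes.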

Implementing alternatives to the MTTR metric

A single averaged number should never have been used to measure or represent the reliability of complex software systems, the report read. “No matter what your (unreliable) MTTR might seem to indicate, you’d still need to investigate your incidents to understand what is truly happening with your systems.” Moving away from MTTR, however, isn’t just swapping one metric for another; it’s a mindset shift, Nash says. “Much the way the early DevOps movement was as much about changing culture as technology, organizations that embrace data-driven decisions and empower people to enact change when and where necessary will be able to reckon with a metric that isn’t useful and adapt.”
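As a rough illustration of one alternative the report points to, an SLO frames reliability as a target and an error budget rather than an average repair time. The sketch below uses entirely hypothetical numbers:

```python
# Minimal sketch of SLO/error-budget accounting as an alternative signal
# to MTTR. All figures are hypothetical and for illustration only.

SLO_TARGET = 0.999            # e.g., 99.9% of requests succeed over the window
WINDOW_REQUESTS = 10_000_000  # total requests observed in the SLO window
FAILED_REQUESTS = 7_200       # failed requests observed in the same window

achieved = 1 - FAILED_REQUESTS / WINDOW_REQUESTS       # actual success rate
error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS      # failures the SLO allows
budget_remaining = 1 - FAILED_REQUESTS / error_budget  # share of budget left

print(f"Achieved availability: {achieved:.4%} (target {SLO_TARGET:.1%})")
print(f"Error budget remaining: {budget_remaining:.0%}")
```

Unlike an averaged incident duration, the remaining error budget reflects the cumulative user-facing impact over the measurement window, which pairs naturally with the post-incident review the report also recommends.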