Learning from Failure at Scale

13/04/2020
Learning from Failure at Scale

Listen "Learning from Failure at Scale"

Episode Synopsis



One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.