Learning from Failure at Scale
Episode Synopsis
One difficulty for the average network operator trying to understand their failure rates and causes is that they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to tell how often software defects take a device down compared with human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures. Needless to say, the results are fairly interesting.