Kolton Andrus on Lessons Learnt From Failure Testing at Amazon and Netflix and New Venture Gremlin
Episode Synopsis
In this week's podcast, QCon chair Wesley Reisz talks to Kolton Andrus, the founder of Gremlin Inc. Andrus was previously a Chaos Engineer at Netflix, where he focused on the resilience of the Edge services and designed and built FIT, Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website.
Why listen to this podcast:
- Gremlin, Kolton Andrus' new start-up, is focused on providing failure testing as a service. Version 1, currently in closed beta, is focused on infrastructure failures.
- Lineage-driven Fault Injection (LDFI) allowed Netflix to dramatically reduce the number of tests they needed to run in order to explore a problem space.
- You generally want to run failure tests in production, but you can't start there. Start in development and build up.
- Failure testing at the application level, as Netflix does it, allows request-level fault injection for a specific user or a specific device.
- Being able to trace infrastructure with something like Dapper or Zipkin offers tremendous value. At Netflix, the failure injection system is integrated into the tracing system, which means that when they cause a failure they can see all the points in the system that it touches.
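To make the idea of request-level fault injection concrete, here is a minimal sketch in Python. It is not how FIT or Gremlin is implemented; the names `FaultRule`, `call_dependency`, and `fetch_ratings`, and the shape of the request context, are all hypothetical and chosen only for illustration. The point it shows is that a failure can be scoped to one user or device while every other request proceeds normally.

```python
from dataclasses import dataclass
from typing import Optional


class InjectedFailure(Exception):
    """Raised when a fault rule matches the current request."""


@dataclass(frozen=True)
class FaultRule:
    # Hypothetical rule shape: fail calls to `target` only for matching requests.
    target: str
    user_id: Optional[str] = None      # None means "any user"
    device_type: Optional[str] = None  # None means "any device"

    def matches(self, target: str, context: dict) -> bool:
        if target != self.target:
            return False
        if self.user_id is not None and context.get("user_id") != self.user_id:
            return False
        if self.device_type is not None and context.get("device_type") != self.device_type:
            return False
        return True


def call_dependency(target, context, rules, real_call):
    """Run real_call unless a rule scopes a failure to this request."""
    for rule in rules:
        if rule.matches(target, context):
            raise InjectedFailure(f"injected failure for {target}")
    return real_call()


def fetch_ratings(context, rules):
    """Example caller with a fallback, the behaviour a failure test exercises."""
    try:
        return call_dependency("ratings", context, rules, lambda: ["5 stars"])
    except InjectedFailure:
        return []  # degrade gracefully instead of failing the whole request
```

With a rule such as `FaultRule(target="ratings", user_id="test-user")`, only requests carrying that user ID take the fallback path, which is what makes it plausible to start with a single test account and widen the scope gradually.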
More on this:
Quickly scan our curated show notes on InfoQ. http://bit.ly/2fT9YiM
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq