[irq]: techie interrupted

30/07/2012

“Chaos Monkey allows for an Opt-In or an Opt-Out model. At Netflix, we use the Opt-Out model, so if an application owner does nothing, Chaos Monkey will be acting on their application. For your organization, you have the option to choose what is right for you. This allows you to “test the water” and try out Chaos Monkey on a specific application to see how it reacts. Not every application can trivially handle an instance going offline. Sometimes it takes a human to manually recover instances, perhaps exercising backups to bring them back. Ideally, engineers work towards making that process easier and faster and eventually automatic. For those applications, there is the ability to Opt-Out of Chaos Monkey. There is also a tunable “probability” that Chaos Monkey uses to control the chance of a termination. A probability of 1 (or 100%) will terminate one instance per day per ASG. If instance recovery is difficult and you only want a termination weekly, you can reduce the probability to 0.2 or 20% (daily is 100%, it runs 5 work days per week, so weekly is 20%). Note that this is still a probability and only meaningful when sampled multiple times. With a 20% probability, Chaos Monkey would terminate one instance a week on average. In practice, it might be 2 days in a row followed by 2 weeks of no terminations, but given a large enough sample it will terminate weekly on average. For an environment as large as Netflix, the configuration can get a bit tricky to manage and for this we have developed a dashboard to help that we hope to open source soon. You can read more about how to configure Chaos Monkey on the documentation wiki.”

The Netflix Tech Blog: Chaos Monkey released into the wild
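
The probability arithmetic in the quote is easy to check with a few lines of Python. This is only a rough sketch, not Chaos Monkey's actual code: it assumes a single ASG, one independent termination trial per work day, and 5 work days per week as stated in the quote. The function names are made up for illustration.

```python
import random

# From the quoted post: Chaos Monkey runs 5 work days per week.
WORK_DAYS_PER_WEEK = 5


def expected_weekly_terminations(probability: float) -> float:
    """Average terminations per week for one ASG (one trial per work day)."""
    return probability * WORK_DAYS_PER_WEEK


def chance_of_quiet_stretch(probability: float, work_days: int) -> float:
    """Chance that no termination happens over the given number of work days."""
    return (1.0 - probability) ** work_days


def simulate_weeks(probability: float, weeks: int, seed: int = 0) -> list[int]:
    """Simulate terminations per week, assuming independent daily trials."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < probability for _ in range(WORK_DAYS_PER_WEEK))
        for _ in range(weeks)
    ]


if __name__ == "__main__":
    p = 0.2
    counts = simulate_weeks(p, weeks=10_000)
    print("expected per week:", expected_weekly_terminations(p))
    print("simulated average:", sum(counts) / len(counts))
    print("chance of 2 weeks with no termination:", chance_of_quiet_stretch(p, 10))
```

Under these assumptions, a probability of 0.2 averages out to one termination per week, while a two-week stretch with no terminations still happens roughly one time in ten, which fits the "2 days in a row followed by 2 weeks of no terminations" remark in the quote.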
