Bonecage - I Phreak Alone music video (by Phone Losers of America)
Ambient Backscatter (by uwsensor)
(via xkcd: Heartbleed)
A reliable storage system is one that can be trusted to perform well under all states of operation, and that kind of predictable performance is difficult to achieve. In a predictable system, worst-case performance is crucial; average performance not so much. In a well-implemented, correctly provisioned system, average performance is very rarely a cause for concern. But throughout the company we look at metrics like p999 and p9999 latencies (the 99.9th and 99.99th percentiles), so we care how slow the slowest 0.01% of requests to the system are. We have to design and provision for worst-case throughput. For example, it is irrelevant that steady-state performance is acceptable if a periodic bulk job degrades performance for an hour every day.
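To make the gap between average and tail latency concrete, here is a minimal sketch of how p999 and p9999 could be computed from raw latency samples. It uses the nearest-rank percentile definition. The function names and the sample data are assumptions made for illustration, not anything taken from Manhattan's tooling.

```python
import math


def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = math.ceil(p / 100 * len(sorted_samples))
    return sorted_samples[max(rank, 1) - 1]


def latency_summary(samples_ms):
    """Return the mean alongside median and tail percentiles."""
    ordered = sorted(samples_ms)
    return {
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p999": percentile(ordered, 99.9),
        "p9999": percentile(ordered, 99.99),
    }


# Hypothetical workload: 100,000 requests, of which 20 are very slow.
samples = [2.0] * 99_980 + [500.0] * 20
summary = latency_summary(samples)
print(summary)
```

In this made-up workload the mean, the median, and even p999 all look healthy at roughly 2 ms, while p9999 exposes the 500 ms outliers. That is the sense in which average performance can hide the requests that matter most.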
Because predictability is such a priority, we had to plan for good performance during any potential issue or failure mode. The customer is not interested in our implementation details or excuses; either our service works for them and for Twitter or it does not. Even if we have to make an unfavorable trade-off to protect against a very unlikely issue, we must remember that rare events are no longer rare at scale.
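The claim that rare events stop being rare at scale follows from simple arithmetic. The sketch below assumes each request independently hits a given failure with the same small probability; real failures are often correlated, so treat this as a rough illustration. The one-in-a-million rate and the request counts are hypothetical numbers chosen for the example.

```python
def prob_at_least_one(p_single, trials):
    """Probability that an event with per-trial probability p_single
    occurs at least once in `trials` independent trials."""
    return 1.0 - (1.0 - p_single) ** trials


# Hypothetical: a failure that affects one request in a million.
p = 1e-6
for requests in (1_000, 1_000_000, 100_000_000):
    chance = prob_at_least_one(p, requests)
    expected = p * requests
    print(f"{requests:>11,} requests: "
          f"P(at least one) = {chance:.4f}, expected count = {expected:g}")
```

Under these assumptions, an event that is almost never seen in a thousand requests becomes more likely than not within a million, and is effectively certain, with about a hundred occurrences expected, within a hundred million.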
With scale come not only large numbers of machines and requests and large amounts of data, but also human factors: a growing number of people both use and support the system. We manage this by focusing on a number of concerns:
- if a customer causes a problem, the problem should be limited to that customer and not spread to others
- it should be simple, both for us and for the customer, to tell if an issue originates in the storage system or their client
- for potential issues, we must minimize the time to recovery once the problem has been detected and diagnosed
- we must be aware of how various failure modes will manifest for the customer
- an operator should not need deep, comprehensive knowledge of the storage system to complete regular tasks or diagnose and mitigate most issues
And finally, we built Manhattan with the experience that when operating at scale, complexity is one of your biggest enemies. Ultimately, simple and working trumps fancy and broken. We prefer something that is simple, works reliably and consistently, and provides good visibility over something that is fancy and ultra-optimal in theory but that, in practice and implementation, doesn’t work well, provides poor visibility or operability, or violates other core requirements.