Jun 11


Jun 10


Jun 04

“We’ve held that hiring bar constant through the years, even at times when it’s been very hard to find people, and there’s been a lot of pressure to relax that bar in order to increase hiring volume. We’ve never changed our standards in this respect. That has, I think, been incredibly important for the group. Because what you end up with is, a team of people who fundamentally will not accept doing things over and over by hand, but also a team that has a lot of the same academic and intellectual background as the rest of the development organization. This ensures that mutual respect and mutual vocabulary pertains between SRE and SWE.
One of the things you normally see in operations roles as opposed to engineering roles is that there’s a chasm not only with respect to duty, but also of background and of vocabulary, and eventually, of respect. This, to me, is a pathology.” — Site Reliability Engineering

May 27

“Ah, but that’s the rub. The publishers need Amazon because they need the Kindle’s DRM, because they know without that artificial friction their contribution to a book’s fixed costs would become untenable.” — Publishers’ Deal with the Devil | stratechery by Ben Thompson


May 22


May 20


May 19

What is your philosophy on build vs. buy?

In case it isn’t already obvious: for most of my career I was one of those guys who built one of everything and at some level believed that what I built was better than what other people built. As I’ve grown more experienced I’m much more protective of my time, my team’s time and our ability to focus on what differentiates us from other products. My default is buy unless there’s a compelling counter-argument.

” — CTO to CTO: Werner Vogels and Don Neufeld — Medium

May 14


May 13

“The lessons here are at least twofold. First, as complex systems age, the environmental conditions under which they operate change dramatically. Workload typically goes up, workload mix changes over time, and there will be software changes made over time some of which will change the bounds of what system can reliably handle. Knowing this, we must always be retesting production systems with current workload mixes and we must probe the bounds well beyond any reasonable production workload. As these operating and environmental conditions evolve, the testing program must be updated. I have seen complex system fail where, upon closer inspection, it’s found that the system has been operating for years beyond its design objectives and its original test envelope. Systems have to be constantly probed to failure so we know where they will operate stably and where they will fail.

The second lesson is that rare events will happen. I doubt a U2 pass of the western US is all that rare but something about this one was unusual. We need to expect that complex systems will face unexpected environmental conditions and look hard for some form degraded operations mode. Failure fault containment zones should be made as small as possible, we want to look for ways to deliver the system such that some features may fail while others continue to operate. You don’t want to lose all functionality for the entire system in a failure. For all complex systems, we are looking for ways to divide up the problem such that a fault has a small blast radius (less impact) and we want the system to gracefully degrade to less functionality rather than have the entire system hard fail. With backup systems its particularly important that they are designed to fail down to less functionality rather than also hard failing. In this case, having the backup come up and require more flight separation and do less real time collision calculations might be the right answer. Generally, all systems need to have some way to operate at less than full functionality or less than full scale without completely hard failing. These degraded operation modes are hard to find but I’ve never seen a situation where we couldn’t find one.” — Perspectives - Air Traffic Control System Failure & Complex System Testing