GitLab database post-mortem

Reading post-mortems for fun and education: "On January 31st 2017, we experienced a major service outage for one of our products, the online service GitLab.com. The outage was caused by an accidental removal of data from our primary database server."

Recognizing your mistakes

One of my favorite interview questions is: Tell me of one mistake you made and what happened. Tell me of a second mistake. Tell me of a third. Often a prospect will have one or two readily available, but has to resort to bare honesty by the third. You can learn a lot about them from […]

Google SRE book

The Site Reliability Engineering book is available online. A lot of it doesn’t scale well to small operations, but there are a lot of good tips and lessons learned in there.