GitLab database post-mortem

Reading post-mortems for fun and education: On January 31st 2017, we experienced a major service outage for one of our products, the online service GitLab.com. The outage was caused by an accidental removal of data from our primary database server.

Recognizing your mistakes

One of my favorite interview questions is: Tell me about one mistake you made and what happened. Tell me about a second mistake. Tell me about a third. Often a prospect will have one or two readily available, but has to resort to bare honesty by the third. You can learn a lot about them from […]

Professor Beekums on evaluating developers

How Do You Know A Developer Is Doing A Good Job? Spoiler alert! In my opinion, the best measures are all subjective: Are they good team members? Can they solve problems? Can they write good code? Are they eager to learn? Can you trust them? I agree, and I would add: do they get stuff done?

Google SRE book

The Site Reliability Engineering book is available online. Much of it doesn’t scale down well to small operations, but there are a lot of good tips and lessons learned in there.