In recent times, I’ve been realizing more and more just how much a
screwed-up management situation can lead to screwed-up technical
situations. I’ve written a bit about this in the past few months, and
got to thinking about a specific anecdote from not too long ago.

I was working on a team which was supposed to be the “last line of
defense” for outages and other badness like that. We kept having issues
with one particular service, owned by another team, that ran on every
system in the fleet and was essential for keeping things going (you
know, the cat pics). We couldn’t figure out why it kept happening.

Eventually, I wound up transferring from my “fixer” team into the
organization which contained the team in question, and my first “tour of
duty” was to embed with that team to figure out what was going on. What
I found was interesting.

The original team had been founded some years before, but none of those
original members were still there. They had moved on to other things
inside the company. There was one person who had joined the team while
the original people were still there, and at this point, he was the only
one left who had “overlapped” with the original devs.

What I found was that this one person who had history going back to when
the “OGs” were still around was basically carrying the load of the
entire team. Everyone else was very new, and so it was up to him.

I got to know him, and found out that he wasn’t batshit or even
malicious. He was just under WAY too much load, and was shipping
insanity as a result. Somehow, we managed to call timeout and got them
to stop shipping broken things for a while. Then I got lucky and
intercepted a few of the zanier ideas while he was still under the
stupid-high load, and we got some other people to step up and start
spreading the load around.

I pitched in too, helping some of the team’s irked customers and doing
some general “customer service” work. My thinking was that
if I could do some “firewall” type work on behalf of the team, it would
give them some headroom so they could relax and figure out how to move
forward.

This pretty much worked. The surprise came later, when the biannual
review cycle started up and the “calibration sessions” got rolling.
They wanted to give this person some bullshit sub-par rating. I
basically said that if they gave him anything less than “meets
expectations”, I would be royally pissed off, since it wasn’t his fault.

What’s kind of interesting is that they asked the same question of one
of my former teammates (who had also been dealing with the fallout from
these same reliability issues), and he said the same thing! We didn’t
know we had both been asked about it until much later. We hadn’t even
discussed the situation with the overloaded engineer. It was just
apparent to both of us.

With both of us giving the same feedback, they took it seriously, and
didn’t hose him over on the review. He went on to do some pretty
interesting work on monitoring and other new projects (bouncing his
ideas off the rest of the team first), and eventually shoved off for
(hopefully) happier shores.

The service, meanwhile, got way better at not breaking things. The team
seemed to gel in a way that it hadn’t before. It even pulled through a
truly crazy Friday night event that you’d think would have caused a full
site outage, but didn’t. Everyone came together and worked the problem.
The biggest impact was that nobody internally could ship new features
for a couple of hours while we figured it out and brought things back to
normal. The outside world never noticed.

Not long after that event, I decided the team had “graduated” and that I
no longer needed to embed with them, and went off to the next wacky team
in that particular slice of the company’s infra organization.

This was never a tech problem. It was one guy with 3 or 4 people’s worth
of load riding on his shoulders who was doing his very best but was
still very much human and so was breaking down under the stress. They
tried to throw him under the bus post-facto, but we wouldn’t stand for
it. This was a management problem for letting it happen in the first
place.

See how it works?
