One of the lessons that I’ve had to repeatedly re-learn over my career is “understand the problem before you fix it.” I try to fix a problem as quickly as I can. It’s a laudable goal, but a fix without understanding may not actually fix the problem. And it may not prevent future occurrences. If you’re particularly unlucky, it will make the problem worse.
I re-learned this lesson late last week. On Thursday, someone reported HTML appearing on translated pages in the Fedora documentation. “Oh! It was probably that PR I merged yesterday,” I thought. So I reverted it.
Then I started digging into it some more, and I realized that it probably wasn’t that change at all. In fact, everything worked locally and on the staging server. It was only broken on the production server. It’s not clear to me whether staging and production sync the translation data on the same schedule. (Without getting too sidetracked: the staging environment isn’t really a staging environment. It needs a better name.) But I became convinced that the problem wasn’t in the docs infrastructure, but in the translations. So I reverted my reversion.
This is not the first time I’ve jumped in to fix something before taking a look around to see what’s going on. Unfortunately, it probably won’t be the last.
Here’s the thing: most of the time, a slight delay doesn’t matter. In this case, no one’s safety was at risk. We weren’t losing hundreds of thousands of dollars a minute. There was no real harm in spending 10 minutes to figure out what was going on. I could have tried to reproduce it first. After all, if you can’t reproduce the error, how do you know you’ve fixed it?
Hopefully the next time I go to fix a problem, I’ll understand the problem first. As astronauts do, I need to work the problem.