Challenge your Assumptions
An alert comes in
It’s 11:00am on a Sunday; an alert comes in. And another. And another.
- In total, you see 5 alerts all related to similar issues. All alerts indicate tenants are down.
- Challenge your assumptions.
No one would be working on the weekend making changes, so you dive into the problem. You see the 5 broken containers, and they complain about missing infrastructure.
- You don’t find the infrastructure you expect. Did someone delete the infrastructure?
- Challenge your assumptions.
You begin to look for logs. There should be an event within an hour of the alert going off.
- You don’t find matching logs for any infrastructure being removed.
- Challenge your assumptions.
You check the system of record for this first environment and notice it still shows online.
- This environment should be here.
- Challenge your assumptions.
Finally, a thought. Let me check the system of record for all environments. Wait a second…
- Every other environment shows decommissioned.
- There are tickets saying to decommission these environments.
- Your assumptions were wrong.
The first environment probably just didn’t get updated in the system of record.
Takeaways
- People sometimes do work on the weekend.
- Alerts sometimes indicate a problem on systems that shouldn’t be up in the first place.
- Systems of record are sometimes out of date.
- Logs won’t exist if this isn’t an actual issue.
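The check that finally cracked this one, comparing the alerting environment against the system of record for every other environment, is easy to script. Here is a minimal sketch, assuming the system of record can be exported as a simple mapping of environment name to status. The function name, the environment names, and the "online" and "decommissioned" status strings are all hypothetical and stand in for whatever your tooling actually uses.

```python
def suspect_stale_records(alerting_envs, system_of_record):
    """Return alerting environments that show 'online' while every other
    known environment shows 'decommissioned'.

    system_of_record is a hypothetical dict of environment name -> status.
    """
    suspects = []
    for env in alerting_envs:
        # Only environments the record claims are online are interesting here.
        if system_of_record.get(env) != "online":
            continue
        # Statuses of every environment except the one that is alerting.
        others = [
            status for name, status in system_of_record.items() if name != env
        ]
        # If all of its peers are decommissioned, the record is probably stale.
        if others and all(status == "decommissioned" for status in others):
            suspects.append(env)
    return suspects


# Hypothetical data mirroring the story: one environment still shows online.
records = {
    "env-1": "online",
    "env-2": "decommissioned",
    "env-3": "decommissioned",
    "env-4": "decommissioned",
}

print(suspect_stale_records(["env-1"], records))
```

A hit from a check like this doesn't prove the environment was decommissioned. It only tells you which assumption to challenge first.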