An alert comes in

It’s 11:00am on a Sunday; an alert comes in. And another. And another.

  • In total, you see 5 alerts, all for similar issues. Every one of them indicates that tenants are down.
  • Challenge your assumptions.

No one would be working on the weekend making changes, so you dive into the problem. You see the 5 broken containers, and they complain about missing infrastructure.

  • You don’t find the infrastructure you expect. Did someone delete the infrastructure?
  • Challenge your assumptions.

You begin to look for logs. There should be an event within an hour of the alert going off.

  • You don’t find matching logs for any infrastructure being removed.
  • Challenge your assumptions.

You check the system of record for this first environment and notice it still shows online.

  • This environment should be here.
  • Challenge your assumptions.

Finally, a thought. Let me check the system of record for all environments. Wait a second…

  • Every other environment shows decommissioned.
  • There are tickets requesting that these environments be decommissioned.
  • Your assumptions were wrong.

The first environment probably just didn’t get updated in the system of record.
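
The cross-check that finally explained these alerts can be sketched in a few lines of Python. This is only an illustration under assumptions, not the tooling used in the incident: the `Alert` type, the `triage` function, the environment names, and the status strings "online" and "decommissioned" are all hypothetical, and the system of record is stood in for by a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    environment: str  # hypothetical environment name, e.g. "env-1"
    message: str


def triage(alerts, system_of_record):
    """Label each alerting environment using the system of record.

    system_of_record is a dict of environment name -> status string.
    It is a stand-in for whatever the real system of record exposes.
    """
    statuses = {
        a.environment: system_of_record.get(a.environment, "unknown")
        for a in alerts
    }
    # Look at ALL alerting environments, not just the first one.
    any_decommissioned = "decommissioned" in statuses.values()

    labels = {}
    for env, status in statuses.items():
        if status == "decommissioned":
            labels[env] = "expected: environment was decommissioned"
        elif any_decommissioned:
            # Alerting together with decommissioned environments:
            # the record for this one may simply be out of date.
            labels[env] = "suspect record: siblings were decommissioned"
        else:
            labels[env] = "investigate: record says " + status
    return labels


if __name__ == "__main__":
    alerts = [Alert(f"env-{i}", "tenants down") for i in range(1, 6)]
    record = {
        "env-1": "online",
        "env-2": "decommissioned",
        "env-3": "decommissioned",
        "env-4": "decommissioned",
        "env-5": "decommissioned",
    }
    for env, label in triage(alerts, record).items():
        print(env, "->", label)
```

The point of the sketch is the order of operations: check the record for every alerting environment before digging into any single one.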


Takeaways

  • People sometimes do work on the weekend.
  • Alerts sometimes indicate a problem on systems that shouldn’t be up in the first place.
  • Systems of record are sometimes out of date.
  • Logs won’t exist if this isn’t an actual issue.