In my second week as CIO of a small liberal arts college in flyover country, I had my first test of crisis management and crisis communication.
It was a Friday afternoon (naturally), and unbeknownst to me, one of our key vendors had scheduled a modification to an existing, mission-critical drive array, expanding its capacity from 2 TB to 4 TB.
Because I am writing this post, you might have guessed that this didn’t go exactly to plan.
Rather than expanding our existing array two-fold, the entire 2 TB array was wiped out with a few mouse clicks.
In the business, these types of occurrences are known as RGEs – Resume Generating Events.
I was hoping that it wasn’t my resume that was now being queued up in the laser printer.
It was then that I learned that two of my key team members, who normally would be responsible for the care and feeding of these systems, were out. Welcome to Hell, here’s your banjo.
Next, I sent our campus C-level folks a short, succinct note indicating the severity of what had just happened, with scheduled check-in times to keep everyone in the loop. I then set about recalling our key staff to start the effort of putting Humpty Dumpty back together again.
It took all of us working over the weekend. But we were able to recover – mostly – everything.
Now, in the history of catastrophes, it’s rarely one thing that causes the world to seemingly crash down around you; but rather, it’s usually a series of escalating, compounding, cascading events that lead up to a final tipping point.
For us, it began with a technician at our vendor’s data center ignoring several important processes, before giving us the literal and metaphorical finger. And because our 2 TB array was now toast, we found out that our most recent backup was over a week old.
An automated daemon process that was supposed to tell us when our backups had failed, failed to recognize that the backup job itself had been running for well over a hundred hours. The job hadn’t failed in the sense that it had stopped running altogether. But no one had looked at it to recognize the long-running condition, either.
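For the technically inclined, here is a minimal sketch of the kind of check that would have caught this. It is not what our vendor or our team actually ran; the `BackupJob` record, the `backup_alerts` function, and the 12-hour and 36-hour thresholds are all hypothetical, chosen only for illustration. The idea is simply to treat "still running far too long" and "no good backup recently" as alarm conditions, rather than alerting only on an outright failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupJob:
    """Hypothetical record of the most recent backup run."""
    name: str
    started_at: datetime
    finished_at: Optional[datetime] = None  # None means the job is still running
    succeeded: bool = False


def backup_alerts(
    job: BackupJob,
    last_good_backup: Optional[datetime],
    now: datetime,
    max_runtime: timedelta = timedelta(hours=12),  # assumed threshold
    max_age: timedelta = timedelta(hours=36),      # assumed threshold
) -> List[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    alerts = []

    if job.finished_at is None:
        # The job has not failed outright, but it may be stuck.
        runtime = now - job.started_at
        if runtime > max_runtime:
            hours = runtime.total_seconds() / 3600
            alerts.append(f"{job.name}: still running after {hours:.0f} hours")
    elif not job.succeeded:
        alerts.append(f"{job.name}: last run failed")

    # Independently of the current run, check how old the last good backup is.
    if last_good_backup is None or now - last_good_backup > max_age:
        limit = max_age.total_seconds() / 3600
        alerts.append(f"{job.name}: no good backup in the last {limit:.0f} hours")

    return alerts


# Example: a job that has been running for 100 hours, last good backup 8 days ago.
now = datetime(2020, 1, 10, 16, 0)
stuck = BackupJob(name="array-backup", started_at=now - timedelta(hours=100))
for problem in backup_alerts(stuck, now - timedelta(days=8), now):
    print(problem)
```

A check like this is only useful if something, or someone, actually looks at its output every day; it supplements a human review and does not replace it.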
We had just closed our fiscal year two weeks before. Our own personal Armageddon was avoided by this much.
When it rains, it typhoons.
We did live to fight another day. By the hardest.
Wiser, I got to re-learn several important lessons that day; namely, that one should:
- Never schedule mission-critical engineering work when key support people are out of pocket;
- Never let automated processes go unmonitored and unexamined; check them – daily;
- Remember that processes and procedures exist for a reason. If you bet against that notion, then you should be prepared to pay whatever price that entails; and,
- Never – never – schedule infrastructure work for Fridays.
Because when the world ends, it will be at 4:00 pm on a Friday (probably a holiday) afternoon.