A real story about life-critical IT (aka how 5 minutes of work kept me awake all night)

Posted: 2020-05-21 | Categories: tech

Back in 2009-2010, when I was a co-founder of Coblan srl (now defunct), we won our first tender: managing the entire infrastructure of two significant public hospitals that received hundreds of thousands of patients every year. We were two weeks into the contract, still trying to get a sense of the very complicated set of services, privacy needs and security requirements (e.g. 800 VMs, terabytes of data on Hitachi storage, massive 100+ CPU systems and (too) many clustered Oracle databases). From a sysadmin perspective it promised to be fun, but the learning curve was steep.

Then, one night at **3 am**, while I was in my small hotel room without wifi (2009, remember?), I got a short call from the hospital which, in summary, said: “The blood bank software is not working, we have a patient in surgery, you have 5 minutes.” Not fun, not fun at all, and no Internet.

So I woke up everyone on my team and, lucky me, they responded in less than 2 minutes. After finding the blood bank's set of servers (1 minute), then managing to access the server and run one basic check (another 2 minutes), we diagnosed the problem: the Oracle server was not doing its job but was still running, so the failsafe mechanism had not been triggered (2 minutes, we are now at 7).

Eventually, we forced the server to stop (kill -9, in tech terms), and the failsafe mechanism was activated, restoring everything to normal.
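For the technically curious: a server that is running but not answering is the gap between checking that a process exists and checking that the service actually works. Below is a minimal Python sketch of that idea. It is not what the hospital ran. The names `is_healthy`, `should_fail_over` and `hung_probe` are hypothetical, and the probe is a stand-in for something like a trivial query against the real database.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def is_healthy(probe: Callable[[], bool], timeout_s: float = 2.0) -> bool:
    """Return True only if the probe succeeds within the timeout.

    `probe` is an assumption: any callable that exercises the service
    end to end, e.g. a trivial query against the database.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        return bool(future.result(timeout=timeout_s))
    except FutureTimeout:
        # The service accepted the request but never answered: treat as dead.
        return False
    except Exception:
        # The probe itself failed (connection refused, error reply, ...).
        return False
    finally:
        # Do not wait for a stuck probe; its thread is abandoned, not killed.
        pool.shutdown(wait=False)


def should_fail_over(process_running: bool,
                     probe: Callable[[], bool],
                     timeout_s: float = 2.0) -> bool:
    """Fail over when the process is gone OR alive but unresponsive."""
    return (not process_running) or (not is_healthy(probe, timeout_s))


def hung_probe() -> bool:
    """Stand-in for a server that is up but stuck: it answers too late."""
    time.sleep(0.3)
    return True
```

With a check that looks only at `process_running`, the hung case never triggers a failover. Calling `should_fail_over(True, hung_probe, timeout_s=0.05)` also treats "alive but silent" as a failure, the same condition we ended up forcing by hand that night.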

I did not sleep for the rest of the night, waiting for another call and wide awake with adrenaline. At 7 am, when I arrived at the hospital, the only news I got was, “no-one died tonight, so it is all ok”. I don’t remember my reaction; at that point, my brain was relaxing.

Today I still have vivid memories of those events; that night was when I realised what life-critical events really are. Since that day, my reaction to non-life-threatening events has been calm. If no-one is at risk of dying, then it’s ok to be ok.