The 24/7 on-call duty is a relic from a bygone era. Nowadays, the benefits no longer outweigh the negative effects on your health and that of your team.
Knowing that a pager might end your sleep and demand your full attention at any time decreases the quality of your sleep, as the paper The effect on sleep of being on‐call: an experimental field study highlights. Furthermore, it is a truism that sleep is essential for our health. The Division of Sleep Medicine at Harvard Medical School summarizes the effects of sleeping less than seven hours a night as follows: a lack of sleep increases the risk of obesity, diabetes, and cardiovascular disease. Besides, immune function suffers as well, which increases the chance of coming down with a cold, for example.
However, the 24/7 on-call duty is still a thing in many teams who are responsible for a system. What are the arguments for being on-call even during sleeping hours?
- Money - downtime of your system directly or indirectly affects the amount of money the owners of your company are putting into their pockets.
- Perfection - as an engineer you are aiming for perfection, no matter what it takes.
- Safety of Life - beings will suffer if your system is not working correctly.
In my opinion, the only valid reason for 24/7 on-call duty is Safety of Life. Nothing else is worth sacrificing either your sleep or your health.
So, how do you avoid being on-call twenty-four hours a day while still mitigating the risk of downtime impacting the business?
Even if you limit the on-call duty of your team to working hours or waking hours, you will still cover between 24% and 67% of your system’s operating hours. On top of that, you should take the usage pattern of your system into account as well. For example, around 80% of all incoming requests to my blog arrive during my waking hours.
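The 24% and 67% figures can be reproduced with a quick calculation. The following is only a sketch: it assumes that working hours means a 40-hour week and that waking hours means 16 hours a day, seven days a week. Neither assumption is spelled out above, and the function name `coverage` is made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24  # a system running 24/7 operates 168 hours per week


def coverage(on_call_hours_per_week: float) -> float:
    """Return the share of the week covered by on-call duty, in percent."""
    return 100 * on_call_hours_per_week / HOURS_PER_WEEK


# Assumption: working hours = 5 days x 8 hours.
working_hours = coverage(5 * 8)
# Assumption: waking hours = 16 hours a day, every day of the week.
waking_hours = coverage(7 * 16)

print(f"working hours only: {working_hours:.0f}%")
print(f"waking hours only:  {waking_hours:.0f}%")
```

Plug in your own team's schedule to see how much of the week you would actually cover.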
One of the most significant benefits of Cloud Computing in general, and Amazon Web Services in particular, is the ability to automate every part of your infrastructure. Replacing a failed machine is no longer a reason to disrupt or, even worse, wake up one of your team members. For example, an Elastic Load Balancer (ELB) performing health checks and an Auto Scaling Group (ASG) can terminate failed machines and launch fully operational replacements automatically. Building self-healing infrastructure that recovers from a failed machine or a data center outage is becoming the new normal.
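As a rough illustration of the ELB plus ASG setup, here is a minimal sketch using boto3. It assumes a launch template and a load balancer target group already exist. All names, subnet IDs, the ARN, and the sizes are hypothetical placeholders, and the helper functions are made up for this example. Setting `HealthCheckType` to `ELB` tells the Auto Scaling Group to replace instances that fail the load balancer's health checks.

```python
def self_healing_asg_params(name, launch_template, subnet_ids, target_group_arn):
    """Build keyword arguments for boto3's create_auto_scaling_group call."""
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template,
            "Version": "$Latest",
        },
        # Keep at least two machines so one failure does not cause downtime.
        "MinSize": 2,
        "MaxSize": 4,
        "DesiredCapacity": 2,
        # Subnets in different availability zones, passed as a comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Use the load balancer's health checks to decide when to replace a machine.
        "HealthCheckType": "ELB",
        # Seconds to wait after launch before health checks count.
        "HealthCheckGracePeriod": 300,
    }


def create_self_healing_asg(params):
    """Create the group on AWS; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the parameters can be inspected offline

    boto3.client("autoscaling").create_auto_scaling_group(**params)


# Hypothetical placeholder values for illustration only.
params = self_healing_asg_params(
    name="web-asg",
    launch_template="web-template",
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    target_group_arn=(
        "arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
        "targetgroup/web/0123456789abcdef"
    ),
)
# create_self_healing_asg(params)  # uncomment to apply against a real AWS account
```

Spreading the subnets across availability zones is what lets this setup survive a data center outage as well as a single failed machine.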
Also, more and more systems are able to handle failure without affecting the service level at all. Highly distributed systems like Amazon S3 or Amazon DynamoDB store data on multiple nodes, which can take over immediately in case of a failure. AWS offers building blocks that allow you to develop fault-tolerant systems, and more and more software vendors are adopting the paradigm. A fault-tolerant system keeps operating without human intervention in almost all failure conditions.
Get rid of 24/7 on-call duty to improve the health of yourself and your team. Instead, do on-call duty during working hours or waking hours only. Invest heavily in self-healing infrastructure or fault-tolerant systems to mitigate the risk of downtime impacting your business.
In case you can justify 24/7 on-call duty with the safety of life or a similarly important reason, extend your team with remote workers in different time zones.
Are you doing on-call duty during working hours or waking hours only? marbot helps you and your team detect and resolve incidents on AWS.