Are you the lonely DevOps engineer doing 24/7 on-call? Change it!
Andreas Wittig – 15 May 2019
Are you the only one on your team who takes responsibility for the production system? Are you carrying your laptop with you even in your free time to be able to fix issues in production? Are you unofficially on-call 24/7?
I’ve been in the same situation. Being the lonely DevOps engineer - even if you are part of a bigger team - can be a burden.
But how do you get from a one-person show to a textbook on-call team effort? Here are some ideas on how to change your situation.
Team up with one team member when programming or debugging.
- Share your screen and explain what you are doing to your colleague. Ask your colleague for help to avoid mistakes and to find better solutions.
- Guide your colleague through making changes to the infrastructure from his/her machine. Don’t forget to discuss the “why”.
- Watch your colleague and let her/him explain what she/he is doing to you. Give valuable feedback, but only from time to time.
Repeat the process with all of your team.
Learning how to operate a complex cloud infrastructure is scary for the rest of your team. It is critical to take away your colleagues' fear of breaking production. Make sure to grant the whole team access to a safe learning environment, for example, an AWS account that is used only for experimenting. Even better, provide a separate AWS account for each colleague.
Invest in creating and updating documentation of your cloud infrastructure and operations. Doing so may not be your favorite job, but it’s necessary. Observe the questions from your colleagues and improve the documentation accordingly.
- Illustrate the high-level architecture with a diagram. Lucidchart and Cloudcraft are my favorite tools to create architecture diagrams.
- Illustrate the network topology with a figure.
- Describe the different parts of your architecture.
- Describe your backup and recovery strategy.
- Explain where to find monitoring metrics, alarms, and logs.
Are you planning a significant change in production? Did you improve monitoring or logging? Spread the knowledge and organize a Show & Tell meeting. Thirty minutes should be plenty. Don’t forget to reserve 10 minutes for questions from your colleagues.
Being on-call for a production system leaves your team with a queasy feeling. It takes some time to build the confidence of being able to fix any problem. Support your colleagues by providing runbooks that guide them through localizing and fixing common issues.
A runbook should answer the following questions:
- How to categorize the severity of the incident? For example, by pointing to relevant metrics or logs.
- How to localize the root cause of the failure?
- How to fix the root cause of the incident?
Check out our “ALB UnHealthyHostCount” runbook as an example.
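To make the three questions more concrete, here is a minimal sketch of how a runbook entry could be captured in a structured form. Everything in it is an assumption for illustration: the `Runbook` and `Severity` names, the `classify_severity` helper, the thresholds, and the example steps are hypothetical and not taken from the runbook mentioned above. Adapt them to your own system.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class Runbook:
    """One runbook entry, mirroring the three questions above."""
    alarm: str
    severity_hint: str         # how to categorize the severity
    localize_steps: list[str]  # how to localize the root cause
    fix_steps: list[str]       # how to fix the root cause


def classify_severity(unhealthy_hosts: int, total_hosts: int) -> Severity:
    """Map a metric to a severity. The thresholds are hypothetical."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    ratio = unhealthy_hosts / total_hosts
    if ratio >= 1.0:
        return Severity.CRITICAL  # no healthy host left
    if ratio >= 0.5:
        return Severity.HIGH      # at least half of the hosts are unhealthy
    return Severity.LOW


def render(runbook: Runbook) -> str:
    """Render the runbook as plain text for a wiki page or chat message."""
    lines = [f"Runbook: {runbook.alarm}", "", f"Severity: {runbook.severity_hint}"]
    lines += ["", "Localize the root cause:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(runbook.localize_steps, start=1)]
    lines += ["", "Fix the root cause:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(runbook.fix_steps, start=1)]
    return "\n".join(lines)


# Illustrative content only; replace with steps that match your infrastructure.
example = Runbook(
    alarm="ALB UnHealthyHostCount",
    severity_hint="Compare the number of unhealthy hosts with the total number of hosts.",
    localize_steps=["Check the health check status of each target.",
                    "Inspect the application logs of the unhealthy targets."],
    fix_steps=["Roll back the latest deployment if it caused the failure."],
)

if __name__ == "__main__":
    print(render(example))
    print(classify_severity(unhealthy_hosts=2, total_hosts=4))
```

Keeping runbooks in a structured form like this is optional. A plain wiki page works just as well, as long as it answers the same three questions.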
When handing over responsibility for production to your team, the number of incidents caused by human error will increase. Set a good example. Don’t blame anyone for human error. Organize blameless postmortems instead. Help your team to learn from failure. Don’t forget to get your management on board with this approach as well.
Appreciate colleagues who are doing on-call and take responsibility for production.
- Praise the colleague who completed her/his first weekend or night on-call shift.
- Praise the colleague who takes over an extra on-call turn from a sick colleague.
- Award the “on-call engineer of the month” based on the number of fixed incidents.
- Provide a day off for colleagues who excelled during their on-call shifts.
Or think of other forms of gamification that fit your team spirit. Make sure to get support from management for appreciating colleagues doing on-call shifts.
Are you the lonely DevOps engineer doing 24/7 on-call? Change it! There is no one-size-fits-all solution. But no one besides you will drive the change.
Are you a lonely DevOps engineer? I want to connect. Please contact me!
This blog post is provided by marbot: AWS monitoring & alerting in Slack