There is a moment in every company when 24x7 support is needed. Congrats! The next step is to start building an on-call team. In this article, we'll go through some of the aspects you should consider. We'll keep it short here and go deeper into each step in future articles.
If you haven't had the chance, check out the previous article in this series, "What is on-call? Why is it important?".
This is a crucial first step. First, you need to stop and think. What does on-call mean for us? What do you want to achieve?
Many companies skip this step. They hope everyone shares the same definition and the same understanding. That never happens. So you should define it and ensure everyone in the company understands it.
Usually, significant problems arise if this is not set correctly from the beginning. The company's default behavior will be to treat on-call with a "fix everything that happens" attitude: calling someone, or triggering an alarm, because one user is mad that they can't log in. Sure, it might be a big problem, but if only 1 in 1,000,000 users has that problem, is it that important?
So from this step, you should come away with a set of rules for what on-call is. And, most importantly, what on-call isn't.
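One way to make such a rule explicit is to write it down as a simple check. The sketch below is only an illustration: the function name `should_page` and the 1% threshold are assumptions, not values from this article, so pick whatever matches your own definition of on-call.

```python
def should_page(affected_users: int, total_users: int, threshold: float = 0.01) -> bool:
    """Return True when the share of affected users justifies paging someone.

    The default threshold (1% of users) is a hypothetical value.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return affected_users / total_users >= threshold


# One user in a million cannot log in: not an on-call matter under this rule.
print(should_page(1, 1_000_000))        # False
# A fifth of all users cannot log in: page the on-call person.
print(should_page(200_000, 1_000_000))  # True
```

A written rule like this gives everyone the same answer to "is this worth a page?", which is the point of defining on-call up front.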
On-call is a loaded term in IT. People are afraid of it. It can mean waking up in the middle of the night, not being able to go to the movies, and always having your computer with you, ready to go into action.
That's not to say these things don't happen, but people usually make it a bigger problem than it is.
So talk with them. Understand their fears and concerns. Don't try to sell them on-call. Just listen to your team.
Considering your service needs and your expected on-call team size, you need to start setting up rotations and shifts.
A shift is the number of hours or days that someone will be on-call. It can be whatever you want and whatever works best for your people: for example, one whole week, the five weekdays, or one weekend.
A rotation is the algorithm you use to decide how shifts rotate between people.
These need to be balanced so that people stay in good shape for their "normal" working day.
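To make shifts and rotations concrete, here is a minimal sketch of one possible rotation: a plain round-robin with fixed-length shifts. The function `build_rotation`, the team names, and the start date are all hypothetical; real schedules usually also have to account for vacations, time zones, and swaps.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(people, start, shift_days=7, shifts=6):
    """Assign consecutive shifts to people in round-robin order.

    Returns a list of (shift_start, shift_end, person) tuples, where
    shift_end is the last day of the shift (inclusive).
    """
    if not people:
        raise ValueError("at least one person is required")
    schedule = []
    for index, person in enumerate(islice(cycle(people), shifts)):
        shift_start = start + timedelta(days=index * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, person))
    return schedule


# Hypothetical team of three with weekly shifts starting on a Monday.
for shift_start, shift_end, person in build_rotation(["Ana", "Bruno", "Carla"], date(2024, 1, 1)):
    print(f"{shift_start} to {shift_end}: {person}")
```

Round-robin keeps the load even: when the number of shifts is a multiple of the team size, everyone gets the same number of shifts.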
There are multiple strategies for setting up compensation:
There are no hard rules for setting up compensation. Usually, it's decided based on your expectations and current reality. Every strategy has its own pros and cons.
This will happen eventually: something comes up that the on-call person is not able to fix on their own. They will need help.
Imagine a security incident where there's a breach in the middle of the night. Would you leave it in the hands of a single person, given all the details and processes needed in that situation? Multiple people from multiple teams would need to be brought in to resolve the crisis.
You need to set up a process for when this happens. What should trigger it, and what steps should people follow? Create guidelines for people to follow in those events.
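Such guidelines can start out as a small, explicit escalation policy. The sketch below assumes a time-based trigger (minutes without acknowledgement); the level names and timings are placeholders, not recommendations, and a real process would also cover severity-based triggers such as the security breach described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    name: str           # who gets notified at this level
    after_minutes: int  # minutes without acknowledgement before notifying them


# Hypothetical policy: names and timings are placeholders.
POLICY = [
    EscalationLevel("primary on-call", 0),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("engineering manager", 30),
]


def who_to_notify(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Return everyone who should have been notified by now."""
    return [level.name for level in policy if minutes_unacknowledged >= level.after_minutes]


print(who_to_notify(5))   # ['primary on-call']
print(who_to_notify(20))  # ['primary on-call', 'secondary on-call']
```

Writing the policy down as data makes it easy to review with the team and to change as the company evolves.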
The process can be complex, and you do not want to manage it by hand. You will need tools for:
The reality is that after you set up everything, there's still a lot of work needed.
Keep an eye on how people are feeling. Gather feedback. It's very common for people to eventually get tired. Learn to recognize those signs and adapt.
You need to manage the process. Companies evolve: new systems come to life, and others die. Reevaluate your approach from time to time, and adapt to reality.
You won't get it perfect on the first try, or even on the second. You need to start, do it, and then learn the caveats of on-call in your company. Adjust and proactively make changes that improve the process for the company and the people.
This is a simple introduction to the steps you need to start running on-call in your company. There's a lot more to say, and we will go deeper into each topic in upcoming articles over the next weeks.