As in other production environments, cloud services are built on and orchestrated by a wide range of tools, all of which form part of the overall cloud infrastructure. Because cloud services are complex, many detailed steps need to be performed in a specific order for the production environment to run smoothly, whether that means carrying out maintenance procedures, applying updates and upgrades, or resolving issues to prevent downtime.
This is especially true for the Network Operations Center (NOC), which sits at the heart of any cloud service’s production environment. Not only is it critical that every step is performed well, but the order in which the steps are taken matters just as much. For this purpose, the strongest NOC engineers almost always rely on a working document known as a “runbook”, sometimes also referred to as a “playbook” (yes, like in sports).
From Physical to Digital Runbooks
The history of the runbook doesn’t necessarily begin and end with computer systems. Other production environments often rely on runbooks as well. What runbooks do for any type of production environment is simplify and order the entire production cycle by breaking it down into a concrete action plan.
In the case of computer systems and networks, physical or digital runbooks are used to document, synthesize, and explain all the routine operations that engineers and other technical staff need to carry out. NOC engineers as well as R&D, SRE, and DevOps teams, system administrators, and other team members can rely on a runbook to quickly know what to do and when. In short, runbooks are the manuals of every production process.
While runbooks have always existed in physical form, the advent of computer systems has also brought about the digital runbook. Many technology companies, and especially cloud service providers (who need to stay on top of the latest technologies), have converted their runbooks into digital resources. Many also go further by automating their runbooks to achieve greater operational efficiency (which we’ll get into a bit later).
Running a Network Operations Center
Now that we understand what runbooks are, let’s dive further into NOC engineering. Sitting at the heart of every production environment for cloud infrastructure, the NOC is where customer solutions are monitored and maintained. Because cloud service customers depend on the provider’s data center to deploy and use cloud services, NOC engineers work around the clock to ensure maximum uptime.
A typical day in the life of a NOC engineer consists of some of the following tasks:
- Assessing the production environment, network connectivity, and service availability
- Using troubleshooting mechanisms and escalation procedures to work efficiently
- Providing corrective action to debug and/or resolve issues
- Assisting in the management of customer ticketing and reporting
NOC engineers are the first line of response when something goes wrong. They are responsible for the availability of the cloud services provided.
In an ideal world, cloud services are available 24/7 for both external and internal customers. Handling all the complex scenarios that threaten this availability demands a rigorous procedure, which NOC engineers can’t reasonably follow without a clear framework.
Why Using a Runbook Matters
For NOC engineers to perform all tasks to the best of their abilities, they need a runbook. Letting NOC engineers run the NOC for your production environment without one can be catastrophic. Critical steps and procedures go undocumented, and engineers are more likely to forget steps or make errors, all of which lowers the quality of your services and damages service availability for your customers. To avoid this scenario, it is crucial for your organization (particularly if you are a cloud service provider) to streamline NOC operations and document critical knowledge.
To establish easy-to-use and quick-to-implement runbook principles, do the following:
- Don’t let NOC engineers perform tasks without documenting them. Every step needs to be written down and stored.
- Centralize knowledge. Make sure that every documented step or procedure is stored in, or at least moved to, one central location in order to build out a knowledge center.
- Use engineer-friendly words and images. Write runbook documentation for NOC engineers, using language and visuals they can readily understand.
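To make these principles concrete, here is a minimal Python sketch of a digital runbook: ordered steps, an optional image per step, and a single central store. The names `Step`, `Runbook`, and `KnowledgeCenter`, as well as the example procedure, are hypothetical and chosen only for illustration. A real knowledge center would more likely be a wiki or a database than an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One documented action, with an optional screenshot or diagram."""
    instruction: str
    image: Optional[str] = None


@dataclass
class Runbook:
    """An ordered list of steps for a single procedure."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def add_step(self, instruction: str, image: Optional[str] = None) -> None:
        # Steps are appended, so the documented order is preserved.
        self.steps.append(Step(instruction, image))

    def render(self) -> str:
        # Produce a numbered, plain-text version an engineer can follow.
        lines = [self.title]
        for number, step in enumerate(self.steps, start=1):
            line = f"{number}. {step.instruction}"
            if step.image:
                line += f" (see {step.image})"
            lines.append(line)
        return "\n".join(lines)


class KnowledgeCenter:
    """A single central location for all runbooks (in memory, for the sketch)."""

    def __init__(self) -> None:
        self._runbooks: Dict[str, Runbook] = {}

    def publish(self, runbook: Runbook) -> None:
        self._runbooks[runbook.title] = runbook

    def find(self, title: str) -> Optional[Runbook]:
        return self._runbooks.get(title)


# Hypothetical example procedure.
center = KnowledgeCenter()
runbook = Runbook("Restart edge proxy")
runbook.add_step("Confirm the alert in the monitoring dashboard", image="dashboard.png")
runbook.add_step("Drain traffic from the affected node")
runbook.add_step("Restart the proxy service and verify health checks")
center.publish(runbook)
print(center.find("Restart edge proxy").render())
```

Because every runbook goes through `publish`, there is one place to look a procedure up, and `render` keeps the steps in the order they were documented.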
Hybrid Runbook Automation
Once you have a runbook, you’ll notice that many of its procedures and steps can be automated. We recommend creating a hybrid model, in which automation is implemented where possible while humans monitor the process and intervene when needed, so that nothing is overlooked. This not only reduces the chance of human error, but also reduces the workload and stress on the NOC team.
Creating a runbook for NOC engineers where they can document, centralize, and access critical knowledge is essential for effective NOC operations. NOC engineers will become more proficient at their job, because a runbook gives them a consistent reference point for their work, while your organization will no longer depend on any single team member for much-needed knowledge.
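As a rough illustration of the hybrid model, the Python sketch below runs a runbook's steps in order. Steps that have an automated action are executed by code, steps without one are handed to an engineer, and a failed automated step is escalated to an engineer instead of being silently skipped. The names `HybridStep` and `run_hybrid`, the `confirm` callback, and the example steps are all assumptions made for this sketch, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HybridStep:
    description: str
    # A callable returning True on success; None means a human performs the step.
    action: Optional[Callable[[], bool]] = None


def run_hybrid(steps: List[HybridStep], confirm: Callable[[str], bool]) -> List[str]:
    """Run steps in order and return a log of what happened.

    `confirm` stands in for the engineer: it receives a prompt and returns
    True once the step has been handled, or False to stop the procedure.
    """
    log = []
    for number, step in enumerate(steps, start=1):
        if step.action is None:
            mode, ok = "manual", confirm(step.description)
        else:
            mode, ok = "auto", step.action()
            if not ok:
                # Automation failed: hand the step over to a human.
                mode, ok = "escalated", confirm(f"Automation failed: {step.description}")
        log.append(f"{number} [{mode}] {step.description}: {'ok' if ok else 'FAILED'}")
        if not ok:
            break  # Later steps depend on this one, so stop here.
    return log


# Hypothetical procedure: two automated steps (one of which fails) and one manual step.
steps = [
    HybridStep("Check service health endpoint", action=lambda: True),
    HybridStep("Clear the cache", action=lambda: False),
    HybridStep("Notify the customer"),
]
log = run_hybrid(steps, confirm=lambda prompt: True)
print("\n".join(log))
```

In practice, `confirm` would prompt the on-call engineer through a ticketing or chat system. It is a plain callback here so the flow can be exercised without any user interaction.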