When an incident occurs and has been resolved, it’s still important to go back and understand why it happened, both to repair the underlying errors and keep them from recurring, and to improve operations. A post-incident report is called a postmortem report (yes, morbid). It describes the incident, the effects it had, how the fallout was handled, and how to repair the errors. Put simply, it analyzes the incident to find out what happened, just like an actual postmortem inspects a corpse to discover the cause of death.
A postmortem allows the team to see what went wrong and how to repair and improve the project. It not only makes it easier to locate and manage errors, but also allows the team to contribute to overall improvement of the project and minimize the risk of the same incident occurring, or causing the same level of damage, in the future.
What Does a Postmortem Look Like?
The visual design of a report varies based on the system being used, but the information it contains is generally the same. A postmortem usually begins with an overview of both the general situation and the incident in particular. This is followed by a more detailed account of the incident, its possible causes, and the effect it had on users. Here, details such as the date and time, the location most impacted, and the number of users affected are specified.
This is often followed by a section analyzing the root causes of the issue and why they happened, as well as any actions that may have occurred after the beginning of the incident that impacted how it played out.
The next section is called “Remediation and Prevention” and describes the steps taken to resolve the issue and to keep it from happening again. At this point, a basic postmortem report ends, but optional additions such as a timeline or a table breaking down the effects of the incident can be extremely helpful in making the report more readable and comprehensible.
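As a rough illustration, the outline above can be sketched as a small data structure. This is only a sketch: the class name, field names, and sample values below are hypothetical and chosen to mirror the sections just described, not taken from any particular postmortem tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Postmortem:
    """Hypothetical container mirroring the sections described above."""

    title: str
    overview: str
    started_at: datetime
    location_most_impacted: str
    users_affected: int
    root_causes: list[str]
    remediation_and_prevention: list[str]
    # Optional addition: (time, description) pairs.
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text, one section after another."""
        lines = [
            self.title,
            "",
            "Overview",
            self.overview,
            "",
            "Incident Details",
            f"Date and time: {self.started_at:%Y-%m-%d %H:%M}",
            f"Location most impacted: {self.location_most_impacted}",
            f"Users affected: {self.users_affected}",
            "",
            "Root Causes",
            *[f"- {cause}" for cause in self.root_causes],
            "",
            "Remediation and Prevention",
            *[f"- {step}" for step in self.remediation_and_prevention],
        ]
        # The timeline is optional, so only emit it when entries exist.
        if self.timeline:
            lines += ["", "Timeline"]
            lines += [f"{when:%H:%M} {what}" for when, what in self.timeline]
        return "\n".join(lines)


# Made-up sample data, purely for illustration.
report = Postmortem(
    title="Checkout outage (illustrative)",
    overview="Customers could not complete purchases for about 40 minutes.",
    started_at=datetime(2024, 1, 15, 9, 0),
    location_most_impacted="EU region",
    users_affected=1200,
    root_causes=["A configuration change disabled the payment gateway."],
    remediation_and_prevention=[
        "Rolled back the configuration change.",
        "Add a validation step before configuration deploys.",
    ],
    timeline=[(datetime(2024, 1, 15, 9, 2), "Error rate alert fired")],
)
print(report.render())
```

Keeping the timeline as an optional field reflects the point above: a basic report is complete without it, and the extra section appears only when it has something to say.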
Postmortem reports are critical, which is why we’ve put together a few best practices we recommend to maximize their benefits.
Have a Meeting and Take Notes
Schedule a meeting as soon as you can after the incident to give your team time to analyze its causes and effects, as well as possible solutions and resolutions.
Have you ever wondered why so many tech giants, including Uber and Google, generally publish postmortems within 48 hours of an incident? Putting your postmortem out there while all the data is still fresh in your team’s minds makes it a lot easier to identify issues and solve them. Another perk of doing this is that it placates customers before they have a chance to stew in their frustrations.
Build Clear Timelines
The timeline should only serve to give readers a quick overview of the most important aspects of the incident. When the timeline gets bogged down with unnecessary details, it becomes difficult to read, thus defeating its purpose. Yes, the timeline may make up half the word count of a postmortem, but make sure it’s made up of the words that count.
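One way to keep a timeline lean is to record everything during the incident and then publish only the entries that matter. The sketch below assumes a hypothetical `key_event` flag that the report author sets by hand; the names and sample entries are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime
    what: str
    # Assumption: the author marks the entries worth publishing.
    key_event: bool = False


def summarize_timeline(entries: list[TimelineEntry]) -> list[str]:
    """Keep only key events, in chronological order, as short text lines."""
    kept = sorted((e for e in entries if e.key_event), key=lambda e: e.when)
    return [f"{e.when:%H:%M} {e.what}" for e in kept]


# Made-up entries, deliberately out of order.
entries = [
    TimelineEntry(datetime(2024, 1, 15, 9, 40), "Rollback completed", key_event=True),
    TimelineEntry(datetime(2024, 1, 15, 9, 10), "Engineer restarted a dashboard"),
    TimelineEntry(datetime(2024, 1, 15, 9, 2), "Error rate alert fired", key_event=True),
]

for line in summarize_timeline(entries):
    print(line)
```

The full list of entries can stay in the team’s notes; only the filtered version needs to appear in the report.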
Keep It Visual
Using visuals such as graphs makes it easier for engineers to understand what went wrong and ensure it doesn’t happen again. Graphs serve as a good, quick summary of the situation and let readers grasp even the more complex aspects of the data at a glance, so that even newer members of the team can get up to date. Don’t use graphs just for the sake of using them, though: sometimes a postmortem report can be short and simple, and there’s no need to overcomplicate things.
Keep It Blameless
Although it may seem difficult to avoid blame when a postmortem identifies the specific issues that led to the incident, blame is unproductive. A tense atmosphere puts people ill at ease and makes them less likely to put forward new ideas or novel solutions. To achieve a blameless report, refrain from using accusatory language in both the meeting and the report. When the atmosphere is non-accusatory and accepting, team members feel more comfortable brainstorming and sharing their solutions.
Build a Narrative
A postmortem report is like a story, with all of the components needed to make it a linear and understandable narrative. For example, give proper context for the incident so that readers can understand how the situation came about. Include a logical timeline of events, which can serve as the “plot” or the overall description of the incident. Last and possibly most importantly, make sure your report has a “resolution”: actions which can be taken to rectify the issues.
Connect Your Postmortem to a Runbook
A good runbook allows any team member to understand the system and manage the debugging. Operators working “in the trenches” use postmortems daily, and trying to find previous ones can waste valuable time and resources. When using a platform like XiteIt, for example, your postmortem reports are connected to your runbook, giving future project contributors and any other relevant parties access to past postmortems. They can look as far back as necessary to avoid repeat incidents and gain insights into where operations and processes need to be improved. This allows all team members to contribute to the troubleshooting process.
Use a Postmortem Template
Building a postmortem can be fairly complex. Much of its use depends on it being easy to understand with a clear and coherent structure. Although the structuring is very rigid and can be hard to get into, using a good template removes the headache of figuring it out on your own. There are plenty of templates available online, but many of them are paid.
Look for Solutions
When you’re looking to identify the causes of the incident, go in with the goal of finding a solution. After all, pointing out mistakes is not constructive if you don’t have a solution for them. On a similar note, make sure to get as much input as possible from any qualified team members. You never know where the next revolutionary answer is going to come from.
Build a Plan of Action
This point ties in to the last one: once you’ve found a solution to your issues, make sure you have a plan of action. Theoretical solutions are useless without a plan with which to execute them. Don’t let this postmortem go until your team has found solutions to the problems and knows how to get out there and solve them.
Using these tips can make your postmortem meetings more productive and your reports more illuminating. You can’t avoid errors and incidents altogether, and that’s OK. Failure is an important step on the way to success and can be turned into a learning and building experience if it’s approached correctly. Take advantage of incidents to improve your projects and foster an atmosphere of openness and creativity among your team.