An exercise to learn from any incident that impacts the product or users. In the SRE community, this exercise is called an incident retrospective.
Core delivery team, stakeholders, and support team
Postmortems help reduce the recurrence of incidents.
First, let’s define an Incident:
Prior to the session, collect as much information about the incident as you can. For example, you might send a survey to relevant teams requesting information. If teams took notes about the incident while it was in progress, collect these as well.
It also helps to draft a timeline before the session and validate it during the postmortem.
Example of a timeline:
Make sure to invite the right people to the postmortem session. For example:
Explain the goal of the postmortem (5 minutes)
You might say something like:
“In this session, we want to understand what caused the recent incident and how we fixed it. We also want to define action items to avoid it in the future and record them in our postmortem document. We are not looking for someone to blame.”
Tip: Embrace a blameless culture. The aim is to identify what happened and how to avoid it in the future, so make sure every team member has the psychological safety to speak openly.
Present the incident timeline (5 minutes)
Show the incident timeline and confirm it with the team.
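One way to make the timeline easy to confirm in the session is to keep it as a list of timestamped entries and track which ones the team has validated. The following is a minimal sketch in Python, not a prescribed format: the `TimelineEvent` fields, the helper names, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime             # when the event happened
    description: str         # what happened, stated without blame
    source: str              # where the information came from (survey, notes, ...)
    confirmed: bool = False  # True once the team has validated the entry


def build_timeline(events):
    """Return the events in chronological order."""
    return sorted(events, key=lambda e: e.at)


def unconfirmed(timeline):
    """Return the entries the team still has to validate."""
    return [e for e in timeline if not e.confirmed]


def render(timeline):
    """Format the timeline as a simple checklist, one entry per line."""
    lines = []
    for e in timeline:
        mark = "x" if e.confirmed else " "
        lines.append(f"[{mark}] {e.at:%H:%M} {e.description} ({e.source})")
    return "\n".join(lines)


# Hypothetical sample entries, for illustration only.
events = [
    TimelineEvent(datetime(2024, 1, 1, 10, 30), "Fix deployed", "team notes"),
    TimelineEvent(datetime(2024, 1, 1, 9, 5), "First alert fired", "survey", confirmed=True),
    TimelineEvent(datetime(2024, 1, 1, 9, 20), "Incident declared", "team notes"),
]
timeline = build_timeline(events)
print(render(timeline))
```

During the session, the entries returned by `unconfirmed(timeline)` are the ones to walk through with the team.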
Brainstorm postmortem topics (5 minutes)
Ask the team to write on post-its or in a digital workspace:
Cluster the topics (5 minutes)
Ask the team to cluster the post-its based on similar topics.
Discuss each cluster (20 minutes)
Allow enough time for each cluster to be discussed. If any particular clusters need additional time, consider a follow-up session about that specific topic.
Additionally, for clusters in the “what went wrong” column, ask “Why did this happen?” and “How can we reduce the impact if it happens again?”
Tip: Consider using the “five whys” method. Again, focus on root causes in the process rather than on blaming individuals.
Discuss the action items (15 minutes)
Discuss with the team what actions to take to resolve the incident and prevent it from recurring. Assign one or more owners to each action.
Agree on the postmortem owner(s) (4 minutes)
Decide as a team who the postmortem owner will be. The owner’s responsibilities are to:
Schedule the follow-up (1 minute)
Agree with the team on a follow-up postmortem session to review the actions. The incident should not be closed until the team completes all postmortem actions.
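The closing rule above can be expressed as a small check: the incident stays open while any postmortem action is unfinished. This is a minimal sketch in Python under stated assumptions. The `ActionItem` and `Postmortem` classes, their fields, and the sample data are hypothetical and not tied to any particular tracking tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    owners: List[str]   # one or more owners per action
    done: bool = False


@dataclass
class Postmortem:
    incident: str
    owner: str          # the postmortem owner agreed by the team
    actions: List[ActionItem] = field(default_factory=list)

    def open_actions(self) -> List[ActionItem]:
        """Return the actions that are not yet completed."""
        return [a for a in self.actions if not a.done]

    def can_close_incident(self) -> bool:
        """The incident may be closed only when no action is still open.

        Note: an empty action list counts as complete in this sketch.
        """
        return not self.open_actions()


# Hypothetical sample data, for illustration only.
pm = Postmortem("Hypothetical checkout outage", owner="Alex")
pm.actions.append(ActionItem("Add alert on queue depth", ["Sam"]))
pm.actions.append(ActionItem("Document rollback steps", ["Alex", "Kim"], done=True))

print(pm.can_close_incident())                     # one action is still open
print([a.description for a in pm.open_actions()])  # what to review at the follow-up
```

At the follow-up session, the list from `open_actions()` doubles as the agenda of actions to review.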
At the end of this exercise, the team will share a clear understanding of what the incident was and how it happened. The team will also have a set of actions to prevent the incident from recurring, or to reduce its impact if it does.
Examples of postmortem documents:
Book: Life of a production system incident (coming soon!)