6 STEPS FOR HACKING POST MORTEM REPORTING

Blameless post-mortems allow us to examine mistakes in a way that focuses on the situational aspects of
a failure's mechanism and the decision-making process of individuals proximate to the failure.1

WHAT IS A POST MORTEM?

The engineers at Google describe a postmortem as a written record of an incident, its impact, and the
actions taken to mitigate or resolve it, the root cause, and the follow-up actions to prevent the incident
from recurring.2

While the definition of a post mortem makes it sound like a straightforward process, the simplicity can
belie some important technical and managerial details that must be handled correctly if the exercise is
to be an effective one. Without an effective framework for post mortems, the underlying problem is
never really resolved. It's like the definition of insanity, which is described as doing the same
thing over and over again and hoping for a different outcome.

The goal of this whitepaper is to suggest the types of tools and frameworks that need to be introduced
in order for IT, Ops or ITSM to institute an effective post mortem culture that is focused on results.
Post mortems are important for DevOps and ITSM professionals because they allow these groups to see
what worked, what didn't work and how the team can get better.

TO THIS END, WE WILL LOOK AT THE FOLLOWING POINTS:

Why are post mortems necessary?

What do post mortems allow us to achieve?

How can we implement an effective post mortem?

WHY ARE POST MORTEMS NECESSARY?

Post mortems are necessary as they give us insight into why an incident happened. They allow us to
deconstruct a particular incident and see what transpired after the critical event and how that can be
improved in the future. Was the problem due to a scheduled or unscheduled incident? When the Sev1
incident occurred, was the right team notified? If the team was notified, did they actually hear the alert or
did the alert just go off as a ping on their smartphone?

1 The DevOps Handbook, 2016, p. 274
2 https://landing.google.com/sre/book/chapters/postmortem-culture.html

Additionally, post mortems are important as they are effective tools for managing the team's SLAs. Even
if your product is sold to other businesses and not to consumers, you still have SLAs on keeping the
product up and running. This is even truer if you are a cloud-based service. So, knowing that you have a
five-nines SLA, you know you cannot afford much downtime. With a post mortem, you can see
how an incident affected your SLA with your customers.
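
To make the downtime budget concrete, the arithmetic behind an availability target can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes the SLA is measured over a fixed period (a 365-day year by default) and that every minute of an incident counts fully against the budget.

```python
from typing import Sequence


def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Assumption: the SLA is measured over `period_days` of wall-clock time.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def remaining_budget_minutes(
    availability_percent: float,
    incident_minutes: Sequence[float],
    period_days: float = 365.0,
) -> float:
    """Return the budget left after subtracting the downtime of past incidents.

    A negative result means the SLA has already been breached for the period.
    """
    budget = allowed_downtime_minutes(availability_percent, period_days)
    return budget - sum(incident_minutes)
```

Under these assumptions, a five-nines (99.999%) target leaves roughly 5.26 minutes of downtime per year, so a single ten-minute Sev1 incident is enough to push the remaining budget below zero.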

Post mortems also allow you to manage MTTA (mean time to acknowledgement) and MTTR (mean time to
resolution) more specifically. These are usually the metrics that teams manage, as they are the ones
most closely tied to resolution effectiveness. Indeed, the greatest contributor to how long it takes for
an issue to be resolved is how long it takes until the issue is acknowledged.
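
As a rough sketch of how these two metrics could be derived from time-stamped incident records, consider the following Python. The record layout (the `triggered`, `acknowledged` and `resolved` keys) and the sample data are hypothetical, and the sketch assumes MTTR is measured from the moment the alert was triggered; some teams measure it from acknowledgement instead.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(intervals):
    """Average a collection of timedelta values and return the result in minutes."""
    return mean([delta.total_seconds() for delta in intervals]) / 60


def mtta_and_mttr(incidents):
    """Return (MTTA, MTTR) in minutes for a list of incident records.

    Assumption: each record is a dict holding three datetimes under the
    hypothetical keys "triggered", "acknowledged" and "resolved".
    """
    mtta = mean_minutes([i["acknowledged"] - i["triggered"] for i in incidents])
    mttr = mean_minutes([i["resolved"] - i["triggered"] for i in incidents])
    return mtta, mttr


# Hypothetical sample data: two incidents on the same day.
SAMPLE_INCIDENTS = [
    {
        "triggered": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 5),
        "resolved": datetime(2024, 1, 1, 10, 45),
    },
    {
        "triggered": datetime(2024, 1, 1, 14, 0),
        "acknowledged": datetime(2024, 1, 1, 14, 15),
        "resolved": datetime(2024, 1, 1, 14, 35),
    },
]
```

For the sample data, `mtta_and_mttr(SAMPLE_INCIDENTS)` yields an MTTA of 10 minutes and an MTTR of 40 minutes.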

WHAT DO POST MORTEMS ACHIEVE?

Above all, you should make sure your discussions lead to actual change.3

Post mortems, when carried out correctly, advance the team's progress and IT knowledge. They are
designed to break down sacred cows and reveal points of truth that might not have been previously
recognized. For example, when a service interruption was identified by the monitoring tool, was the
incident sent to only one team member, who then had to track down a number of other team members,
delaying the team's response?

Alternatively, were all team members alerted when the incident occurred such that no one knew who
was going to respond to the alert? This result is equally problematic as there is always the feeling that
some other team member can take care of the issue.

In these two scenarios, as well as many others, the problems come to light in the course of an effective
post mortem.

HERE ARE THE 6 STEPS TO HACK POST MORTEM REPORTING

Learning from mistakes is something that's often quite difficult to do. Without a framework to help you
do it consistently, it can be haphazard and important details can be overlooked or forgotten.4

Post mortems are both necessary and important to effective incident management as they bring to the
surface how effective your team is at managing critical events. Effective post mortems are not meant to
be blame games or cheap talk. Instead, they are meant as effective management tools to improve the
effectiveness of the team.

HACK 1: BRING IN KEY TECHNOLOGIES AND TAKE ADVANTAGE OF THEIR AVAILABILITY

As post mortems have become an important part of IT and DevOps culture, it is important to consider
beforehand what technologies team members will need to enable effective post mortems. Unlike a
discussion of monitoring tools, this conversation looks at which alerting tools, chat tools, ticketing
tools and reporting tools the team will need to remain in contact during an incident.

3 https://zapier.com/blog/project-retrospective-postmortem/
4 https://blog.pusher.com/dont-repeat-your-mistakes-conducting-post-mortems/

The first three of these instruments create time-stamped records, which are critical for post mortems
to run effectively. ALERTING TOOLS will indicate when an incident reached the engineering team and who
responded to it. Was the incident acknowledged, ignored, forwarded or escalated? Also, how long did it
take until the incident was acknowledged? A robust alert management platform (like OnPage) captures
all of this information.

CHAT TOOLS like HipChat or Slack are where engineers coordinate and where work gets done. Importantly,
messages in these tools are time stamped, which allows concerned parties to see what happened and when.

TICKETING TOOLS like Jira or ServiceNow are also time stamped, and they become the record of the
incident: when it occurred on the server, as well as any back and forth that took place during the
incident's resolution.

A REPORTING TOOL that enables managers and stakeholders to review the workloads of various teams will
provide insight into why teams might be less effective than they ideally should be. This could be
because a particular team is overloaded with alerts and as a result cannot answer all the alerts it is
receiving. Alternatively, a particular faulty piece of infrastructure could be producing an outsized
number of alerts that keeps the team from responding to other issues.
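
The kind of aggregate view described above can be approximated with a simple count over exported alert records. The sketch below is illustrative only: the `team` and `source` field names, the 50% threshold and the sample data are assumptions, not the export format of any particular reporting tool.

```python
from collections import Counter


def count_alerts_by(alerts, key):
    """Count alerts grouped by a field such as "team" or "source"."""
    return Counter(alert[key] for alert in alerts)


def dominant_values(alerts, key, share_threshold=0.5):
    """Return the values of `key` that account for at least `share_threshold` of all alerts.

    Assumption: a single team or source producing half of all alerts is worth
    a closer look; the threshold is arbitrary and should be tuned per team.
    """
    counts = count_alerts_by(alerts, key)
    total = sum(counts.values())
    if total == 0:
        return []
    return [value for value, n in counts.most_common() if n / total >= share_threshold]


# Hypothetical sample data: six alerts from one reporting period.
SAMPLE_ALERTS = [
    {"team": "database", "source": "db-01"},
    {"team": "database", "source": "db-01"},
    {"team": "database", "source": "db-01"},
    {"team": "database", "source": "db-02"},
    {"team": "web", "source": "web-01"},
    {"team": "web", "source": "cache-01"},
]
```

Grouping the sample by `team` shows the database team receiving four of the six alerts, while grouping by `source` points to `db-01` as the single component behind half of the total volume.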

HACK 2: ENABLE POST MORTEMS AS SOON AFTER THE EVENT AS POSSIBLE

Memories fade quickly, so it is best to hold the post mortem as soon after the event as possible. Team
leaders need to be rigorous about recording details and sharing information.

HACK 3: BRING IN THE INSIGHTS OF THE TEAM

Make sure the relevant stakeholders and participants are at the post mortem meeting. These
stakeholders include people who might have contributed to the problem. In addition, you will want to
include any people who responded to the problem as well as people who diagnosed the problem. Don't
forget to invite a representative of the group affected by the problem.

By bringing in this robust group, you will ensure that you have the relevant parties at the table who can
identify the relevant issues and help bring resolution to the issues.

HACK 4: CREATE A TIMELINE

If you don't have things written down, it can be hard to follow up on action items.5

5 https://zapier.com/blog/project-retrospective-postmortem/

The first order of business at the post mortem meeting should be to review the timeline of events.
Because you invested in an incident alert management system, a communications platform, a ticket
management platform and a reporting platform, you have all the relevant data you need to view the order
in which the events unfolded. The first three tools allow you to see what happened step by step. The
reporting capabilities add aggregate data that provides context to the timeline.
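
One way to picture the step-by-step view is to merge the time-stamped records from the alerting, chat and ticketing tools into a single ordered list. The sketch below assumes each tool's export has already been converted into a hypothetical `Event` record; the class, field names and output format are illustrative and not tied to any specific product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One time-stamped entry exported from a tool (hypothetical layout)."""
    timestamp: datetime
    source: str  # for example "alerting", "chat" or "ticketing"
    detail: str


def build_timeline(*event_lists):
    """Merge any number of event lists into one list ordered by timestamp."""
    merged = [event for events in event_lists for event in events]
    return sorted(merged, key=lambda event: event.timestamp)


def format_timeline(events):
    """Render the timeline as plain text, one event per line."""
    return "\n".join(
        f"{event.timestamp:%H:%M} [{event.source}] {event.detail}" for event in events
    )
```

Because the merge relies only on timestamps, it is worth confirming that all three tools record time in the same time zone before combining their exports.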

HACK 5: CREATE A FINAL DIGITAL RECORD

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they
are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company
avoid the same error in the future.6

It is important to share this information and make it easily available, so publish post mortems as
widely as possible. Google Drive is a good place to post this information. You need to educate other
members of the team as to why the event occurred and commit to changes that will prevent the event
from happening again in the future.
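
A consistent record structure makes post mortems easier to publish and compare. The sketch below models the elements named in the definition at the start of this whitepaper (the incident, its impact, the actions taken, the root cause and the follow-up actions) as a small Python class. The class, its field names and the plain-text layout are hypothetical and meant only as a starting point.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortem:
    """A minimal post mortem record (hypothetical field names)."""
    title: str
    impact: str
    root_cause: str
    actions_taken: List[str] = field(default_factory=list)
    follow_up_actions: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the record as plain text suitable for a shared document."""
        lines = [
            f"POST MORTEM: {self.title}",
            f"Impact: {self.impact}",
            f"Root cause: {self.root_cause}",
            "Actions taken:",
            *[f"  - {action}" for action in self.actions_taken],
            "Follow-up actions:",
            *[f"  - {action}" for action in self.follow_up_actions],
        ]
        return "\n".join(lines)
```

The text produced by `to_text()` can then be pasted into whatever shared location the team uses, so every post mortem follows the same outline.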

HACK 6: ENABLE POST MORTEMS FOR SUCCESSFUL EVENTS AS WELL

While this whitepaper has primarily focused on creating post mortems after critical Sev1 or Sev2
incidents, it is equally important to create post mortems on successful events as well. Whereas post
mortems are used to identify things that went wrong and why, post mortems should also be used to
identify, in the case of successful events, what could have been done to make the outcome even more
desirable. Additionally, post mortems for successful events can highlight what went right and why.
Successful projects, on the other hand, are still ripe with errors, inefficiencies, near misses etc. but have a
positive context, which helps people relax when addressing issues. They are also likely to have best
practices and novel ideas, which is as valuable.7

CONCLUSION

Effective post mortems are equal parts technology and management. The technology your team brings
on needs to be able to keep track of and log the events that took place from the time the event began
until the incident was resolved. On the management side, processes need to be in place for everything
from acknowledging the event to setting up the meeting to bringing in the relevant stakeholders. By
combining these components, post mortems stand a chance of being successful.

In the end though, effective post mortems are a process and take time to perfect. Practice will dictate
which team members are most effective at providing perspective and insight. Time will also allow teams
to determine which technologies take the most time to manage when problems arise.
The important point though is to start practicing post mortems as they are key to continued growth of
the company and its leaders.

6 https://codeascraft.com/2012/05/22/blameless-postmortems/
7 https://www.linkedin.com/pulse/20141001093119-69047-5-tips-on-running-effective-postmortems

ABOUT ONPAGE

OnPage is a cloud-based, industry-leading smartphone application for high-priority, real enterprise messaging.
OnPage provides critical alerts to Managed Service Providers based on notifications from RMM or PSA systems for
faster incident resolution.

Using OnPage you get instant visibility and feedback on alerts. As part of your IT service management, you can
track alert delivery, ticket status, and responses.

As a result, you will improve MTTR and better manage your clients' ecosystems by decreasing service interruptions.
As an organization, you will improve responsiveness to SLAs and lower your own and your clients' costs.

TO LEARN MORE, VISIT OUR WEBSITE OR CALL: ONPAGE.COM/CONTACT-US 781-916-0040

Visit iTunes or Google Play from your smart phone or tablet to download the OnPage app.
