A Guide to Incident Severity Levels

Maintaining IT infrastructure is a constant challenge for system administrators, site reliability engineers (SREs), and the developers and technicians who support them. An endless variety of factors can degrade system performance, cause outages, or harm the customer experience.

On top of that, not all incidents are equal. A system outage affecting 10 percent of your users is different from an outage impacting 90 percent, or worse, 100 percent. One way to facilitate an efficient response is to use a transparent system of incident severity levels that teams can reference easily. With a pre-defined severity scale and reference codes, relevant team members can quickly interpret the issue. This helps minimize incident response time and makes it easier to coordinate remediation across the response team.

Putting this severity level system in place ahead of time helps teams quickly understand how urgent a situation is and enables effective prioritization. To start, let’s explore how to define your severity levels and examine some popular systems. Then, we’ll discuss how your organization can put a strategy in place that works best for you so your team feels confident to react quickly and appropriately when incidents strike.

Defining Your Severity Levels

Your team needs to find and apply a common language to communicate efficiently. Whatever you pick, you should ensure your whole team understands the chosen language and the reasoning behind it, so that everyone can grasp each incident at a high level.

Classifying your incident’s severity level helps ensure a consistent response. Severity levels also help avoid confusion about how to proceed. For instance, severe incidents may require an all-hands-on-deck response and contacting team members on a holiday.

You can employ various strategies to come up with a common language. What works best for one organization may not be ideal for another. The right language for your incident response team depends on factors such as your organization’s size, the nature and frequency of incidents, and your team’s composition.

One strategy is to reference the severity levels from another team or company. You should pick a similar source—for example, a company in the same industry. Even if you may not benefit from using another company’s exact system, their levels can still form the basis for yours.

You can also seek examples from other industries. For example, you can borrow terminology from the aviation industry and distinguish between fatal, major, and minor incidents. However, ensure everyone on your team will understand these terms.

Brainstorming helps, too. Your team may identify various severity levels more easily through a spider diagram or mind map by looking at the resulting clusters. The advantage of such a system is that it arises from a collective effort, and it already implies a decision tree to classify incidents.

Most importantly, ensure everyone understands the wording. You can run some hypothetical incident response scenarios by team members as a stress test. If their severity levels align, you have successfully established the common language. If not, the severity levels need further refinement.
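
One way to run the stress test described above is to collect each team member's rating for every hypothetical scenario and flag the scenarios where the ratings differ. The sketch below is only an illustration: the function name `find_disagreements`, the scenario descriptions, and the team member names are all assumptions made for this example.

```python
from collections import Counter


def find_disagreements(ratings):
    """Return the scenarios where team members did not all pick the same level.

    `ratings` maps a scenario description to a dict of {team member: level}.
    The result maps each contested scenario to a tally of the levels chosen.
    """
    disagreements = {}
    for scenario, votes in ratings.items():
        counts = Counter(votes.values())
        if len(counts) > 1:  # more than one distinct level was chosen
            disagreements[scenario] = dict(counts)
    return disagreements


# Hypothetical results from a tabletop exercise (illustrative data only).
ratings = {
    "checkout returns errors for all users": {
        "ana": "SEV 1", "ben": "SEV 1", "chi": "SEV 1",
    },
    "report export shows misaligned columns": {
        "ana": "SEV 5", "ben": "SEV 4", "chi": "SEV 5",
    },
}

for scenario, counts in find_disagreements(ratings).items():
    print(f"Needs refinement: {scenario!r} -> {counts}")
```

Scenarios that come back from this check are the ones whose definitions probably need another round of discussion.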

Ranking Severity

Typically, organizations adopt three or five severity levels. These usually follow a pattern like the ones below. A five-level system typically looks something like this:

  • SEV 1: A critical problem affecting a significant number of users in a production environment. The issue impacts essential services, or the service is inaccessible, degrading customer experience.
  • SEV 2: A severe problem affecting a limited number of users in a production environment, degrading customer experience.
  • SEV 3: A moderate incident that causes errors, excessive load, or minor problems for customers in a production environment.
  • SEV 4: A relatively minor problem that affects customer experience, but that doesn’t substantially degrade service functionality.
  • SEV 5: A low-level problem that causes minor errors—such as formatting or display problems—that don’t degrade usability.
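
The five levels above can be encoded as an ordered enumeration with a small classification function. This is a minimal sketch rather than a definitive implementation: the `Severity` enum, the `classify` function, its parameters, and both threshold constants are hypothetical. The definitions above give no numeric cutoffs for a "significant" or "limited" number of users, so the values here are placeholders your team would replace.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five-level scale; a lower number means a more severe incident."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Assumed cutoffs (not from any standard): share of users affected.
SIGNIFICANT_SHARE = 0.50
LIMITED_SHARE = 0.05


def classify(essential_service_down: bool,
             affected_share: float,
             degrades_functionality: bool,
             affects_experience: bool) -> Severity:
    """Walk a simple decision tree from most to least severe."""
    if essential_service_down or (
        degrades_functionality and affected_share >= SIGNIFICANT_SHARE
    ):
        return Severity.SEV1
    if degrades_functionality and affected_share >= LIMITED_SHARE:
        return Severity.SEV2
    if degrades_functionality:
        return Severity.SEV3
    if affects_experience:
        return Severity.SEV4
    return Severity.SEV5


# A display glitch that leaves usability intact falls to the lowest level.
print(classify(False, 0.0, False, False).name)
```

Checking the most severe conditions first means an incident that matches several descriptions always receives the highest applicable level.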

A three-level system could look like this:

  • Priority 1 (P1): A significant incident that has a broad impact. You should repair the problem as soon as possible to stop losing money, keep customers happy, and maintain your company’s good reputation.
  • Priority 2 (P2): A medium-level incident that may not directly cause lost revenue, but that may escalate if you don’t act swiftly.
  • Priority 3 (P3): A low-level incident that has almost no chance of reducing revenue. Customer experience may be degraded, but not enough to make them switch to a competitor.

In both ranking systems, we see that the impact on the business mainly determines the ranking, though they use different language—severity versus priority—to label levels. The critical question is: When an incident in that category appears, how much does it impact the company’s customer base, revenue, reputation, and other considerations?

Consider whether the repercussions are limited to the incident itself, or whether the impact will likely persist after the incident is resolved. For example, will customers lose trust in the company and take their business elsewhere? Severity often boils down to whether the impact is permanent and irreversible or largely confined to the duration of the incident.

Getting Into the Details

The SEV and Priority structures rank more impactful incidents with a lower number. This order is pure convention, and your team may reverse it. Or, you may even want to start at zero.
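
The numbering convention matters as soon as levels are compared or sorted in tooling. The sketch below assumes the "lower number is more impactful" convention and uses hypothetical names (`Priority`, `most_urgent`, `triage_order`) and made-up incidents. A team that reverses the convention would swap `min` for `max` and sort in descending order.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Three-level scale; P1 is the most impactful."""
    P1 = 1
    P2 = 2
    P3 = 3


def most_urgent(levels):
    """With lower-is-worse numbering, the minimum is the most urgent level."""
    return min(levels)


def triage_order(incidents):
    """Sort (description, priority) pairs so the most urgent come first."""
    return sorted(incidents, key=lambda item: item[1])


# Illustrative open incidents (hypothetical data).
open_incidents = [
    ("stale cache on help pages", Priority.P3),
    ("payment failures", Priority.P1),
    ("slow search results", Priority.P2),
]

print("Handle first:", most_urgent(p for _, p in open_incidents).name)
for description, priority in triage_order(open_incidents):
    print(priority.name, description)
```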

Using a prefix like “SEV” can help communicate the incident without much explanation. If somebody sends an email like, “We’ve encountered a SEV 1 incident,” it directly communicates the background. In contrast, writing, “We’ve encountered a 1” leads to confusion, while “we’ve encountered an incident in our IT infrastructure categorized as level 1” is verbose and even a bit foggy.
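
A consistent prefix also makes labels easy for tooling to produce and recognize. The following sketch assumes a five-level "SEV" scheme; the helper names `format_label` and `extract_level` and the regular expression are illustrative assumptions, not part of any particular product.

```python
import re

# Matches labels such as "SEV 1" or "SEV1" (case-insensitive), levels 1 to 5.
LABEL_PATTERN = re.compile(r"\bSEV\s?([1-5])\b", re.IGNORECASE)


def format_label(level: int) -> str:
    """Build the canonical label for a numeric level."""
    if not 1 <= level <= 5:
        raise ValueError(f"unknown severity level: {level}")
    return f"SEV {level}"


def extract_level(message: str):
    """Return the first severity level mentioned in a message, or None."""
    match = LABEL_PATTERN.search(message)
    return int(match.group(1)) if match else None


print(extract_level("We've encountered a SEV 1 incident"))
print(extract_level("We've encountered a 1"))
```

The bare "1" in the second message cannot be recognized, which mirrors the confusion a human reader would face.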

While some systems are simply numbered (“0” or “1”) or follow a naming scheme (“SEV 1” or “P1”), other systems rely on words such as “critical” or “high.” The specific terminology isn’t so important, as long as everyone within the company uses the same terms. It’s much more important that everyone shares the same definition of which incidents each level covers.

Individual teams may be tempted to define and implement these severity levels according to their own needs. However, all stakeholders should come together to agree on the terms’ definitions. Then, when an incident occurs, everyone from the responders to management understands what is at stake and how to respond.

Conclusion

Your first step toward ensuring effective incident responses is to properly define the severity levels and get everyone on the same page. The highest priority is internal consistency. Listen to other experts and check out best practices, but be practical. Create a set of definitions that works best for your team.

Along with using well-defined severity levels, automation can work in the background to improve the efficiency of your incident response. From proactively managing issues to calling all hands on deck for a SEV 1, service reliability platforms like xMatters help teams automate incident response to resolve issues as quickly as possible. Explore xMatters today to see how it can help maintain your infrastructure’s reliability.
