How to Establish a Good SLA? Let’s Think Through It with Real Examples!
Introduction
This article explains what SLAs are and shares examples of good and bad SLAs from real projects. It is written with junior engineers and ITSM practitioners in mind. Because the author has most often been on the vendor/contractor side, the writing may lean toward that perspective — but the goal is to provide content that is useful to clients and procurers as well.
What Is an SLA?
An SLA (Service Level Agreement) is a commitment between contracting parties that defines objective, measurable quality standards for the IT services being provided. It is a contract in which the service provider guarantees it will deliver services to a specified standard, and it is used to concretely understand and manage service quality after deployment.
SLAs typically cover the scope of outsourced work, the division of responsibilities, service levels, how measured results are handled (for example, what happens when a target is missed), and operating rules. They are generally prepared as an attachment to the main contract. In operations and maintenance in particular, they serve as a framework for numerically measuring and evaluating incident response and operational quality, and for continuously maintaining and improving that quality. It is considered best practice to present the SLA as early as the procurement stage.
What You Must Consider When Designing an SLA
Unlike a KPI, an SLA is a contractual obligation to your client. If an SLA target is missed, you may be required to pay damages under the terms of the contract. That said, if the targets are set too loosely, the system will not deliver its full value. The aim is a threshold that is strict enough to protect the value the service delivers, yet realistic enough that you do not find yourself regretting, after operations begin, that you agreed to something unachievable.
The following sections walk through some example SLA clauses. Treat them as if a client has just presented them to you, and consider whether each one is acceptable to sign.
Case Studies with Commentary
Example 1: Monthly availability shall be 99.0%.
At 99.0% monthly availability, the permitted downtime is 7.2 hours per month, assuming a 30-day month of round-the-clock operation (720 hours × 1%).
Does that sound short? Or long?
First, you need to understand the characteristics of the system in question.
If the system is mission-critical, running 24 hours a day, 365 days a year, with significant societal impact if it goes down, then this SLA is actually quite lenient. On the other hand, if the system only operates on weekdays from 9 AM to 6 PM and is completely unused on holidays and weekends, there is plenty of time for maintenance, so 99.0% is less of a concern as long as the system runs stably during business hours.
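The arithmetic behind the downtime budget can be sketched in a few lines. This is a minimal illustration, not part of any real SLA: the function name `permitted_downtime_hours`, the 30-day month, and the 22 business days are all assumptions made for the example. It also assumes that, for the weekday-only system, availability is measured only against business hours, which the clause as written does not actually say.

```python
def permitted_downtime_hours(availability_pct: float, service_hours: float) -> float:
    """Downtime budget = measured service hours x (1 - availability)."""
    return service_hours * (1 - availability_pct / 100)


# Round-the-clock operation over an assumed 30-day month (720 hours).
always_on = permitted_downtime_hours(99.0, 24 * 30)

# Weekdays 9 AM to 6 PM only, assuming 22 business days (198 hours).
business_only = permitted_downtime_hours(99.0, 9 * 22)

print(f"24-hour operation:   {always_on:.2f} h")      # 7.20 h
print(f"business hours only: {business_only:.2f} h")  # 1.98 h
```

Under these assumptions the same 99.0% figure leaves about 7.2 hours in one case and about 2 hours in the other, so it is worth confirming which hours count toward the measurement before signing.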
You also need to factor in your team’s capacity and structure.
If you have access to unlimited skilled operations and maintenance staff, you may be able to resolve incidents quickly. In reality, though, constraints are common. Even for a system that only runs 9-to-6 on weekdays, if you have only two ITSM staff with uneven skill levels, hitting this SLA can become difficult.
For example, imagine a system upgrade is completed overnight, and the system comes back online in the morning. After a while, an incident occurs. If the on-duty daytime ITSM engineer lacks deep experience, incident resolution will be slow. Even if the more skilled engineer — who has already left for the day — rushes back in to help, the incident could drag on. Humans cannot work indefinitely without rest, and in the worst case, both engineers could burn out at the same time.
You need to assess response plans, skill mapping, pre-prepared incident scenarios, and contingency plans honestly and determine whether meeting the SLA is realistically achievable.
Example 2: Incident reports shall be submitted within 20 minutes of the incident occurring.
How does this one look?
Real-world SLAs would normally contain more detail, but as written, this clause is highly ambiguous and difficult to achieve. Here is why.
No definition of the receiving party’s availability. During business hours when the client is in the office, reporting is straightforward. But if an incident occurs in the middle of the night, you need to confirm whether the client has an after-hours team in place to receive the report.
No clear definition of “incident occurrence.” Is the clock measured from when the incident actually started, or from when it was detected? The answer significantly affects the available lead time.
The 20-minute window needs careful examination. What level of detail is the client expecting in that initial report? If it is simply an error message and timestamp, 20 minutes is achievable. But if the report must include a preliminary root cause or an analysis of the underlying issue, 20 minutes is simply not enough, because that kind of content requires review by a subject matter expert or senior engineer before it goes out.
The reporting method also matters. Phone, email, fax — all are options, but none is defined. A phone call gets the message across quickly, but what if the recipient is in a meeting? Email is reliable, but without a designated address and template, your message may get buried among 100+ daily emails. And fax? Some clients may not even know how to use one.
As with Example 1, you need to consider the team’s capacity, stamina, hours of coverage, and communication method. A useful mental exercise: if an actual incident happened right now, what would you do step by step? Walk through that scenario when evaluating whether an SLA is achievable.
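To see how much the definition of "incident occurrence" matters, the sketch below compares the time left to report under the two readings. The timeline is entirely hypothetical (a fault at 2:00 AM that monitoring detects 12 minutes later), and `report_deadline` is an illustrative helper, not a function from any real tool.

```python
from datetime import datetime, timedelta

REPORT_WINDOW = timedelta(minutes=20)  # from the example clause


def report_deadline(clock_start: datetime) -> datetime:
    """Deadline for the incident report, measured from the agreed clock start."""
    return clock_start + REPORT_WINDOW


# Hypothetical timeline for illustration only.
occurred = datetime(2024, 1, 15, 2, 0)   # fault actually begins
detected = datetime(2024, 1, 15, 2, 12)  # monitoring alert fires

# Time remaining at the moment of detection under each reading.
left_if_from_occurrence = report_deadline(occurred) - detected
left_if_from_detection = report_deadline(detected) - detected

print(left_if_from_occurrence)  # 0:08:00
print(left_if_from_detection)   # 0:20:00
```

In this scenario, measuring from occurrence leaves only 8 minutes to triage, write, and send the report, while measuring from detection leaves the full 20.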
Example 3: Within 72 hours of a vendor releasing patch information, a report on whether to apply the patch shall be submitted. The patch shall be applied to the system within one week of its release.
What do you make of this one?
The author’s gut reaction is that this is not impossible, but it heavily depends on the system’s characteristics and the strength of the supporting team.
First, let us distinguish between two types of patches: security patches and availability patches. Security patches address vulnerabilities; without them, the system has exploitable security gaps that can prevent service delivery. Every system needs to respond to security patches urgently.
Availability patches are a different matter. In April 2023, ANA’s core passenger system experienced an outage in which tickets could not be issued. In simple terms, the root cause was that a bug-fix patch had not been applied. This is a reminder that leaving patches unapplied can eventually result in a known bug being triggered.
That said, there is a common argument that a stable, running system does not need to be patched unnecessarily. Any patch should be tested in a test or staging environment first, and only promoted to production after confirming no issues. Depending on your system, assessing an availability patch and verifying it in a staging environment can require considerable time. Since patches are released without warning, there may be scheduling conflicts if another team is already using the staging environment. And applying the patch to production during business hours requires client coordination; for a system that runs 24 hours a day, 365 days a year, the maintenance window may be extremely narrow.
Discovering and tracking availability patches is also labor-intensive. Some vendors, under premium support contracts, will proactively notify you of relevant patches and even assess whether they are safe to apply given your configuration — but such contracts come at high cost. For open-source components, you may need to search for patches yourself. Managing all of this properly requires a well-maintained SBOM (Software Bill of Materials) and clear operational procedures.
In summary, the SLA in this example is achievable — but only with the right team structure, sufficient budget, a clear patching policy agreed upon with the client, and a thorough understanding of the system’s characteristics.
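A quick way to test this clause against reality is to lay out the two deadlines for a concrete release date. The sketch below assumes both windows run on calendar time (the clause does not say otherwise) and uses a hypothetical patch released on a Friday morning; `patch_deadlines` and `weekend_hours` are illustrative helpers.

```python
from datetime import datetime, timedelta


def patch_deadlines(released_at: datetime) -> dict:
    """Deadlines from the example clause, assuming calendar time."""
    return {
        "report_due": released_at + timedelta(hours=72),
        "apply_due": released_at + timedelta(weeks=1),
    }


def weekend_hours(start: datetime, end: datetime) -> int:
    """Count the whole hours between start and end that begin on Sat/Sun."""
    hours = 0
    t = start
    while t < end:
        if t.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            hours += 1
        t += timedelta(hours=1)
    return hours


released = datetime(2024, 3, 1, 10, 0)  # hypothetical release, a Friday
deadlines = patch_deadlines(released)

lost_to_weekend = weekend_hours(released, deadlines["report_due"])
left_for_rollout = deadlines["apply_due"] - deadlines["report_due"]

print(deadlines["report_due"])  # 2024-03-04 10:00:00
print(lost_to_weekend)          # 48
print(left_for_rollout)         # 4 days, 0:00:00
```

With a Friday release, 48 of the 72 hours fall on the weekend, and once the report is in, four days remain for staging verification, client coordination, and the production rollout.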
Example 4: When an inquiry is received at the service desk, the response shall be completed within one business day.
The final example. Any guesses?
After the scrutiny applied to the previous examples, you can probably already sense the ambiguities — and there are several.
The inquiry intake method is not defined. Agreeing on a channel (email, phone, ticket system, etc.) is a prerequisite.
“Response completed” needs a clear definition. Does it mean the issue has been fully resolved? Or does it mean a ticket acknowledging receipt has been issued? These two definitions carry very different difficulty levels.
For instance, a user reports they cannot log in. The service desk escalates the issue internally to ITSM, who must investigate and resolve it — all within one business day. If the problem is caused by a global outage at a third-party cloud provider, resolution may be completely out of your hands. (In such cases, cloud outage handling is often defined under a separate SLA.)
The time frame is also ambiguous. Does “within one business day” mean within 24 hours of receiving the inquiry, or by the close of business on the day it was received? Consider this: if the inquiry arrives at 5:55 PM and business hours end at 6:00 PM, same-day resolution without overtime is simply not possible.
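The gap between the two readings of "within one business day" can be made concrete with the 5:55 PM example. This is a simplified sketch: the date is hypothetical, the helper names are illustrative, and the 24-hour reading ignores weekends and holidays, which a real clause would also need to address.

```python
from datetime import datetime, time, timedelta

BUSINESS_END = time(18, 0)  # 6:00 PM, as in the example above


def deadline_24h(received: datetime) -> datetime:
    """Reading 1: within 24 hours of receipt (weekends ignored here)."""
    return received + timedelta(hours=24)


def deadline_same_day(received: datetime) -> datetime:
    """Reading 2: by close of business on the day of receipt."""
    return datetime.combine(received.date(), BUSINESS_END)


received = datetime(2024, 1, 15, 17, 55)  # hypothetical Monday, 5:55 PM

time_under_24h = deadline_24h(received) - received
time_under_same_day = deadline_same_day(received) - received

print(time_under_24h)       # 1 day, 0:00:00
print(time_under_same_day)  # 0:05:00
```

The same inquiry gets a full day under one reading and five minutes under the other, which is why the definition needs to be agreed in writing.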
Closing Thoughts
An SLA is a critical contract that defines IT service quality through objective, measurable criteria agreed upon with the client. This article used concrete examples — covering availability, incident reporting windows, patch application, and service desk response — to illustrate why designing a realistically achievable SLA matters.
An SLA that is too strict or too loose will undermine the ability to deliver appropriate value. The key to stable operations and high client satisfaction is setting service levels that genuinely reflect the system’s characteristics, the operational team’s capacity, and the available resources.
A well-crafted SLA is what makes IT service management sustainable for both the client and the provider.
References
- Osaka City SLA Guidelines for Information System Procurement — Toward Appropriate Procurement of Information Systems https://www.city.osaka.lg.jp/ictsenryakushitsu/cmsfiles/contents/0000671/671504/SLA_Guideline.pdf