Uptime Service Level Agreements

Your contract with Dataweavers specifies an uptime SLA, typically 99.9%, 99.95% or 99.99%.

By definition, an uptime SLA allows a certain number of minutes of unplanned outage per month. See https://uptime.is/ for the calculation of the outage allowance across days, months and years.

For example, a 99.9% (three nines) SLA would allow downtime of:

Daily: 1m 26s
Weekly: 10m 4s
Monthly: 43m 49s
Quarterly: 2h 11m 29s
Yearly: 8h 45m 56s
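
The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration and not the calculator used by uptime.is: it assumes a mean year of 365.2425 days (which happens to reproduce the figures above) and truncates to whole seconds. The function names are made up for this example.

```python
# Assumption: a mean year of 365.2425 days, which reproduces the figures
# listed above. A different calendar convention shifts the results slightly.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.2425 * SECONDS_PER_DAY

PERIOD_SECONDS = {
    "Daily": SECONDS_PER_DAY,
    "Weekly": 7 * SECONDS_PER_DAY,
    "Monthly": SECONDS_PER_YEAR / 12,
    "Quarterly": SECONDS_PER_YEAR / 4,
    "Yearly": SECONDS_PER_YEAR,
}


def allowed_downtime_seconds(sla_percent, period_seconds):
    """Seconds of downtime permitted in a period at the given uptime SLA."""
    return period_seconds * (100 - sla_percent) / 100


def format_duration(seconds):
    """Render a duration as e.g. '2h 11m 29s', truncating to whole seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return " ".join(parts)


def downtime_table(sla_percent):
    """Map each period name to its formatted downtime allowance."""
    return {
        name: format_duration(allowed_downtime_seconds(sla_percent, seconds))
        for name, seconds in PERIOD_SECONDS.items()
    }


if __name__ == "__main__":
    for period, allowance in downtime_table(99.9).items():
        print(f"{period}: {allowance}")
```

Calling downtime_table with 99.95 or 99.99 gives the allowance for the other SLA levels under the same assumptions.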

Tracking outages

An outage, by general definition, is when the website fails to return a 200 response containing the HTML that you expect. Our tracking systems test from multiple locations and networks around the globe. They are also designed to tolerate transient issues such as bad DNS resolution or propagation, general internet congestion, weak mobile connections and even device-level issues that sometimes occur.

This reduces false positives and helps avoid alert fatigue.

However, we do sometimes see micro-outages in the following scenarios:

the platform is restarting (usually to recover memory or auto-heal)
the platform is scaling (to meet demand or cost ceilings)
the cloud platform (i.e. Microsoft) is performing security patching
internal maintenance and feature releases, which we perform outside core business hours to improve and maintain the service.

These scenarios typically affect only a very small number of connections, and sometimes the tracking system is one of those connections. During the scenarios above, unless the outage lasts longer than expected, the platform will fail over to the secondary endpoint (instance or region). This means there is typically no impact on the majority of end users.

The tracking platform we use records these events because it tracks a primary connection to the primary server and checks the website once every minute. If the first connection to the primary server fails, the next check takes place one minute later.

This does not mean that the website is inaccessible during that time. It simply means that our health checks run at an interval that allows us to meet the SLA.
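
The effect of the one-minute check interval described above can be illustrated with a small Python sketch. It is a simplified model, not the behaviour of the actual tracking platform: it assumes a check fails only if it lands inside the interruption, and that each failed check is recorded as one full interval of downtime. All names are hypothetical.

```python
CHECK_INTERVAL_SECONDS = 60  # from the article: one check per minute


def check_passed(check_time, blip_start, blip_seconds):
    """Hypothetical model: a check fails only if it lands inside the blip."""
    return not (blip_start <= check_time < blip_start + blip_seconds)


def recorded_outage_seconds(check_times, blip_start, blip_seconds,
                            interval=CHECK_INTERVAL_SECONDS):
    """Assumption: every failed check is recorded as one full interval of
    downtime, because the next check only runs one interval later."""
    failed = sum(
        1 for t in check_times if not check_passed(t, blip_start, blip_seconds)
    )
    return failed * interval


if __name__ == "__main__":
    checks = [0, 60, 120, 180]  # seconds since monitoring started

    # A 5-second restart that happens to overlap the check at t=120.
    print(recorded_outage_seconds(checks, blip_start=118, blip_seconds=5))

    # The same 5-second restart falling between two checks.
    print(recorded_outage_seconds(checks, blip_start=121, blip_seconds=5))
```

Under these assumptions, a five-second restart is recorded as a full minute when it coincides with a check and is not recorded at all when it falls between checks, which is why a recorded event does not imply the site was unreachable for the whole minute.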

Outage alert process

Our support engineers will immediately acknowledge an outage internally. Should it stay Orange (Degradation) or Red (Outage) and our failover and high availability systems do not recover the website, we will start investigating and, if required, complete the procedures aligned to our Disaster Recovery processes.

We will notify the client immediately in these cases:

where an outage is persistent (longer than 3 minutes)
where an outage impacts customers and therefore risks not meeting our SLA
where it is related to a downstream system (rare instances).
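
The notification criteria above can be expressed as a simple rule. The following Python sketch is illustrative only: the OutageEvent type and its field names are hypothetical, and only the three-minute threshold comes from this article.

```python
from dataclasses import dataclass

# From the article: an outage is persistent when it lasts longer than 3 minutes.
PERSISTENT_THRESHOLD_SECONDS = 3 * 60


@dataclass
class OutageEvent:
    """Hypothetical record of an outage; field names are illustrative."""
    duration_seconds: int
    impacts_customers: bool   # end users affected, so the SLA is at risk
    downstream_related: bool  # caused by a downstream system


def should_notify_client(event: OutageEvent) -> bool:
    """True if any of the three notification criteria is met."""
    return (
        event.duration_seconds > PERSISTENT_THRESHOLD_SECONDS
        or event.impacts_customers
        or event.downstream_related
    )


if __name__ == "__main__":
    micro_outage = OutageEvent(60, impacts_customers=False, downstream_related=False)
    persistent = OutageEvent(240, impacts_customers=False, downstream_related=False)
    print(should_notify_client(micro_outage))
    print(should_notify_client(persistent))
```

In this sketch a one-minute micro-outage that is absorbed by failover does not trigger a notification, while an outage that meets any one of the criteria does.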
