What do uptimes, outages and the alert process look like?

Lynsey Jackson

Uptime SLAs

Your contract with Dataweavers provides for certain uptime SLAs.

This is typically 99.9%, 99.95% or 99.99%.

By definition, this means we allow a certain number of minutes of unplanned outages per month. See https://uptime.is/ for the calculation of outage allowance across days, months and years. 

For example, a 99.9% (three nines) SLA would allow downtime of:

  • Daily: 1m 26s
  • Weekly: 10m 4s
  • Monthly: 43m 49s
  • Quarterly: 2h 11m 29s
  • Yearly: 8h 45m 56s
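
The figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the calculator used by uptime.is: it assumes an average year of 365.2425 days (which matches the monthly, quarterly and yearly figures listed) and truncates to whole seconds. The function names are hypothetical.

```python
# Average year length in seconds. The 365.2425-day year is an assumption,
# chosen because it reproduces the figures listed above.
YEAR_SECONDS = 365.2425 * 24 * 60 * 60

PERIOD_SECONDS = {
    "Daily": 24 * 60 * 60,
    "Weekly": 7 * 24 * 60 * 60,
    "Monthly": YEAR_SECONDS / 12,
    "Quarterly": YEAR_SECONDS / 4,
    "Yearly": YEAR_SECONDS,
}


def allowed_downtime_seconds(sla_percent: float, period: str) -> int:
    """Return the downtime allowance for a period, truncated to whole seconds."""
    return int(PERIOD_SECONDS[period] * (1 - sla_percent / 100))


def format_duration(seconds: int) -> str:
    """Format whole seconds in the 'Xh Ym Zs' style used in this article."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return " ".join(parts)


if __name__ == "__main__":
    for period in PERIOD_SECONDS:
        allowance = allowed_downtime_seconds(99.9, period)
        print(f"{period}: {format_duration(allowance)}")
```

Changing `99.9` to `99.95` or `99.99` gives the allowance for the other SLA levels.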

Tracking outages

An outage, by general definition, is when the website fails to return an HTTP 200 response with the HTML that you expect. Our tracking systems test from multiple locations and networks around the globe. They are also designed to tolerate transient issues such as bad DNS resolution or propagation, general internet congestion, weak mobile connections and even device-level issues that sometimes occur.

This reduces false positives and helps avoid alert fatigue.

However, we do sometimes see micro-outages in the following scenarios:

  • the platform is restarting (usually to recover memory or auto-heal)
  • the platform is scaling (to meet demand or cost ceilings)
  • the cloud platform (i.e. Microsoft themselves) is performing security patching
  • internal maintenance, where we carry out platform upkeep and feature releases outside core business hours to improve and maintain the service.

These scenarios typically affect only a very small number of connections, and sometimes the tracking system is one of those connections. For customers, unless the outage lasts longer than expected, the platform will fail over to the secondary endpoint (instance or region) during the scenarios above. This means there is typically no impact on the majority of end users.

The tracking platform we use records these events because it tracks a primary connection to the primary server and checks the website once every minute. This means that if a connection to the primary server fails, it will be another minute before the platform checks again.

This does not mean that the website is inaccessible during that time. It simply means that our health checks run at an interval that allows us to meet the SLA.
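
To illustrate why a brief blip can appear larger in the tracking data than it really was, here is a simplified model of a fixed-interval checker. It is a sketch under stated assumptions, not a description of the tracking platform's internals: it assumes one check every 60 seconds and that each failed check is counted as a full interval of downtime. The function name and parameters are hypothetical.

```python
CHECK_INTERVAL_SECONDS = 60  # check frequency stated in this article


def recorded_downtime_seconds(outage_start: int, outage_end: int,
                              horizon: int = 600) -> int:
    """Downtime a fixed-interval checker would record for one real outage.

    The real outage covers [outage_start, outage_end) in seconds. Checks run
    at 0, 60, 120, ... up to `horizon`. Assumption: every failed check is
    counted as one full interval of downtime.
    """
    recorded = 0
    for check_time in range(0, horizon, CHECK_INTERVAL_SECONDS):
        if outage_start <= check_time < outage_end:
            recorded += CHECK_INTERVAL_SECONDS
    return recorded


if __name__ == "__main__":
    # A 5-second blip that happens to coincide with the check at t=120.
    print(recorded_downtime_seconds(118, 123))
    # A 29-second blip that falls entirely between two checks.
    print(recorded_downtime_seconds(121, 150))
```

In this model the first blip is recorded as a full minute even though it lasted 5 seconds, while the second is not recorded at all.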

Outage alert process

Our support engineers on duty in the relevant time zone will immediately acknowledge an outage internally, then wait up to 5 minutes for the platform to report the status as "Green". Should it stay Orange (Degradation) or Red (Outage), and our failover and high-availability systems do not recover the website, we will start investigating and complete the procedures aligned to our DEFCON mandate.
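
The decision described above can be sketched as a small function. This is an illustration only, not our internal tooling: the `Status` enum, the `next_action` function and its return values are hypothetical names, and the 5-minute grace period is taken from the paragraph above.

```python
from enum import Enum


class Status(Enum):
    GREEN = "Green"
    ORANGE = "Orange (Degradation)"
    RED = "Red (Outage)"


GRACE_PERIOD_SECONDS = 5 * 60  # "wait up to 5 minutes"


def next_action(status: Status, seconds_since_alert: int) -> str:
    """Return the next step for an alert that has already been acknowledged."""
    if status is Status.GREEN:
        # The platform recovered on its own (for example via failover).
        return "resolved"
    if seconds_since_alert < GRACE_PERIOD_SECONDS:
        # Still inside the grace period: give the platform time to recover.
        return "wait"
    # Still Orange or Red after the grace period: start investigating.
    return "investigate"


if __name__ == "__main__":
    print(next_action(Status.RED, 60))
    print(next_action(Status.GREEN, 120))
    print(next_action(Status.ORANGE, 300))
```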

We will notify the client immediately in these cases:

  1. where an outage is persistent
  2. where an outage impacts customers and therefore risks not meeting our SLA
  3. in some rare cases where it is related to a downstream system.

Reporting outages

The tracking platform is configured to be sensitive so that our operations team can monitor both actual and potential issues. Typically, we report on these once a month in your Monthly Service Review.

Premium support

If your business requires extended support coverage to maintain continuity and visibility, and the Standard 99.9% SLA is not sufficient, you may like to consider our Premium offering. This provides a higher 99.99% SLA and a dedicated technical contact with customised emergency/P1 procedures and reporting.

Please get in touch with us at support@dataweavers.com or via your account manager to discuss the best option for you.
