Your contract with Dataweavers provides for an uptime SLA.
This is typically 99.9%, 99.95% or 99.99%.
By definition, this means we allow a certain number of minutes of unplanned outages per month. See https://uptime.is/ for the calculation of outage allowance across days, months and years.
For example, a 99.9% (three nines) SLA would allow downtime of:
- Daily: 1m 26s
- Weekly: 10m 4s
- Monthly: 43m 49s
- Quarterly: 2h 11m 29s
- Yearly: 8h 45m 56s
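The arithmetic behind these allowances can be sketched as follows. This is a minimal illustration, not a contractual calculator: it assumes an average Gregorian year of 365.2425 days for the monthly, quarterly and yearly periods, and truncates to whole seconds, which reproduces the figures listed above. The function names are hypothetical.

```python
# Sketch: downtime allowed by an uptime SLA over different periods.
# Assumption: month/quarter/year are averages based on a 365.2425-day year,
# and results are truncated (not rounded) to whole seconds.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425

PERIOD_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": DAYS_PER_YEAR / 12,
    "Quarterly": DAYS_PER_YEAR / 4,
    "Yearly": DAYS_PER_YEAR,
}


def allowed_downtime_seconds(sla_percent, period_days):
    """Seconds of downtime permitted in a period at the given SLA."""
    return period_days * SECONDS_PER_DAY * (100 - sla_percent) / 100


def format_duration(seconds):
    """Render a duration as e.g. '2h 11m 29s', dropping leading zero units."""
    total = int(seconds)  # truncate fractional seconds
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return " ".join(parts)


def downtime_table(sla_percent):
    """Map each period name to its formatted downtime allowance."""
    return {
        name: format_duration(allowed_downtime_seconds(sla_percent, days))
        for name, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for name, allowance in downtime_table(99.9).items():
        print(f"- {name}: {allowance}")
```

Calling `downtime_table(99.95)` or `downtime_table(99.99)` gives the equivalent allowances for the other SLA tiers.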
An outage, by general definition, is when the website fails to return a 200 response or to generate the HTML that you expect. Our tracking systems test from multiple locations and networks around the globe. They are also designed to tolerate transient issues such as bad DNS resolution/propagation, general internet congestion, weak mobile connections and even device-level issues that sometimes occur.
This reduces false positives and helps avoid alert fatigue.
However, we do sometimes see micro-outages in the following scenarios:
- the platform is restarting (usually to recover memory or auto-heal)
- the platform is scaling (to meet demand or cost ceilings)
- the cloud platform (i.e. Microsoft themselves) is performing security patching
- internal maintenance and feature releases, which we perform outside core business hours to improve and maintain the service.
These scenarios typically affect only a very small number of connections, and sometimes the tracking system is one of those connections. From a customer perspective, unless the outage lasts longer than expected, the platform will fail over to the secondary endpoint (instance or region) during the scenarios above. This means there is typically no impact on the majority of end users.
The tracking platform we use records these events because it tracks a primary connection to the primary server, and it checks the website once every minute. This means that if a connection to the primary server fails, it will be another minute before the tracking platform checks again.
This does not mean that the website is inaccessible during that time; it simply means that our health checks are set to an interval that allows us to meet the SLA.
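To illustrate why a brief blip can appear as a full minute in reports, here is a small sketch. It assumes, purely for illustration, that each failed check is attributed one full check interval of downtime; the actual accounting used by the tracking platform may differ, and the function name is hypothetical.

```python
# Sketch: how a one-minute check interval turns a brief blip into a
# recorded minute of downtime.
# Assumption (not confirmed by the tracking platform): every failed check
# is counted as one full interval of downtime.
CHECK_INTERVAL_SECONDS = 60  # from the text: one check every minute


def recorded_downtime_seconds(check_results, interval_seconds=CHECK_INTERVAL_SECONDS):
    """Return the downtime a monitor would record for a series of checks.

    check_results is a sequence of booleans, one per check, where True
    means the primary endpoint answered as expected.
    """
    failed_checks = sum(1 for ok in check_results if not ok)
    return failed_checks * interval_seconds


if __name__ == "__main__":
    # One failed check between successful ones, e.g. during a restart.
    checks = [True, True, False, True, True]
    print(recorded_downtime_seconds(checks))  # 60
```

Under this assumption, a restart that interrupts the primary connection for only a few seconds is still logged as 60 seconds, even though most users were served by the secondary endpoint.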
Outage alert process
Our support engineers on duty in the relevant timezone will immediately acknowledge an outage internally, then wait up to 5 minutes for the platform to report the website as "Green". Should it stay Orange (Degradation) or Red (Outage), and our failover and high availability systems do not recover the website, we will start investigating and complete the procedures aligned to our Disaster Recovery mandate.
We will notify the client immediately in these cases:
- where an outage is persistent
- where an outage impacts customers and therefore puts our SLA at risk
- in some rare cases where it is related to a downstream system.
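The process above can be summarised as a small decision sketch. This is an illustration only, not our internal tooling: the names `Status`, `next_step` and `should_notify_client` are hypothetical, and the sketch assumes that a site recovered by failover is treated the same as one reported Green.

```python
# Sketch of the outage alert process described above.
# All names here are hypothetical and exist only for illustration.
from enum import Enum


class Status(Enum):
    GREEN = "Green"    # healthy
    ORANGE = "Orange"  # degradation
    RED = "Red"        # outage


GRACE_PERIOD_MINUTES = 5  # from the text: wait up to 5 minutes


def next_step(status, minutes_since_alert, failover_recovered):
    """Decide what the on-duty engineer does after acknowledging an alert."""
    # Assumption: recovery via failover counts the same as a Green report.
    if status is Status.GREEN or failover_recovered:
        return "close"
    if minutes_since_alert < GRACE_PERIOD_MINUTES:
        return "wait"
    # Still Orange or Red after the grace period, and not recovered.
    return "investigate"


def should_notify_client(persistent, impacts_customers, downstream_related):
    """Any one of the three listed conditions triggers a client notification."""
    return persistent or impacts_customers or downstream_related


if __name__ == "__main__":
    print(next_step(Status.RED, 2, False))     # wait
    print(next_step(Status.ORANGE, 5, False))  # investigate
    print(should_notify_client(False, False, True))  # True
```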