
Kubernetes – Introduction to high availability

In order to understand our goals for this chapter, we first need to talk about the more general terms of high availability and scalability. Let’s look at each individually to understand how the pieces work together.

We’ll discuss the required terminology and begin to understand the building blocks that we’ll use to conceptualize, construct, and run a Kubernetes cluster in the cloud.

Let’s dig into high availability, uptime, and downtime.

How do we measure availability?

High availability (HA) is the idea that your application is available, meaning reachable, to your end users. In order to create highly available applications, your application code and the frontend that users interact with need to be available the majority of the time. This term comes from the system design field, which defines the architecture, interface, data, and modules of a system in order to satisfy a given set of requirements. There are many examples of system design in disciplines from product development all the way to distributed systems theory. In HA, system design helps us understand the logical and physical design requirements to achieve a reliable and performant system.

In the industry, we refer to excellence in availability as five nines of availability. This 99.999% availability translates into specific amounts of downtime per day, week, month, and year.

If you’d like to read more about the math behind the five nines availability equation, you can read about floor and ceiling functions here: https://en.wikipedia.org/wiki/Floor_and_ceiling_functions.

We can also look at the general availability formula, which you can use to understand a given system’s availability:

Downtime per year in hours = (1 - Uptime Availability) x 365 x 24
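As a quick sanity check of this formula, here is a minimal Python sketch. The helper name downtime_hours_per_year is hypothetical, and it assumes the availability is passed as a fraction (0.999) rather than a percentage (99.9).

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year, matching the formula above


def downtime_hours_per_year(uptime_availability: float) -> float:
    """Yearly downtime in hours for an availability given as a fraction, e.g. 0.999."""
    if not 0.0 <= uptime_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - uptime_availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for availability in (0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(availability)
        print(f"{availability:.3%} available -> {hours:.4f} hours of downtime per year")
```

For three nines (0.999), this works out to roughly 8.76 hours of downtime per year.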

Uptime and downtime

Let’s dig into what it means to be up or down before we look at net availability over a daily, weekly, and yearly period. We should also establish a few key terms in order to understand what availability means to our business.

Uptime

Uptime is the measure of time a given system, application, network, or other logical and physical object has been up and available to be used by the appropriate end user. This can be an internally facing system, an external item, or something that’s only interacted with via other computer systems.

Downtime

Downtime is similar to uptime, but measures the time in which a given system, application, network, or other logical and physical object is not available to the end user. Downtime is subject to some interpretation, as it’s defined as a period where the system is not performing its primary function as originally intended. The most ubiquitous example of downtime is the infamous 404 page, which you may have seen before.

In order to understand the availability of your system with the preceding concepts, we can calculate it using the available uptime and downtime figures:

Availability Percentage = (Uptime / (Uptime + Downtime)) x 100
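The calculation can be sketched in a few lines of Python. The function name availability_percentage and the sample figures (a 720-hour month with one hour of downtime) are illustrative assumptions, not measurements from a real cluster. Uptime and downtime only need to share the same unit.

```python
def availability_percentage(uptime: float, downtime: float) -> float:
    """Availability as a percentage; uptime and downtime must use the same unit."""
    total = uptime + downtime
    if uptime < 0 or downtime < 0 or total == 0:
        raise ValueError("uptime and downtime must be non-negative and not both zero")
    return uptime / total * 100


if __name__ == "__main__":
    # Hypothetical month: 720 hours in total, of which 1 hour was downtime.
    print(f"{availability_percentage(uptime=719, downtime=1):.3f}%")
```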

There is a more complex calculation for systems that have redundant pieces that increase the overall stability of a system, but let’s stick with our concrete example for now. We’ll investigate the redundant pieces of Kubernetes later on in this chapter.

Given these equations, which you can use on your own in order to measure the uptime of your Kubernetes cluster, let’s look at a few examples.

To get started with the math behind these concepts, uptime availability is Mean Time Between Failures (MTBF) divided by the sum of Mean Time to Repair (MTTR) and MTBF.

We can calculate MTBF as follows:

MTBF = 'Total hours in a year' / 'Number of yearly failures'

And MTTR is represented as follows:

MTTR = ('Number of failures' x 'Time to repair the system') / 'Total number of failures'

Uptime availability and yearly downtime then follow from these two values:

Uptime Availability = MTBF / (MTTR + MTBF)
Downtime per year in hours = (1 - Uptime Availability) x 365 x 24
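To see how these formulas fit together, here is a minimal Python sketch. The function names and the sample scenario (four failures per year, each taking two hours to repair) are assumptions chosen for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year, as in the downtime formula


def mtbf_hours(failures_per_year: int) -> float:
    """Mean Time Between Failures: total hours in a year / number of yearly failures."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be positive")
    return HOURS_PER_YEAR / failures_per_year


def uptime_availability(mtbf: float, mttr: float) -> float:
    """Uptime availability as a fraction: MTBF / (MTTR + MTBF)."""
    return mtbf / (mttr + mtbf)


def downtime_hours_per_year(availability: float) -> float:
    """Yearly downtime in hours: (1 - availability) x 365 x 24."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical scenario: 4 failures a year, 2 hours to repair each one.
    mtbf = mtbf_hours(failures_per_year=4)
    mttr = 2.0
    availability = uptime_availability(mtbf, mttr)
    print(f"MTBF: {mtbf:.1f} hours")
    print(f"Availability: {availability:.5%}")
    print(f"Downtime per year: {downtime_hours_per_year(availability):.2f} hours")
```

In this scenario, the yearly downtime comes out slightly below the eight hours spent on repairs, because the formula treats each repair window as part of the failure cycle rather than adding it on top of the year.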

The five nines of availability

Let’s compare the industry standard of five nines of availability with fewer nines. We can use the term Service Level Agreement (SLA) to understand the contract between the end user and the Kubernetes operator that guarantees the availability of the underlying hardware and Kubernetes software to your application owners.

An SLA is a guaranteed level of availability. It’s important to note that availability becomes very expensive as it increases.

Here are a few SLA levels:

  • With an SLA of 99.9% availability, you can have a downtime of:
    • Daily: 1 minute, 26.4 seconds
    • Weekly: 10 minutes, 4.8 seconds
    • Monthly: 43 minutes, 49.7 seconds
    • Yearly: 8 hours, 45 minutes, 57.0 seconds
  • With an SLA of 99.99% availability, you can have a downtime of:
    • Daily: 8.6 seconds
    • Weekly: 1 minute, 0.5 seconds
    • Monthly: 4 minutes, 23.0 seconds
    • Yearly: 52 minutes, 35.7 seconds
  • With an SLA of 99.999% availability, you can have a downtime of:
    • Daily: 0.9 seconds
    • Weekly: 6.0 seconds
    • Monthly: 26.3 seconds
    • Yearly: 5 minutes, 15.6 seconds
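The figures in this list can be reproduced with a short Python sketch. It assumes an average year of 365.2425 days and a month of one twelfth of that, which appears to be the convention behind the numbers above (the simpler 365 x 24 formula gives slightly smaller yearly values). The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425  # assumption: average Gregorian year

PERIOD_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": DAYS_PER_YEAR / 12,
    "Yearly": DAYS_PER_YEAR,
}


def allowed_downtime_seconds(sla_percent: float, period_days: float) -> float:
    """Seconds of downtime permitted by an SLA percentage over a period."""
    return (1 - sla_percent / 100) * period_days * SECONDS_PER_DAY


def _unit(value: int, name: str) -> str:
    """Format a whole-number unit with a simple plural rule."""
    return f"{value} {name}" + ("" if value == 1 else "s")


def format_duration(seconds: float) -> str:
    """Render seconds as hours, minutes, and seconds with one decimal place."""
    seconds = round(seconds, 1)
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    parts = []
    if hours:
        parts.append(_unit(int(hours), "hour"))
    if minutes:
        parts.append(_unit(int(minutes), "minute"))
    parts.append(f"{secs:.1f} seconds")
    return ", ".join(parts)


if __name__ == "__main__":
    for sla in (99.9, 99.99, 99.999):
        print(f"SLA of {sla}% availability:")
        for period, days in PERIOD_DAYS.items():
            downtime = allowed_downtime_seconds(sla, days)
            print(f"  {period}: {format_duration(downtime)}")
```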

As you can see, with five nines of availability, you don’t have a lot of room to breathe with your Kubernetes cluster. It’s also important to note that the availability of your application is a function of your cluster’s availability.

What does that mean? Well, the application itself will also have problems and code errors that are outside of the domain and control of the Kubernetes cluster. So, the uptime and availability of a given application will be at most equal to your cluster’s general availability and, given human error, almost always lower.
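One common way to make this concrete is a series model: if the application can only serve users while both the cluster and the application code are healthy, and their failures are independent, the combined availability is the product of the two. Both the independence assumption and the sample figures below are illustrative and not taken from a real system.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of components that must all be up, assuming independent failures."""
    result = 1.0
    for availability in availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("each availability must be a fraction between 0 and 1")
        result *= availability
    return result


if __name__ == "__main__":
    cluster = 0.9999      # hypothetical cluster availability (four nines)
    application = 0.999   # hypothetical availability of the application code itself
    total = combined_availability(cluster, application)
    print(f"End-to-end availability: {total:.4%}")
    # The result can never exceed the cluster's own availability.
    print(total <= cluster)
```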

So, let’s figure out the pieces of HA in Kubernetes.
