Kubernetes – Health checks

Kubernetes provides health checking in three forms. First, in the form of HTTP or TCP checks, Kubernetes can attempt to connect to a particular endpoint and report a status of healthy on a successful connection. Second, application-specific health checks can be performed using command-line scripts: with an exec probe, Kubernetes runs a command from within the container, and anything that exits with a status of 0 is considered healthy.

Let’s take a look at a few health checks in action. First, we’ll create a new replication controller definition named nodejs-health-controller.yaml with a health check:

apiVersion: v1 
kind: ReplicationController 
metadata: 
  name: node-js 
  labels: 
    name: node-js 
spec: 
  replicas: 3 
  selector: 
    name: node-js 
  template: 
    metadata: 
      labels: 
        name: node-js 
    spec: 
      containers: 
      - name: node-js 
        image: jonbaier/node-express-info:latest 
        ports: 
        - containerPort: 80 
        livenessProbe: 
          # An HTTP health check  
          httpGet: 
            path: /status/ 
            port: 80 
          initialDelaySeconds: 30 
          timeoutSeconds: 1

Note the addition of the livenessProbe element. This is our core health check element. From here, we can specify httpGet, tcpSocket, or exec. In this example, we use httpGet to perform a simple check for a URI on our container. The probe will check the path and port specified and restart the container if it doesn’t successfully return.

Status codes between 200 and 399 are all considered healthy by the probe.

Finally, initialDelaySeconds gives us the flexibility to delay health checks until the pod has finished initializing. The timeoutSeconds value is simply the timeout value for the probe.

Let’s use our new health check-enabled controller to replace the old node-js RC. We can do this using the replace command, which will replace the replication controller definition:

$ kubectl replace -f nodejs-health-controller.yaml

Replacing the RC on its own won’t replace our containers because it still has three healthy pods from our first run. Let’s kill off those pods and let the updated ReplicationController replace them with containers that have health checks:

$ kubectl delete pods -l name=node-js

Now, after waiting a minute or two, we can list the pods in an RC and grab one of the pod IDs to inspect it a bit deeper with the describe command:

$ kubectl describe rc/node-js

The following screenshot is the result of the preceding command:

Description of node-js replication controller

Now, use the following command for one of the pods:

$ kubectl describe pods/node-js-7esbp

The following screenshot is the result of the preceding command:

Description of node-js-7esbp pod

At the top, we’ll see the overall pod details. Depending on your timing, under State, it will either show Running or Waiting with a CrashLoopBackOff reason and some error information. A bit below that, we can see information on our Liveness probe and we will likely see a failure count above 0. Further down, we have the pod events. Again, depending on your timing, you are likely to have a number of events for the pod. Within a minute or two, you’ll note a pattern of killing, started, and created events repeating over and over again. You should also see a note in the Killing entry that the container is unhealthy. This is our health check failing because we don’t have a page responding at /status.

You may note that if you open a browser to the service load balancer address, it still responds with a page. You can find the load balancer IP with a kubectl get services command.

This is happening for a number of reasons. First, the health check is simply failing because /status doesn’t exist, but the page where the service is pointed is still functioning normally between restarts. Second, the livenessProbe is only charged with restarting the container on a health check fail. There is a separate readinessProbe that will remove a container from the pool of pods answering service endpoints.

Let’s modify the health check to a page that does exist in our container, so that we have a proper health check. We’ll also add a readiness check and point it at the nonexistent status page. Open the nodejs-health-controller.yaml file, modify the spec section to match the following listing, and save it as nodejs-health-controller-2.yaml:

apiVersion: v1 
kind: ReplicationController 
metadata: 
  name: node-js 
  labels: 
    name: node-js 
spec: 
  replicas: 3 
  selector: 
    name: node-js 
  template: 
    metadata: 
      labels: 
        name: node-js 
    spec: 
      containers: 
      - name: node-js 
        image: jonbaier/node-express-info:latest 
        ports: 
        - containerPort: 80 
        livenessProbe: 
          # An HTTP health check  
          httpGet: 
            path: / 
            port: 80 
          initialDelaySeconds: 30 
          timeoutSeconds: 1 
        readinessProbe: 
          # An HTTP health check  
          httpGet: 
            path: /status/ 
            port: 80 
          initialDelaySeconds: 30 
          timeoutSeconds: 1

This time, we’ll delete the old RC, which will kill the pods with it, and create a new RC with our updated YAML file:

$ kubectl delete rc -l name=node-js
$ kubectl create -f nodejs-health-controller-2.yaml

Now, when we describe one of the pods, we only see the creation of the pod and the container. However, you’ll note that the service load balancer IP no longer works. If we run the describe command on one of the new pods, we’ll note a Readiness probe failed error message, but the pod itself continues running. If we change the readiness probe path to path: /, we’ll again be able to fulfill requests from the main service. Open nodejs-health-controller-2.yaml in an editor and make that update now. Then, once again, remove and recreate the replication controller:

$ kubectl delete rc -l name=node-js
$ kubectl create -f nodejs-health-controller-2.yaml

Now the load balancer IP should work once again. Keep these pods around as we will use them again in Chapter 3, Networking, Load Balancers, and Ingress.

TCP checks

Kubernetes also supports health checks via simple TCP socket checks and via custom command-line scripts.

The following snippets are examples of what both use cases look like in the YAML file. 

A health check using a command-line script:

livenessProbe: 
  exec: 
    command: 
    - /usr/bin/health/checkHttpService.sh 
  initialDelaySeconds: 90 
  timeoutSeconds: 1
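The script itself is up to you; Kubernetes only looks at the exit code. The following is a minimal sketch of what such a script might contain — the endpoint URL is an assumption, and the 200–399 healthy range mirrors the HTTP probe semantics described earlier:

```shell
#!/bin/sh
# Hypothetical sketch of a health check script for an exec probe.
# Kubernetes only cares about the exit code: 0 means healthy,
# anything else means unhealthy.

# Classify an HTTP status code the same way the httpGet probe does:
# anything in the 200-399 range counts as healthy.
healthy() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

check() {
  # Fetch only the status code, with a 1-second timeout to mirror
  # timeoutSeconds: 1. A failed connection yields code 000.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 1 "$1") || code=000
  healthy "$code"
}

# Probe the app's status page when a URL is passed on the command line.
if [ $# -gt 0 ]; then
  check "$1"
fi
```

Because the probe runs inside the container, the script would typically target localhost, for example `check http://localhost:80/status/`.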

A health check using a simple TCP socket connection:

livenessProbe: 
  tcpSocket: 
    port: 80 
  initialDelaySeconds: 15 
  timeoutSeconds: 1
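Under the hood, the tcpSocket probe simply attempts to open a TCP connection to the given port and considers the container healthy if the connection succeeds. A rough shell equivalent — a sketch for illustration, not what the kubelet actually runs — looks like this:

```shell
#!/bin/sh
# Rough illustration of tcpSocket probe semantics: try to open a TCP
# connection; exit 0 (healthy) on success, non-zero (unhealthy) on
# failure or timeout. Uses bash's /dev/tcp pseudo-device.
tcp_check() {
  host="$1"
  port="$2"
  # Mirror timeoutSeconds: 1 by bounding the connection attempt.
  timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

For example, `tcp_check 127.0.0.1 80` succeeds only if something is actually listening on port 80; no data needs to be exchanged.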

Life cycle hooks or graceful shutdown

As you run into failures in real-life scenarios, you may find that you want to take additional action before containers are shut down or right after they are started. Kubernetes actually provides life cycle hooks for just this kind of use case.

The following example controller definition, apache-hooks-controller.yaml, defines both a postStart action and a preStop action to take place before Kubernetes moves the container into the next stage of its life cycle:

apiVersion: v1 
kind: ReplicationController 
metadata: 
  name: apache-hook 
  labels: 
    name: apache-hook 
spec: 
  replicas: 3 
  selector: 
    name: apache-hook 
  template: 
    metadata: 
      labels: 
        name: apache-hook 
    spec: 
      containers: 
      - name: apache-hook 
        image: bitnami/apache:latest 
        ports: 
        - containerPort: 80 
        lifecycle: 
          postStart: 
            httpGet: 
              host: my.registration-server.com 
              path: /register/ 
              port: 80 
          preStop: 
            exec: 
              command: ["/usr/local/bin/apachectl", "-k", "graceful-stop"]

You’ll note that, for the postStart hook, we define an httpGet action, but for the preStop hook, we define an exec action. Just as with our health checks, the httpGet action attempts to make an HTTP call to the specific endpoint and port combination, while the exec action runs a local command in the container.

The httpGet and exec actions are both supported for the postStart and preStop hooks. In the case of preStop, a parameter named reason will be sent to the handler as a parameter. See the following table for valid values:

Reason parameter | Failure description
---------------- | -------------------
Delete           | Delete command issued via kubectl or the API
Health           | Health check fails
Dependency       | Dependency failure, such as a disk mount failure or a default infrastructure pod crash

Valid preStop reasons
Check out the references section here: https://github.com/kubernetes/kubernetes/blob/release-1.0/docs/user-guide/container-environment.md#container-hooks.

It’s important to note that hook calls are delivered at least once. Therefore, any logic in the action should gracefully handle multiple calls. Another important note is that postStart runs before a pod enters its ready state. If the hook itself fails, the pod will be considered unhealthy.
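Because a hook may fire more than once, a postStart handler such as the registration call above should be safe to repeat. One common pattern is to record successful completion in a marker file and short-circuit on subsequent calls. A minimal sketch follows; the marker path and the registration step are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical idempotent postStart handler. Hooks are delivered at
# least once, so repeated invocations must have the same effect as one.
MARKER="${MARKER:-/tmp/.registered}"

register() {
  # Short-circuit if a previous invocation already completed.
  if [ -f "$MARKER" ]; then
    echo "already registered"
    return 0
  fi
  # Placeholder for the real registration work, for example an HTTP
  # call to the registration server.
  echo "registering"
  touch "$MARKER"
}
```

Running `register` twice performs the registration work only on the first call; the second call detects the marker and returns immediately.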
