Kubernetes – Maturing our monitoring operations

While Grafana gives us a great start to monitoring our container operations, it is still a work in progress. In the real world of operations, having a complete dashboard view is great once we know there is a problem. However, in everyday scenarios, we’d prefer to be proactive and actually receive notifications when issues arise. This kind of alerting capability is a must to keep the operations team ahead of the curve and out of reactive mode.

There are many solutions available in this space, and we will take a look at two in particular: GCE monitoring (Stackdriver) and Sysdig.

GCE (Stackdriver)

Stackdriver is a great place to start for infrastructure in the public cloud. It is actually owned by Google, so it’s integrated as the Google Cloud Platform monitoring service. Before your lock-in alarm bells start ringing, Stackdriver also has solid integration with AWS. In addition, Stackdriver has alerting capabilities, with support for notifications to a variety of platforms and webhooks for anything else.

Signing up for GCE monitoring

In the GCE console, in the Stackdriver section, click on Monitoring. This will open a new window, where we can sign up for a free trial of Stackdriver. We can then add our GCP project and optionally an AWS account as well. This requires a few more steps, but instructions are included on the page. Finally, we’ll be given instructions on how to install the agents on our cluster nodes. We can skip this for now, but will come back to it in a minute.

Click on Continue, set up your daily alerts, and click on Continue again.

Click on Launch Monitoring to proceed. We’ll be taken to the main dashboard page, where we will see some basic statistics on the nodes in our cluster. If we select Resources from the side menu and then Instances, we’ll be taken to a page with all our nodes listed. By clicking on an individual node, we can again see some basic information, even without an agent installed.

Stackdriver also offers monitoring and logging agents that can be installed on the nodes. However, it currently does not support the container OS that is used by default in the GCE kube-up script. You can still see the basic metrics for any nodes in GCE or AWS, but will need to use another OS if you want a detailed agent installation.

Alerts

Next, we can look at the alerting policies available as part of the monitoring service. From the instance details page, click on the Create Alerting Policy button in the Incidents section at the top of the page.

We will click on Add Condition and select a Metric Threshold. In the Target section, set RESOURCE TYPE to Instance (GCE). Then, set APPLIES TO to Group and kubernetes. Leave CONDITION TRIGGERS IF set to Any Member Violates.

In the Configuration section, leave IF METRIC as CPU Usage (GCE Monitoring) and CONDITION as above. Now, set THRESHOLD to 80 and set the time in FOR to 5 minutes.

Then click on Save Condition:

Google Cloud Monitoring alert policy

Next, we will add a notification. In the Notification section, leave Method as Email and enter your email address.

We can skip the Documentation section, but this is where we can add text and formatting to alert messages.

Finally, name the policy Excessive CPU Load and click on Save Policy.

Now, whenever the CPU from one of our instances goes above 80 percent, we will receive an email notification. If we ever need to review our policies, we can find them in the Alerting drop-down and then in Policies Overview in the menu on the left-hand side of the screen.

Beyond system monitoring with Sysdig

Monitoring our cloud systems is a great start, but what about the visibility of the containers themselves? Although there are a variety of cloud monitoring and visibility tools, Sysdig stands out for its ability to dive deep, not only into system operations, but specifically containers.

Sysdig is open source and is billed as a universal system visibility tool with native support for containers. It is a command line tool that provides insight into the areas we looked at earlier, such as storage, network, and system processes. What sets it apart is the level of detail and visibility it offers for these process and system activities. Furthermore, it has native support for containers, which gives us a full picture of our container operations. This is a highly recommended tool for your container operations arsenal. The main website of Sysdig is http://www.sysdig.org/.

Sysdig Cloud

We will take a look at the Sysdig tool and some of the useful command line-based UIs in a moment. However, the team at Sysdig has also built a commercial product, named Sysdig Cloud, which provides the advanced dashboard, alerting, and notification services we discussed earlier in the chapter. The differentiator here is the high level of visibility it offers into containers, including some nice visualizations of our application topology.

If you’d rather skip the Sysdig Cloud section and just try out the command-line tool, simply skip to The Sysdig command line section later in this chapter.

If you have not done so already, sign up for Sysdig Cloud at http://www.sysdigcloud.com.

After activating and logging in for the first time, we’ll be taken to a welcome page. Clicking on Next, we are shown a page with various options to install the Sysdig agents. For our example environment, we will use the Kubernetes setup. Selecting Kubernetes will give you a page with your API key and a link to instructions. The instructions will walk you through how to create a Sysdig agent DaemonSet on your cluster. Don’t forget to add the API key from the install page. 
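For reference, the agent DaemonSet that the instructions walk you through looks roughly like the following. Treat this as a sketch rather than the exact manifest: the image tag, mount points, and environment variable names should come from the install page, and the ACCESS_KEY value below is just a placeholder for the API key you copied.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysdig-agent
spec:
  selector:
    matchLabels:
      app: sysdig-agent
  template:
    metadata:
      labels:
        app: sysdig-agent
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: sysdig-agent
        image: sysdig/agent
        securityContext:
          privileged: true
        env:
        - name: ACCESS_KEY
          value: "<API key from the install page>"
        volumeMounts:
        - mountPath: /host/proc
          name: proc-vol
          readOnly: true
        - mountPath: /host/dev
          name: dev-vol
      volumes:
      - name: proc-vol
        hostPath:
          path: /proc
      - name: dev-vol
        hostPath:
          path: /dev

Save it to a file such as sysdig-daemonset.yaml, create it with kubectl create -f, and the agents should start reporting in shortly afterward.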

We will not be able to continue on the install page until the agents connect. After creating the DaemonSet and waiting a moment, the page should continue to the AWS integration page. You can fill this out if you like, but for this walk-through, we will click on Skip. Then, click on Let’s Get Started.

As of this writing, Sysdig and Sysdig Cloud were not fully compatible with the latest container OS deployed by default in the GCE kube-up script, Container-optimized OS from Google: https://cloud.google.com/container-optimized-os/docs.

We’ll be taken to the main Sysdig Cloud dashboard screen, where we should see at least two minion nodes appear under the Explore tab, similar to the following screenshot:

Sysdig Cloud Explore page

This page shows us a table view, and the links on the left let us explore some key metrics for CPU, memory, networking, and so on. Although this is a great start, the detailed views will give us a much deeper look at each node.

Detailed views

Let’s take a look at these views. Select one of the minion nodes and then scroll down to the detail section that appears below. By default, we should see the System: Overview by Process view (if it’s not selected, just click on it from the list on the left-hand side). If the chart is hard to read, simply use the maximize icon in the top-left corner of each graph for a larger view.

There are a variety of interesting views to explore. Just to call out a few others, Services | HTTP Overview and Hosts & Containers | Overview by Container give us some great charts for inspection. In the latter view, we can see stats for CPU, memory, network, and file usage by container.

Topology views

In addition, there are three topology views at the bottom. These views are perfect for helping us understand how our application is communicating. Click on Topology | Network Traffic and wait a few seconds for the view to fully populate. It should look similar to the following screenshot:

Sysdig Cloud network topology view

Note that the view maps out the flow of communication between the minion nodes and the master in the cluster. You may also note a + symbol in the top corner of the node boxes. Click on that in one of the minion nodes and use the zoom tools at the top of the view area to zoom into the details, as shown in the following screenshot:

The Sysdig Cloud network topology detailed view

Note that we can now see all the components of Kubernetes running inside the master. We can see how the various components work together. We can see kube-proxy and the kubelet process running, as well as a number of boxes with the Docker whale, which indicate that they are containers. If we zoom in and use the plus icon, we can see that these are the containers for our pods and core Kubernetes processes, as we saw in the services running on the master section in Chapter 1, Introduction to Kubernetes.

Also, if you have the master included in your monitored nodes, you can watch kubelet initiate communication from a minion and follow it all the way through to the kube-apiserver container in the master.

We can even sometimes see the instance communicating with the GCE infrastructure to update metadata. This view is great in order to get a mental picture of how our infrastructure and underlying containers are talking to one another.

Metrics

Next, let’s switch over to the Metrics tab in the left-hand menu next to Views. Here, there are also a variety of helpful views.

Let’s look at capacity.estimated.request.total.count in System. This view shows us an estimate of how many requests a node is capable of handling when fully loaded. This can be really useful for infrastructure planning:

Sysdig Cloud capacity estimate view

Alerting

Now that we have all this great information, let’s create some notifications. Scroll back up to the top of the page and find the bell icon next to one of your minion entries. This will open a Create Alert dialog. Here, we can set manual alerts similar to what we did earlier in the chapter. However, there is also the option to use BASELINE and HOST COMPARISON.

Using the BASELINE option is extremely helpful, as Sysdig will watch the historical patterns of the node and alert us whenever one of the metrics strays outside the expected metric thresholds. No manual settings are required, so this can really save time for the notification setup and help our operations team to be proactive before issues arise. Refer to the following screenshot:

Sysdig Cloud new alert

The HOST COMPARISON option is also a great help as it allows us to compare metrics with other hosts and alert whenever one host has a metric that differs significantly from the group. A great use case for this is monitoring resource usage across minion nodes to ensure that our scheduling constraints are not creating a bottleneck somewhere in the cluster.

You can choose whichever option you like and give it a name and warning level. Enable the notification method. Sysdig supports email, SNS (short for Simple Notification Service), and PagerDuty as notification methods. You can optionally enable Sysdig Capture to gain deeper insight into issues. Once you have everything set, just click on Create and you will start to receive alerts as issues come up.

The Sysdig command line

Whether you only use the open source tool or you are trying out the full Sysdig Cloud package, the command line utility is a great companion for tracking down issues or getting a deeper understanding of your system.

In the core tool, there is the main sysdig utility and also a command line-style UI named csysdig. Let’s take a look at a few useful commands.

Find the relevant installation instructions for your OS here: http://www.sysdig.org/install/.
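On most Linux distributions, the quick-install route at the time of writing was a one-line script; verify the current command against the install page before piping anything to a shell:

$ curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | sudo bash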

Once installed, let’s first look at the process with the most network activity by issuing the following command:

$ sudo sysdig -pc -c topprocs_net

The following screenshot is the result of the preceding command:

A Sysdig top process by network activity

This is an interactive view that shows us the top processes in terms of network activity. There are also a plethora of commands to use with sysdig. A few other useful commands to try out include the following:

$ sudo sysdig -pc -c topprocs_cpu
$ sudo sysdig -pc -c topprocs_file
$ sudo sysdig -pc -c topprocs_cpu container.name=<Container Name NOT ID>

More examples can be found at http://www.sysdig.org/wiki/sysdig-examples/.

The Csysdig command-line UI

Just because we are in a shell on one of our nodes doesn’t mean we can’t have a UI. Csysdig is a customizable UI for exploring all the metrics and insight that Sysdig provides. Simply type csysdig at the prompt:

$ csysdig

After entering csysdig, we will see a real-time listing of all processes on the machine. At the bottom of the screen, you’ll note a menu with various options. Click on Views or press F2 if you love to use your keyboard. In the left-hand menu, there are a variety of options, but we’ll look at threads. Double-click on Threads.

On some operating systems and with some SSH clients, you may have issues with the function keys. Check the settings on your terminal and make sure that the function keys are using the VT100+ sequences.

We can see all the threads currently running on the system and some information about their resource usage. By default, we see a big list that is updated often. If we click on Filter (F4 for the mouse-challenged), we can slim down the list.

Type kube-apiserver, if you are on the master, or kube-proxy, if you are on a node (minion), in the filter box and press Enter. The view now filters for only the threads in that command:

Csysdig threads

If we want to inspect this a little further, we can simply select one of the threads in the list and click on Dig or press F6. Now, we see a detailed listing of system calls from the command in real time. This can be a really useful tool to gain deep insight into the containers and processes running on our cluster.

Click on Back or press the Backspace key to go back to the previous screen. Then, go to Views once more. This time, we will look at the Containers view. Once again, we can filter and also use the Dig view to get more in-depth visibility into what is happening at the system call level.

Another menu item you might note here is Actions, which is available in the newest release. These features allow us to go from process monitoring to action and response. It gives us the ability to perform a variety of actions from the various process views in Csysdig. For example, the container view has actions to drop into a Bash shell, kill containers, inspect logs, and more. It’s worth getting to know the various actions and hotkeys, and even add your own custom hotkeys for common operations.

Prometheus

A newcomer to the monitoring scene is Prometheus, an open source monitoring tool originally built by a team at SoundCloud. You can find out more about the project at https://prometheus.io.

The project’s website lists the following features:

  • A multi-dimensional data model (https://prometheus.io/docs/concepts/data_model/) (the time series are identified by their metric name and key/value pairs)
  • A flexible query language (https://prometheus.io/docs/prometheus/latest/querying/basics/) to leverage this dimensionality
  • No reliance on distributed storage; single-server nodes are autonomous
  • Time series collection happens via a pull model over HTTP
  • Pushing time series (https://prometheus.io/docs/instrumenting/pushing/) is supported via an intermediary gateway
  • Targets are discovered via service discovery or static configuration (a minimal static example is sketched just after this list)
  • Multiple modes of graphing and dashboard support
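To make the pull model and static configuration concrete, a node (or an exporter running on it) simply serves its metrics as plain text over HTTP. A scrape of a node exporter, for example, returns lines roughly like these:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.27585e+06

Prometheus is then pointed at those endpoints. In the simplest case, a static prometheus.yml might look like the following sketch, where the target address is just a placeholder for wherever your exporter listens:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']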

Prometheus summary

Prometheus offers a lot of value to the operators of a Kubernetes cluster. Let’s look at some of the more important dimensions of the software:

  • Simple to operate: It was built to run as individual servers using local storage for reliability
  • It’s precise: You can use PromQL, the built-in query language, to define alerts and provide a multi-dimensional view of status
  • Lots of libraries: You can use more than ten languages and numerous client libraries in order to introspect your services and software
  • Efficient: Data is stored in an efficient, custom format both in memory and on disk, and you can scale out with sharding and federation, giving you a strong platform from which to issue queries and build data models, ad hoc tables, graphs, and alerts

Also, Prometheus is 100% open source and is (as of July 2018) an incubating project in the CNCF. You can install it with Helm, as we did with other software, or do a manual installation as we’ll detail here. Part of the reason we’re going to look at Prometheus is the overall complexity of the Kubernetes system. With lots of moving parts, many servers, and potentially differing geographic regions, we need a system that can cope with all of that complexity.

A nice part about Prometheus is its pull model, which lets you focus on exposing metrics on your nodes as plain text via HTTP; Prometheus then pulls them back to a central monitoring and logging location. It’s also written in Go and inspired by the closed source Borgmon system, which makes it a perfect match for our Kubernetes cluster. Let’s get started with an install!

Prometheus installation choices

As with previous examples, we’ll need to either use our local Minikube install or the GCP cluster that we’ve spun up. Log in to your cluster of choice, and then let’s get Prometheus set up. There are actually lots of options for installing Prometheus due to the fast-moving nature of the software:

  • The simplest, manual method; if you’d like to build the software from the getting started documents, you can jump in with https://prometheus.io/docs/prometheus/latest/getting_started/ and get Prometheus monitoring itself.
  • The middle ground, with Helm; if you’d like to take the middle road, you can install Prometheus on your cluster with Helm (https://github.com/helm/charts/tree/master/stable/prometheus).
  • The advanced Operator method; if you want to use the latest and greatest, let’s take a look at the Kubernetes Operator class of software, and use it to install Prometheus. The Operator was created by CoreOS, who have recently been acquired by Red Hat. That should mean interesting things for Project Atomic and Container Linux. We’ll talk more about that later, however! We’ll use the Operator model here.

The Operator is designed to build upon the Helm-style management of software in order to build additional human operational knowledge into the installation, maintenance, and recovery of applications. You can think of the Operator software just like an SRE Operator: someone who’s an expert in running a piece of software.

An Operator is an application-specific controller that extends the Kubernetes API in order to manage complex stateful applications such as caches, monitoring systems, and relational or non-relational databases. The Operator uses the API in order to create, configure, and manage these stateful systems on behalf of the user. While Deployments are excellent in dealing with seamless management of stateless web applications, the Deployment object in Kubernetes struggles to orchestrate all of the moving pieces in a stateful application when it comes to scaling, upgrading, recovering from failure, and reconfiguring these systems.

You can read more about extending the Kubernetes API here: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/.
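To make this concrete, an Operator typically starts by registering one or more CustomResourceDefinitions (CRDs), which add new object kinds to the API server for its controller to watch. The following is a rough sketch of the kind of CRD the Prometheus Operator registers; the group and kind mirror what the Operator creates, but treat the manifest as illustrative rather than the exact one it ships:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: prometheuses.monitoring.coreos.com
spec:
  group: monitoring.coreos.com
  version: v1
  scope: Namespaced
  names:
    plural: prometheuses
    singular: prometheus
    kind: Prometheus

Once a CRD like this exists, cluster users can create Prometheus objects with plain kubectl, and the Operator’s control loop takes care of turning them into running, correctly configured workloads.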

Operators leverage some core Kubernetes concepts that we’ve discussed in other chapters. Resources (ReplicaSets) and Controllers (for example, Deployments, Services, and DaemonSets) are leveraged, with additional operational knowledge of the manual steps encoded in the Operator software. For example, when you scale up an etcd cluster manually, one of the key steps in the process is to create a DNS name for the new etcd member that can be used to route to the new member once it’s been added to the cluster. With the Operator pattern, that systematized knowledge is built into the Operator class to provide the cluster administrator with seamless updates to the etcd software.

The difficulty in creating Operators is understanding the underlying functionality of the stateful software in question, and then encoding that into a resource configuration and control loop. Keep in mind that Kubernetes can be thought of as simply being a large distributed messaging queue, with messages that exist in the form of a YAML blob of declarative state that the cluster operator defines, which the Kubernetes system puts into place.

Tips for creating an Operator

If you want to create your own Operator in the future, you can keep the following tips from CoreOS in mind. Given the nature of their application-specific domain, you’ll need to keep a few things in mind when managing complex applications. First, you’ll have a set of system flow activities that your Operator should be able to perform. This will be actions such as creating a user, creating a database, modifying user permissions and passwords, and deleting users (such as the default user installed when creating many systems).

You’ll also need to manage your installation dependencies, which are the items that need to be present and configured for your system to work in the first place. CoreOS also recommends the following principles be followed when creating an Operator:

  • Single step to deploy: Make sure your Operator can be initialized and run with a single command that takes no additional work to get running.
  • New third-party type: Your Operator should leverage the third-party API types, which users will take advantage of when creating applications that use your software.
  • Use the basics: Make sure that your Operator uses the core Kubernetes objects such as ReplicaSets, Services, and StatefulSets, in order to leverage all of the hard work being poured into the open source Kubernetes project.
  • Compatible and default working: Make sure you build your Operators so that they exist in harmony with older versions, and design your system so that it still continues to run unaffected if the Operator is stopped or accidentally deleted from your cluster.
  • Version: Make sure to facilitate the ability to version instances of your Operator, so cluster administrators don’t shy away from updating your software.
  • Test: Also, make sure to test your Operator against a destructive force such as a Chaos Monkey! Your Operator should be able to survive the failure of nodes, pods, storage, configuration, and networking outages.

Installing Prometheus

Let’s run through an install of Prometheus using the new pattern that we’ve discovered. We’ll use Helm to install the Operator, and then use the Operator to create the Prometheus deployment from its definition.

Make sure you have Helm installed, and then make sure you’ve initialized it:

$ helm init
Creating /root/.helm
...
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /root/.helm.
...
Happy Helming!
$

Next, we can install the various Operator packages required for this demo:

$ helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
"coreos" has been added to your repositories

Now, install the Operator:

$ helm install coreos/prometheus-operator --name prometheus-operator

You can see that it’s installed and running by first checking the installation:

$ helm ls prometheus-operator
NAME                 REVISION  UPDATED                   STATUS    CHART                       NAMESPACE
prometheus-operator  1         Mon Jul 23 02:10:18 2018  DEPLOYED  prometheus-operator-0.0.28  default

Then, look at the pods:

$ kubectl get pods
NAME                                READY  STATUS   RESTARTS  AGE
prometheus-operator-d75587d6-bmmvx  1/1    Running  0         2m

Now, we can install kube-prometheus to get all of our dependencies up and running:

$ helm install coreos/kube-prometheus --name kube-prometheus --set global.rbacEnable=true
NAME:   kube-prometheus
LAST DEPLOYED: Mon Jul 23 02:15:59 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Alertmanager
NAME             AGE
kube-prometheus  1s

==> v1/Pod(related)
NAME                                                  READY STATUS RESTARTS AGE
kube-prometheus-exporter-node-45rwl                   0/1 ContainerCreating 0 1s
kube-prometheus-exporter-node-d84mp                   0/1 ContainerCreating 0 1s
kube-prometheus-exporter-kube-state-844bb6f589-z58b6  0/2 ContainerCreating 0 1s
kube-prometheus-grafana-57d5b4d79f-mgqw5              0/2 ContainerCreating 0 1s

==> v1beta1/ClusterRoleBinding
NAME                                     AGE
psp-kube-prometheus-alertmanager         1s
kube-prometheus-exporter-kube-state      1s
psp-kube-prometheus-exporter-kube-state  1s
psp-kube-prometheus-exporter-node        1s
psp-kube-prometheus-grafana              1s
kube-prometheus                          1s
psp-kube-prometheus                      1s
…

We’ve truncated the output here as there’s a lot of information. Let’s look at the pods again:

$ kubectl get pods
NAME                                                   READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-0                         2/2 Running 0 3m
kube-prometheus-exporter-kube-state-85975c8577-vfl6t   2/2 Running 0 2m
kube-prometheus-exporter-node-45rwl                    1/1 Running 0 3m
kube-prometheus-exporter-node-d84mp                    1/1 Running 0 3m
kube-prometheus-grafana-57d5b4d79f-mgqw5               2/2 Running 0 3m
prometheus-kube-prometheus-0                           3/3 Running 1 3m
prometheus-operator-d75587d6-bmmvx                     1/1 Running 0 8m

Nicely done!
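To see the Operator pattern in action, take a look at the custom resources the chart created; for example, kubectl get servicemonitors lists the ServiceMonitor objects that tell Prometheus which Services to scrape. A minimal ServiceMonitor of your own might look like the following sketch, where the app name and port are placeholders for one of your services:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: web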

If you forward port 9090 of the prometheus-kube-prometheus-0 pod to a local port such as 8449, you should be able to see the Prometheus dashboard, which we’ll revisit in later chapters as we explore high availability and getting your Kubernetes cluster ready for production. You can check this out at http://localhost:8449/alerts.
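The pod name comes from the kubectl get pods output shown previously, and Prometheus listens on port 9090 inside the pod, so a port-forward along these lines should do the trick:

$ kubectl port-forward prometheus-kube-prometheus-0 8449:9090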
