Ubuntu Server 18.04 – Understanding load average


Before we close out this chapter, a very important topic to understand when monitoring processes and performance is load average, which is a series of numbers that represents your server’s trend in CPU utilization over a given time. You’ve probably already seen these series of numbers before, as there are several places in which the load average appears. If you run the htop or top command, the load average is shown within the output of each. In addition, if you execute the uptime command, you can see the load average in the output of that command as well. You can also view your load average by viewing the text file that stores it in the first place:

cat /proc/loadavg 
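The same three numbers can be read programmatically. As a minimal sketch, Python's standard library exposes them through `os.getloadavg()`, which returns the same values shown by `uptime` and stored in `/proc/loadavg`:

```python
import os

# os.getloadavg() returns the 1-, 5-, and 15-minute load averages
# as a tuple of three floats (Unix-like systems only).
one, five, fifteen = os.getloadavg()
print(f"Load average: {one:.2f}, {five:.2f}, {fifteen:.2f}")
```

This is handy when writing your own monitoring scripts rather than parsing the output of `uptime` by hand.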

Personally, I habitually use the uptime command to view the load average. This command not only gives me the load average, but also tells me how long the server has been running.

The load average is broken down into three sections, each representing 1 minute, 5 minutes, and 15 minutes respectively. A typical load average may look something like the following:

0.36, 0.29, 0.31 

In this example, we have a load average of 0.36 in the 1-minute section, 0.29 in the 5-minute section, and 0.31 in the 15-minute section. Each number represents the average number of tasks that were running on or waiting for the CPU during that time period. Therefore, these numbers are really good. The server isn’t that busy, since virtually no task is waiting on the CPU at any one moment. This is contrary to something such as overall CPU percentages, which you may have seen in task managers on other platforms. While viewing your CPU usage percentage can be useful, the problem is that your CPUs will constantly swing from a high percentage of usage to a low one, which you can see for yourself by just running htop for a while. When a task does some sort of processing, you might see your cores shoot up to 100 percent and then right back down to a lower number. That really doesn’t tell you much, though. With load averages, you’re seeing the trend of usage over three given time frames, which is more accurate in determining whether your server’s CPUs are running efficiently or are choking on a workload they just can’t handle.

The main question, though, is when you should be worried, which really depends on what kind of CPUs are installed on your server. Your server will have one or more CPUs, each with one or more cores. To Linux, each of these cores, whether physical or virtual, is the same thing (a CPU). In my case, the machine I took the earlier output from has a CPU with four cores. The more CPUs your server has, the more tasks it’s able to handle at any given time and the more flexibility you have with the load average.

When a load average for a particular time period is equal to the number of CPUs on the system, your server is at capacity: it’s handling exactly as many tasks as it has CPUs to run them on. If your load average is consistently higher than the number of cores you have available, that’s when you’d probably want to look into the situation. It’s fine for your server to be at capacity every now and then, but if it always is, that’s a cause for alarm.
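This comparison is simple to automate. Below is a hedged sketch of a capacity check using only the standard library; the idea is that a sustained (15-minute) load above the CPU count matters more than a momentary (1-minute) spike:

```python
import os

cpus = os.cpu_count()  # the number of cores Linux treats as CPUs
one, five, fifteen = os.getloadavg()

# A sustained 15-minute load above the CPU count suggests the server
# is consistently over capacity; a brief 1-minute spike is less alarming.
if fifteen > cpus:
    print(f"Sustained load {fifteen:.2f} exceeds {cpus} CPUs -- investigate")
elif one > cpus:
    print(f"Momentary spike: 1-minute load {one:.2f} above {cpus} CPUs")
else:
    print(f"Load {one:.2f}, {five:.2f}, {fifteen:.2f} is within capacity ({cpus} CPUs)")
```

The thresholds here are the bare minimum; in practice you might alert at some fraction of the CPU count (say, 0.7 × cpus) to leave headroom.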

I’d hate to use a cliché example in order to fully illustrate this concept, but I can’t resist, so here goes. A load average on a Linux server is equivalent to the check-out area at a supermarket. A supermarket will have several registers open, where customers can pay to finalize their purchases and move along. Each cashier is only able to handle one customer at a time. If there are more customers waiting to check out than there are cashiers, the lines will start to back up and customers will get frustrated. In a situation where there are four cashiers and four customers being helped at a time, the cashiers would be at capacity, which is not really a big deal since no one is waiting. What can add to this problem is a customer who is paying by check and/or using a few dozen coupons, which makes the checkout process much longer (similar to a resource-intensive process).

Just like the cashiers, a CPU can only handle one task at a time, with some tasks hogging the CPU longer than others. If there are exactly as many tasks as there are CPUs, there’s no cause for concern. But if the lines start to back up, we may want to investigate what is taking so long. To take action, we may hire an additional cashier (add a new CPU) or ask a disgruntled customer to leave (kill a process).

Let’s take a look at another example load average:

1.87, 1.53, 1.22 

In this situation, we shouldn’t be concerned, because this server has four CPUs, and the load hasn’t approached that capacity within the 1-, 5-, or 15-minute time periods. Even though the load is consistently higher than 1, we have CPU resources to spare, so it’s no big deal. Going back to our supermarket example, this is equivalent to having four cashiers with an average of almost two customers being assisted at any given moment. If this server only had one CPU, we would probably want to figure out what’s causing the line to back up.

It’s normal for a server to always have a workload (so long as it’s lower than the number of CPUs available), since that just means that our server is being utilized, which is the whole point of having a server to begin with (servers exist to do work). While typically, the lower the load average the better, depending on the context, it might actually be a cause for alarm if the load is too low. If your server’s load average drops to an average of zero-something, that might mean that a service that would normally be running all the time has failed and exited. For example, if you have a database server that constantly has a load within the 1x range that suddenly drops to 0x, that might mean that you either have legitimately less traffic or the database server service is no longer running. This is why it’s always a good idea to develop baselines for your server, in order to gauge what is normal and what isn’t.
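A too-low load can be checked for just as easily as a too-high one. The sketch below assumes a hypothetical baseline of 0.5, standing in for whatever "normal" looks like on your own server; the real figure has to come from observing your workload over time:

```python
import os

# Hypothetical baseline: a server that normally sits in the "1.x" range
# should rarely see a 15-minute load this low. Derive the real value
# from your own monitoring history.
BASELINE_LOW = 0.5

one, five, fifteen = os.getloadavg()
if fifteen < BASELINE_LOW:
    print(f"15-minute load {fifteen:.2f} is unusually low -- "
          "check that expected services are still running")
else:
    print(f"15-minute load {fifteen:.2f} looks normal for this baseline")
```

A check like this won’t tell you *what* failed, only that the server is doing suspiciously little work, which is exactly the signal the baseline is meant to catch.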

Overall, load averages are something you’ll become very familiar with as a Linux administrator, if you haven’t already. As a rolling measure of how heavily utilized your server is, the load average will help you understand when your server is running efficiently and when it’s having trouble. If a server is having trouble keeping up with the workload you’ve given it, it may be time to consider increasing the number of cores (if you can) or scaling out the workload to additional servers. Whether you’re troubleshooting utilization, planning for upgrades, or designing a cluster, the process always starts with understanding your server’s load average so you can plan your infrastructure to run efficiently for its designated purpose.
