Ubuntu Server 18.04 – Troubleshooting resource issues


I don’t know about others, but it seems that a majority of my time troubleshooting servers usually comes down to pinpointing resource issues. By resources, I’m referring to CPU, memory, disk, input/output, and so on. Generally, issues come down to a user storing too many large files, a process going haywire that consumes a large amount of CPU, or a server running out of memory. In this section, we’ll go through some of the common things you’re likely to run into while administering Ubuntu servers.

First, let’s revisit topics related to storage. In Chapter 3, Managing Storage Volumes, we already went over these concepts, and many of them apply to troubleshooting as well. I won’t spend too much time on them here, but a refresher is worthwhile. Whenever users complain about being unable to write new files to the server, the following two commands are the first you should run. You are probably already well aware of these, but they’re worth repeating:

    df -h
    df -i

The first df command variation tells you how much space is used on each mounted filesystem, in a human-readable format (the -h option) that prints sizes in megabytes and gigabytes. The -i option in the second command reports used and available inodes. You should run this as well, because a Linux system can refuse to create new files even when there’s plenty of free space: if no inodes remain, the filesystem is effectively full, yet the first command won’t show the usage as 100 percent. Usually, the number of inodes a filesystem has available is extremely generous, and the limit is hard to hit. However, if a service is creating new log files over and over every second, or a mail daemon grows out of control and generates a huge backlog of undelivered mail, you’d be surprised how quickly the supply of inodes can run out.
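Both checks can be combined into a single pass. The following is only a sketch: the over_threshold function name and the 90 percent cutoff are assumptions chosen for illustration, and the -P option is used to get df’s single-line POSIX output, which is easier to parse than the default layout.

```shell
#!/bin/sh
# over_threshold LIMIT: read `df -P`-style output on stdin and print the
# mount point and usage of every filesystem at or above LIMIT percent.
# Assumption: mount points contain no spaces (column 6 is the mount point).
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5                      # column 5 is Capacity (or IUse% with -i)
        sub(/%/, "", use)
        # Skip filesystems that report "-" instead of a percentage.
        if (use ~ /^[0-9]+$/ && use + 0 >= limit) print $6, use "%"
    }'
}

echo "Space at or above 90%:"
df -P | over_threshold 90

echo "Inodes at or above 90%:"
df -P -i | over_threshold 90
```

If either list is non-empty, that mount point is the one to investigate first.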

Of course, once you figure out that you have an issue with full storage, the next logical question becomes, what is eating up all my free space? The df commands will give you a list of storage volumes and their sizes, which will tell you at least which disk or partition to focus your attention on. My favorite command for pinpointing storage hogs, as I mentioned in Chapter 3, Managing Storage Volumes, is the ncdu command. While not installed by default, ncdu is a wonderful utility for checking to see where your storage is being consumed the most. If run with no arguments, ncdu scans the current working directory. I recommend giving it a specific directory as a starting point and adding the -x option. For example, if the /home partition is full on your server, you might want to run the following to find out which directory is using the most space:

    sudo ncdu -x /home  

The -x option will cause ncdu to not cross filesystems. This means if you have another disk mounted within the folder you’re scanning, it won’t touch it. With -x, ncdu is only concerned with the target you give it.
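When df -i rather than df -h is the one reporting trouble, a count of filesystem entries is more useful than a size. The following is a rough sketch, assuming GNU find; count_entries is a hypothetical helper name, and filenames containing newlines would be miscounted.

```shell
#!/bin/sh
# count_entries DIR: for each immediate (non-hidden) subdirectory of DIR,
# print how many files and directories it contains, itself included,
# largest first. Each entry consumes an inode.
count_entries() {
    for sub in "$1"/*/; do
        [ -d "$sub" ] || continue
        # -xdev keeps find on the same filesystem, like ncdu's -x.
        printf '%s %s\n' "$(find "$sub" -xdev | wc -l)" "$sub"
    done | sort -rn
}
```

For example, `count_entries /var` would show which directory under /var holds the most entries; you can then repeat the command on that directory to drill down.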

If you aren’t able to utilize ncdu, there’s also the du command, which takes some extra work. The du -h command, for example, will report the usage of your current working directory and every directory beneath it, with human-readable numbers. Unlike ncdu, it doesn’t sort its output or let you browse it, so on a large tree you’d need to dig through the results manually to find the directory that’s holding the most data. A very useful variation of the du command, nicknamed ducks, is the following. It summarizes each file and directory in your current working directory and prints the largest ones first, up to 15 lines including the grand total that the -c option adds:

    du -cksh * | sort -hr | head -n 15
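One limitation of the ducks one-liner is that the * glob skips hidden files and directories. A possible variation, sketched here under the assumption that GNU du is available (biggest is a hypothetical function name), lets du enumerate the entries itself and stays on one filesystem:

```shell
#!/bin/sh
# biggest [DIR] [COUNT]: list the COUNT largest entries directly under DIR
# (default: current directory, 15 lines), largest first.
#   -x             stay on DIR's filesystem
#   -a             include files, not only directories
#   --max-depth=1  report only DIR itself and its immediate children
biggest() {
    dir=${1:-.}
    count=${2:-15}
    du -xah --max-depth=1 -- "$dir" 2>/dev/null | sort -hr | head -n "$count"
}
```

Running `biggest /home 10` prints a line for /home itself (the total) alongside its largest children, hidden entries included.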

Another problem that can arise with storage volumes is filesystem integrity. Most of the time, this only seems to come up after a power problem, such as a server powering off unexpectedly. Depending on the server and the formatting you’ve used when setting up your storage volumes (and several other factors), power issues are handled differently from one installation to another. In most cases, a filesystem check (fsck) will happen automatically during the next boot. If it doesn’t, and you’re having odd issues with storage that can’t be explained otherwise, a manual filesystem check is recommended. Scheduling a filesystem check is actually very easy:

    sudo touch /forcefsck  

The previous command will create an empty file, forcefsck, at the root of the filesystem. When the server reboots and it sees this file, it will trigger a filesystem check on that volume and then remove the file. On releases that boot with systemd, the documented way to request the same check is to pass fsck.mode=force on the kernel command line. If you’d like to check a filesystem other than the root volume, you can create the forcefsck file elsewhere. For example, if your server has a separate /home partition, you could create the file there instead to check that volume:

    sudo touch /home/forcefsck  

The filesystem check will usually complete fairly quickly, unless there’s an issue it needs to fix. Depending on the nature of the problem, the repair could be quick or it could take a while. I’ve seen some really bad integrity issues that have taken over four hours to fix, but I’ve seen others fixed in a matter of seconds. Sometimes the check finishes so quickly that it scrolls by during boot before you can see it. In the case of a large volume, you may want to schedule the check for off-hours in case the scan takes a long time.

With regard to memory issues, the free -m command will give you an overview of how much memory and swap is available on your server. It won’t tell you what exactly is using up all your memory, but you’ll use it to see if you’re in jeopardy of running out. The free column in the output shows how much memory is completely unused, while the available column estimates how much can be given to new applications once reclaimable cache is taken into account. The latter is usually the better number to base a decision on when to take action:

Output of the free -m command
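If you’d rather have a single number to watch or to feed into a script, the output of free can be reduced to a percentage. This is a minimal sketch, assuming the column layout printed by the procps version of free (total in column 2 of the Mem: row, available in column 7); mem_available_pct is a hypothetical function name.

```shell
#!/bin/sh
# mem_available_pct: read `free -m` output on stdin and print available
# memory as a whole-number percentage of total memory.
mem_available_pct() {
    awk '/^Mem:/ && $2 > 0 { printf "%d\n", $7 * 100 / $2 }'
}

# LC_ALL=C keeps the row label as "Mem:" regardless of the system language.
LC_ALL=C free -m | mem_available_pct
```

What counts as too low depends on the workload, so any alert threshold you attach to this number is a judgment call.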

In Chapter 6, Controlling and Monitoring Processes, we took a look at the htop command, which helps us answer the question of “what” is using up our resources. Using htop (once installed), you can sort the list of processes by CPU or memory usage by pressing F6, and then selecting a new sort field, such as PERCENT_CPU or PERCENT_MEM. This will give you an idea of what is consuming resources on your server, allowing you to make a decision on what to do about it. The action you take will differ from one process to another, and your solution may range from adding more memory to the server to tuning the application to have a lower memory ceiling. But what do you do when the results from htop don’t correlate to the usage you’re seeing? For example, what if your load average is high, but no process seems to be consuming a large portion of CPU?
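When htop isn’t installed, or you want the same ranking in a script or over a slow connection, ps can provide a non-interactive approximation. The sketch below assumes the procps version of ps found on Ubuntu; rank_by is a hypothetical helper that sorts whatever it reads on a chosen numeric column.

```shell
#!/bin/sh
# rank_by COLUMN [COUNT]: sort stdin numerically, descending, on COLUMN
# and keep the first COUNT lines (default 10).
rank_by() {
    LC_ALL=C sort -k "$1,$1" -rn | head -n "${2:-10}"
}

# Columns printed by ps below: 1=PID 2=%CPU 3=%MEM 4=COMMAND
# (a trailing "=" suppresses each column's header line).
ps -e -o pid= -o pcpu= -o pmem= -o comm= | rank_by 2 5
```

Changing `rank_by 2 5` to `rank_by 3 5` ranks the same list by memory instead of CPU.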

One command I haven’t discussed so far is iotop. While not installed by default, the iotop utility is definitely a must-have, so I recommend you install the iotop package. The iotop utility itself needs to be run as root or with sudo:

    sudo iotop  

The iotop command will allow you to see how much data is being written to or read from your disks. Input/output definitely contributes to a system’s load, and not all resource monitoring utilities will show this usage. If you see a high load average but nothing in your resource monitor accounts for it, check the I/O. The iotop utility is a great way to do that: if data is bottlenecked while being written to disk, the resulting I/O overhead can slow other processes down. If nothing else, it will give you an idea of which process is misbehaving, in case you need to kill it:

The iotop utility running on an Ubuntu Server

The iotop window will refresh on its own, sorting processes by the column that is highlighted. To change the highlighted column, press the left and right arrow keys. You can sort processes by columns such as IO, SWAPIN, DISK WRITE, DISK READ, and others. When you’re finished with the application, press q to quit.
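On Linux, the load average counts tasks in uninterruptible sleep (usually waiting on disk I/O) as well as tasks that are runnable, which is why load can be high while CPU usage looks low. As a quick cross-check alongside iotop, you can list those tasks with ps. This is only a sketch, assuming the procps version of ps; blocked_tasks is a hypothetical function name.

```shell
#!/bin/sh
# blocked_tasks: read "STATE PID COMMAND" lines on stdin and print the PID
# and command of every task whose state starts with D (uninterruptible sleep).
blocked_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Separate -o options are used so that each "=" only blanks its own header.
ps -e -o state= -o pid= -o comm= | blocked_tasks
```

The list is a snapshot, so an empty result on one run doesn’t rule out an I/O problem; run it a few times while the load is high.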

The utilities we looked at in this section are very useful when identifying bottlenecked resources. What you do to correct the situation after you find the culprit will depend on the daemon. Perhaps there’s an invalid configuration, or the daemon has encountered a fault and needs to be restarted. Often, checking the logs may lead you to an answer as to why a daemon misbehaves. In the case of full storage, almost nothing beats ncdu, which will almost always lead you directly to the problem. Tools such as htop and iotop allow you to view additional information regarding resource usage as well, and htop even allows you to kill a misbehaving process right from within the application, by pressing F9.
