Ubuntu Server 18.04 – Evaluating the problem space

After you identify the symptoms of the issue, the first goal in troubleshooting is to identify the problem space. Essentially, this means determining (as best you can) where the problem is most likely to reside, and how many systems and services are affected. Sometimes the problem space is obvious. For example, if none of your computers are receiving an IP address from your DHCP server, then you’ll know straight away to start with the logs on that server and find out why it isn’t doing its job. In other cases, the problem space may not be so obvious. Perhaps you have an application that exhibits problems every now and then, but isn’t something you can reliably reproduce. In that case, it may take some digging before you know just how large the scope of the problem might be. Sometimes, the culprit is the last thing you expect.

Each component on your network works together with other components, or at least that’s how it should be. A network of Linux servers, just as with any other network, is a collection of services (daemons) that complement and often depend upon one another. For example, DHCP assigns IP addresses to all of your hosts, but it also assigns their default DNS servers. If your DNS server has encountered an issue, then your DHCP server would essentially be assigning a non-working DNS server to your clients. Identifying the problem space means that after you identify the symptoms, you’ll also work toward reaching an understanding of how each component within your network contributes to, or is affected by, the problem. This will also help you identify the scope.

With regard to the scope, we identify how far the problem reaches, as well as how many users or systems are affected by the issue. Perhaps just one user is affected, or an entire subnet. This will help you determine the priority of the issue and decide whether this is something essential that you need to fix now, or something that can wait until later. Often, prioritizing is half the battle, since each of your users will be under the impression that their issues are more important than anyone else’s.

When identifying the problem space, as well as the scope, you’ll want to answer the following questions as best as you can:

  • What are the symptoms of the issue?
  • When did this problem first occur?
  • Were any changes made to the network around that same time?
  • Has this problem happened before? If so, what was done to fix it the last time?
  • Which servers or nodes are impacted by this issue?
  • How many users are impacted?

If the problem is limited to a single machine, then two really good places to start poking around are who is logged in to the server and which commands have recently been entered. Quite often, I’ve found the culprit just by checking the Bash history for logged-in users (or users that have recently logged in). Each user account should have a .bash_history file in its home directory, which lists the commands that user recently entered. Check this file and see if anyone modified anything recently. I can’t tell you how many times this alone has led directly to the answer. And what’s even better, sometimes the Bash history leads to the solution. If a problem has occurred before and someone has already fixed it, chances are their efforts were recorded in the Bash history, so you can see what the previous person did to solve the problem just by looking at it. To view the Bash history, you can either view the contents of the .bash_history file in a user’s home directory, or you can simply execute the history command as that user.
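As a rough sketch of the history check described above, the helper below prints the last few entries of a given history file. The function name show_recent_history and the default of 20 lines are my own choices for illustration, not anything standard. Keep in mind that, with default settings, Bash writes to .bash_history when a session exits, so commands from a session that is still open may not appear in the file yet.

```shell
# Hypothetical helper: print the most recent entries of a Bash history file.
# Usage: show_recent_history FILE [COUNT]   (COUNT defaults to 20)
show_recent_history() {
    local file="$1" count="${2:-20}"
    if [ -r "$file" ]; then
        echo "== $file =="
        # The newest commands are at the end of the file
        tail -n "$count" "$file"
    else
        # Other users' history files are normally unreadable without root
        echo "== $file: missing or not readable ==" >&2
        return 1
    fi
}
```

From a root shell, something like `for f in /home/*/.bash_history; do show_recent_history "$f" 10; done` would walk every account that keeps its home directory under /home, which is an assumption about your layout.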

Additionally, if you check who is currently logged in to the server, you may be able to pinpoint whether someone is already working on the issue, or whether something they’re doing caused it in the first place. If you enter the w command, you can see who is currently logged in, along with the IP address each user is connecting from. Therefore, if you don’t know who corresponds to a user account listed in the output, you can look up the IP address on your DHCP server to find out who it belongs to, and ask that person directly. In a perfect world, other administrators will send out a departmental email when they work on something to make sure everyone is aware. Unfortunately, many don’t do this. By checking the logged-in users as well as their Bash history, you’re well on your way to determining where the problem originated.
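If several people are logged in, it can help to boil the output of w down to just the account names and where they are connecting from. The following is a minimal sketch; the function name logged_in_sources is hypothetical, and it assumes the usual column order of w (USER, TTY, FROM, and so on), where the third column holds the remote address. Local console sessions typically show a placeholder in that column rather than an IP address.

```shell
# Hypothetical helper: reduce `w -h` output (read from stdin) to unique
# "user source" pairs. Assumes the columns are USER TTY FROM LOGIN@ ...
logged_in_sources() {
    awk '{ print $1, $3 }' | sort -u
}

# Typical use (-h tells w to leave out its header lines):
#   w -h | logged_in_sources
```

Each line of the result is an account name followed by the address to look up on your DHCP server.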

After identifying the problem space and the scope, you can begin narrowing down the issue to help find a cause. Sometimes, the culprit will be obvious. If a website stopped working and you noticed that the Apache configuration on your web server was changed recently, you can attack the problem by investigating the change and who made it. If the problem is a network issue, such as users not being able to visit websites, the potential problem space is much larger. Your internet gateway may be malfunctioning, your DNS or DHCP server may be down, your internet provider could be having issues, or perhaps your accounting department simply forgot to pay the internet bill. As long as you are able to determine a potential list of targets to focus your troubleshooting on, you’re well on your way to finding the issue. As we go through this chapter, I’ll talk about some common issues that can come up and how to deal with them.
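For the "users can’t visit websites" scenario, one way to work through that list of targets is to test each layer in turn, starting close to the affected machine and moving outward. The sketch below is only an illustration of that idea: the function names check and network_triage are made up, and the gateway address, DNS server address, and hostname you pass in are whatever applies to your own network.

```shell
# Hypothetical helper: run a command quietly and report whether it succeeded.
# Usage: check "label" command [args...]
check() {
    local label="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "OK    $label"
    else
        echo "FAIL  $label"
        return 1
    fi
}

# Hypothetical triage: test the gateway, then the DNS server, then name
# resolution. Every check runs even if an earlier one fails.
# Usage: network_triage GATEWAY_IP DNS_SERVER_IP HOSTNAME
network_triage() {
    local gateway="$1" dns_server="$2" test_name="$3" failed=0
    check "gateway $gateway answers ping" ping -c 1 -W 2 "$gateway" || failed=1
    check "DNS server $dns_server answers ping" ping -c 1 -W 2 "$dns_server" || failed=1
    check "name $test_name resolves" getent hosts "$test_name" || failed=1
    return "$failed"
}
```

A call might look like `network_triage 192.168.1.1 192.168.1.2 ubuntu.com`, with the addresses replaced by your own (`ip route show default` prints the gateway currently in use). The first FAIL line suggests where to focus, though a host that ignores ping is not necessarily down.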
