Ubuntu Server 18.04 – Conducting a root cause analysis

Once you solve a problem on your server or network, you’ll immediately revel in the awesomeness of your troubleshooting skills. It’s a wonderful feeling to have fixed an issue, becoming the hero within your technology department. But you’re not done yet. The next step is preventing the problem from happening again. It’s important to look at how the problem started, as well as the steps you can take to keep it from recurring. This is known as a root cause analysis. A root cause analysis may be a report you file with your manager or within your knowledge-base system, or it could just be a memo you document for yourself. Either way, it’s an important learning opportunity.

A good root cause analysis has several components. First, it will describe the events that led to the problem occurring in the first place. Then, it will list the steps you’ve completed to correct the problem. If the problem is something that could potentially recur, you would also want to include information about how to prevent it from happening again in the future.

The problem with a root cause analysis is that it’s rare that you can be 100 percent accurate. Sometimes, the root cause may be obvious. For example, suppose a user named Bob deleted an entire directory that contained files important to your company. If you log into the server and check the logs, you can see that Bob not only logged into the server near the time of the incident, but his bash history literally shows him running the rm -rf /work/important-files command. At this point, the case is closed. You’ve figured out how the problem happened and who did it, and you can restore the files from your most recent backup. But a root cause is usually not that cut and dried.
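As a quick illustration, the following commands are a minimal sketch of that kind of log review on Ubuntu Server; the bob account, the log path, and the /work/important-files directory are simply the hypothetical details from the example above.

```bash
# A minimal sketch of tracing the example above on Ubuntu Server; the "bob"
# account and /work/important-files path are hypothetical.
last bob                                      # recent logins for the user, with timestamps
sudo grep 'bob' /var/log/auth.log             # authentication events around the incident
sudo grep 'rm -rf' /home/bob/.bash_history    # shell history showing the destructive command
```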

One example I’ve personally encountered was a pair of virtual machine servers that were “fencing.” At a company I once worked for, our Citrix-based virtual machine servers (which were part of a cluster) both went down at the same time, taking every Linux VM down with them. When I attached a monitor to them, I could see them both rebooting over and over. After I got the servers to settle down, I started to investigate more deeply. I read in the documentation for Citrix XenServer that you should never set up a cluster of fewer than three machines, because doing so can create a situation exactly like the one I experienced. We only had two servers in that cluster, so I concluded that the servers were set up improperly and the company would need a third server if it wanted to cluster them.

The problem, though, is that this example root cause analysis wasn’t 100 percent certain. Were the servers having issues because they needed a third server? The documentation did mention that three servers were the minimum, but there’s no way to know for sure that was the reason the problem started. Not only was I not watching the servers when the problem happened, I also wasn’t the individual who set them up, and that person had already left the company. There was no way I could reach a 100 percent conclusion, but my root cause analysis was sound in the sense that it was the most likely explanation (we weren’t following best practices). Someone could counter my root cause analysis with the question “but the servers were running fine that way for several years.” True, but nothing is absolute when dealing with technology. Sometimes, you never really know. The only thing you can do is make sure everything is set up properly according to the guidelines set forth by the manufacturer.

A good root cause analysis is as sound in logic as it can be, though not necessarily bulletproof. Correlating system events to symptoms is often a good first step, but is not necessarily perfect. After investigating the symptoms, solving the issue, and documenting what you’ve done to rectify it, sometimes the root cause analysis writes itself. Other times, you’ll need to read documentation and ensure that the configuration of the server or daemon that failed was implemented along with best practices. In a worst-case scenario, you won’t really know how the problem happened or how to prevent it, but it should still be documented in case other details come to light later. And without documentation, you’ll never gain anything from the situation.
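For the event-correlation step, commands along the following lines can help line up logged events with the time a symptom was first noticed. This is only a hedged sketch; the time window and the nginx.service unit are placeholders for whatever is relevant to your own incident.

```bash
# A sketch of correlating logged events with the time a symptom appeared;
# the time window and nginx.service unit are placeholders.
journalctl --since "2024-01-15 09:00" --until "2024-01-15 09:30"   # everything logged in the window
journalctl -u nginx.service --since "2024-01-15 09:00"             # entries for one suspect service
dmesg --ctime | grep -iE 'error|oom'                               # kernel-level faults with readable timestamps
```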

A root cause analysis should include details such as the following:

  • A description of the issue
  • Which application or piece of hardware encountered a fault
  • The date and time the issue was first noticed
  • What you found while investigating the issue
  • What you’ve done to resolve the issue
  • What events, configurations, or faults caused the issue to happen

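If it helps to have a starting point, the following is one possible skeleton for such a write-up, saved as a plain text file. The filename and field labels are only suggestions; adapt them to your own knowledge-base conventions.

```bash
# A possible skeleton for a root cause analysis write-up; the filename and
# field labels are only suggestions.
cat << 'EOF' > ~/rca-example-outage.txt
Issue:            <one-line description of the problem>
Affected system:  <application or piece of hardware that faulted>
First noticed:    <date and time the issue was first noticed>
Findings:         <what was found while investigating>
Resolution:       <what was done to resolve the issue>
Root cause:       <events, configurations, or faults that caused the issue>
Prevention:       <steps taken to keep it from happening again>
EOF
```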
A root cause analysis should be used as a learning experience. Depending on what the issue was, it may serve as an example of what not to do, or what to do better. In the case of my virtual machine server fiasco, the moral of the story was to follow Citrix’s best practices and use three servers for the cluster instead of two. Other times, the conclusion may be that another technician didn’t follow proper directives or made a mistake, which is unfortunate. If the issue were to happen again in the future, you’ll be able to look back and remember exactly what happened last time and what you did to fix it. This is valuable, if only because we’re all human and prone to forgetting important details after a time. In an organization, a root cause analysis is also valuable for showing stakeholders that you’re able not only to address a problem, but also to reasonably prevent it from happening again.
