Ubuntu Server 18.04 – Diagnosing defective RAM

All server and computing components can and will fail eventually, but there are a few pieces of hardware that seem to fail more often than others. Fans, power supplies, and hard disks definitely make the list of common things administrators will end up replacing, but defective memory is also a situation I’m sure you’ll run into eventually.

Although defective memory is something you’ll likely encounter, I made it the last section in this chapter because I can’t give you a definite list of symptoms that point to memory as the source of an issue. RAM issues are mysterious in nature, and each time I’ve run into one, I’ve discovered the bad memory only after troubleshooting everything else. For this reason, nowadays I’ll often test the memory on a server or workstation first, since it’s very easy to do. Even if memory has nothing to do with the issue, it’s worth checking anyway, since it could become a problem later.

Most distributions of Linux (Ubuntu included) feature Memtest86+ right on the installation media. Whether you create a bootable CD or flash drive, there’s a memory test option available from the Ubuntu Server media. When you first boot from the Ubuntu Server media, you’ll see an icon toward the bottom indicating you can press a key to bring up a menu (if you don’t press a key, the installer will automatically start). Next, you’ll be asked to choose your language, and then you’ll be shown an installation menu. Among the choices there will be an option to Test memory:

The main menu of the Ubuntu installer, showing a memory test option

Other editions of Ubuntu, such as the Ubuntu desktop distribution or any of its derivatives, also feature an option to test memory. Even if you don’t have installation media handy for the server edition, you can use whichever version you have. From one distribution or edition to another, the Memtest86+ program doesn’t change.

When you choose the Test memory option from your installation media, Memtest86+ will immediately start testing your memory (press Esc to exit the test). The test can take anywhere from minutes to hours, depending on how much memory your workstation or server has installed. Generally speaking, when your machine has defective RAM, errors show up relatively quickly, usually within the first 5-10 minutes. In my experience, every time I’ve run into defective memory, I’ve seen errors in 15 minutes or less (usually within 5), so if you don’t see errors within 15 minutes, you’re most likely in good shape. Theoretically, though, a subtle problem with your memory modules may not show up until later, so you should let the test complete at least one full pass if you can spare the time:

Memtest86+ in action

The main question becomes when to run Memtest86+ on a machine. In my experience, symptoms of bad memory are almost never the same from one machine to another. Usually, you’ll run into a situation where a server doesn’t boot properly, applications close unexpectedly, applications don’t start at all, or perhaps an application behaves erratically. In my view, you should test memory whenever you experience a problem that doesn’t seem straightforward. In addition, you may want to consider testing the memory on your server before you roll it out to production. That way, you can ensure that it starts out as free of hardware issues as possible. If you install new memory modules, make sure to test the RAM right away.

If the test does report errors, you’ll next want to find out which memory module is faulty. This can be difficult, as some servers can have more than a dozen memory modules installed. To narrow it down, you’ll want to test each memory module independently if you can (for example, by running the test with only that module installed), until you find out which one is defective. You should also continue to test the other modules, even after you discover the culprit. Multiple memory modules going bad isn’t outside the realm of possibility, since whatever led to the first module becoming defective may have affected others.
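
Before pulling modules one at a time, it can help to have a checklist of which slots are actually populated. The following is a minimal sketch, not a definitive tool: it assumes the text comes from `dmidecode --type memory` (run as root) and that each module appears in a "Memory Device" block with `Size:` and `Locator:` fields. The function name `populated_slots` and the sample text are hypothetical, and slot names vary by vendor.

```python
import re

# Illustrative sample only: real `dmidecode --type memory` output has many
# more fields per device, and locator names differ between vendors.
SAMPLE_OUTPUT = """\
Memory Device
\tSize: 8192 MB
\tLocator: DIMM_A1
\tBank Locator: BANK 0

Memory Device
\tSize: No Module Installed
\tLocator: DIMM_A2
\tBank Locator: BANK 1

Memory Device
\tSize: 8192 MB
\tLocator: DIMM_B1
\tBank Locator: BANK 2
"""


def populated_slots(dmidecode_text):
    """Return (locator, size) pairs for slots that report an installed module."""
    slots = []
    # Each module is described in its own "Memory Device" block; the text
    # before the first block (index 0) is skipped.
    blocks = re.split(r"^Memory Device[ \t]*$", dmidecode_text, flags=re.MULTILINE)[1:]
    for block in blocks:
        size = re.search(r"^[ \t]*Size:[ \t]*(.+)$", block, flags=re.MULTILINE)
        locator = re.search(r"^[ \t]*Locator:[ \t]*(.+)$", block, flags=re.MULTILINE)
        if not size or not locator:
            continue  # block is missing a field this sketch relies on
        if size.group(1).strip().lower().startswith("no module"):
            continue  # empty slot, nothing to test
        slots.append((locator.group(1).strip(), size.group(1).strip()))
    return slots


if __name__ == "__main__":
    # Print one line per populated slot, to tick off as each module is tested.
    for locator, size in populated_slots(SAMPLE_OUTPUT):
        print(f"[ ] {locator} ({size})")
```

On a live system, the input would come from the real command rather than the sample, for instance via `subprocess.run(["dmidecode", "--type", "memory"], capture_output=True, text=True)`. The list only tells you what is installed; the per-module testing itself still happens in Memtest86+.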

Another tip I’d like to pass along regarding memory is that when you do discover a bad stick of memory, it’s best to erase the hard disk and start over if you can. I understand that this isn’t always feasible, and you could have many hours invested in setting up a server. Some servers can take weeks to rebuild, depending on their workload. But at least keep in mind that any data that passes through defective RAM can become corrupted. This means that data at rest (data stored on your hard disk) may be corrupt if it was sitting in a defective area of RAM before it was written to disk. Once a server or workstation has run with defective RAM, you really can’t trust it anymore. I’ll leave the decision on how to handle this situation up to you (hopefully you’ll never encounter it at all), but just keep this in mind as you plan your course of action. Personally, I don’t trust an installation of any operating system after its hardware has encountered such issues.
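
If a full rebuild isn’t feasible, one way to gauge the damage is to compare files from the affected machine against a copy you trust, such as a backup taken before the problems started. The sketch below is only an illustration under those assumptions: the function names `sha256_of` and `find_mismatches` and the two directory arguments are hypothetical, and the comparison should run on a machine with known-good RAM, since hashing on defective memory is itself unreliable.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(suspect_dir, trusted_dir):
    """Return relative paths that differ from, or are missing in, the trusted copy."""
    suspect_dir, trusted_dir = Path(suspect_dir), Path(trusted_dir)
    mismatches = []
    for suspect_file in sorted(suspect_dir.rglob("*")):
        if not suspect_file.is_file():
            continue  # skip directories and other non-file entries
        relative = suspect_file.relative_to(suspect_dir)
        trusted_file = trusted_dir / relative
        if not trusted_file.is_file() or sha256_of(suspect_file) != sha256_of(trusted_file):
            mismatches.append(str(relative))
    return mismatches
```

A mismatch does not prove corruption, because files that legitimately changed after the backup will also differ. A clean result is likewise no guarantee, as it only covers files that exist in the trusted copy.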

I also recommend that you check the capacitors on your server’s motherboard whenever you’re having odd issues. Although this isn’t necessarily related to memory, I mention it here because bad capacitors produce basically the same symptoms as bad memory. I’m not asking you to get a voltage meter or do any kind of electrician work, but sometimes it may make sense to open the case of your server, shine a flashlight on the capacitors, and see if any of them appear to be leaking fluid or bulging. I bring this up because I’ve personally spent hours troubleshooting a machine (more than once) where I would test the memory and hard disk, and look through system logs, without finding any obvious causes, only to later look at the hardware and discover that capacitors on the motherboard were leaking. It would have saved me a lot of time if I had simply looked at the capacitors first. And that’s really all you have to do: just take a quick glance around the motherboard and look for anything that doesn’t seem right.
