Ubuntu Server 18.04 – Replacing failed RAID disks

RAID is a very useful technology, as it can help your server survive the failure of a single disk. RAID is not a backup solution, but more of a safety net that will hopefully prevent you from having to reload a server. The idea behind RAID is redundancy: data is mirrored or striped among several disks. With most RAID configurations, you can survive the loss of a single disk, so if a disk fails, you can usually replace it, let the array re-sync, and be back to normal. The server itself will continue to work, even with a failed disk. However, losing an additional disk will likely cause the array to fail outright. When a RAID disk fails, you will need to replace it as quickly as you can, hopefully before another disk goes too.

The default live installer for Ubuntu Server doesn’t offer a RAID setup option, but the alternate installer does. If you wish to set up Ubuntu Server with RAID, check out the Appendix at the end of this tutorial.

To check the status of a RAID configuration, you would use the following command:

cat /proc/mdstat
A healthy RAID array
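
For reference, output for a healthy two-disk RAID 1 array looks something like this (the device names, sizes, and metadata version here are illustrative):

Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      488254144 blocks super 1.2 [2/2] [UU]

unused devices: <none>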

In this screenshot, we have a RAID 1 array with two disks. We can tell this from the active raid1 portion of the output. On the next line down, we see this:

[UU]

Believe it or not, this indicates a healthy RAID array: both disks are online and working properly. If either of the Us changes to an underscore, that means a disk has gone offline and we will need to replace it. Here’s a screenshot showing output from that same command on a server with a failed RAID disk:

RAID status output with a faulty drive
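
For reference, degraded output looks something like this (again, illustrative values); notice that /dev/sdb1 has dropped out of the device list and one of the Us has become an underscore:

Personalities : [raid1]
md0 : active raid1 sda1[0]
      488254144 blocks super 1.2 [2/1] [U_]

unused devices: <none>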

As you can see from the screenshot, we have a problem. The /dev/sda disk is online, but /dev/sdb has gone offline. So, what should we do? First, we would need to make sure we understand which disk is working, and which disk is the one that’s faulty. We already know that the disk that’s faulty is /dev/sdb, but when we open the server’s case, we’re not going to know which disk /dev/sdb actually is. If we pull the wrong disk, we can make this problem much worse than it already is. We can use the hdparm command to get a little more info from our drive. The following command will give us info regarding /dev/sda, the disk that’s currently still functioning properly:

sudo hdparm -i /dev/sda
Output of the hdparm command
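
The part of the output we care about is the identification line near the top, which lists the drive’s model and serial number. It will look something like this (the model and firmware values here are placeholders):

/dev/sda:

 Model=EXAMPLE-HD500, FwRev=1AA01112, SerialNo=45M4B24AS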

The reason why we’re executing this command against a working drive is because we want to make sure we understand which disk we should NOT remove from the server. Also, the faulty drive may not respond to our attempts to query it for information at all. Currently, /dev/sda is working fine, so we will not want to disconnect the cables attached to that drive at any point. If you have a RAID array with more than two disks, you’ll want to execute the hdparm command against each working disk. From the output of the hdparm command, we can see that /dev/sda has a serial number of 45M4B24AS. When we look inside the case, we can compare this against the serial number printed on each drive’s label and make sure we do not remove the drive with this serial number.

Next, assuming we already have the replacement disk on hand, we will want to power down the server. Depending on what the server is used for, we may need to do this after hours, but we typically cannot remove a disk while a server is running. Once it’s shut down, we can narrow down which disk /dev/sdb is (or whatever drive designation the failed drive has) and replace it. Then, we can power on the server (it will probably take much longer to boot this time; that’s to be expected given our current situation).

However, simply adding a replacement disk will not automatically resolve this issue. We need to add the new disk to our RAID array so that the array will accept it and rebuild itself. This is a manual process. The first step in rebuilding the RAID array is finding out which designation our new drive received, so we know which disk we are going to add to the array. After the server boots, execute the following command:

sudo fdisk -l

You’ll see output similar to the following:

Checking current disks with fdisk
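
Abbreviated, and with illustrative sizes, that output will look something like the following (note that the RAID device, /dev/md0, also shows up here as a disk):

Disk /dev/sda: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Device     Boot Start       End   Sectors   Size Id Type
/dev/sda1  *     2048 976773167 976771120 465.8G fd Linux raid autodetect

Disk /dev/sdb: 465.8 GiB, 500107862016 bytes, 976773168 sectors

Disk /dev/md0: 465.6 GiB, 499972243456 bytes, 976508288 sectors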

From the output, it should be obvious which disk is the new one. /dev/sda is our original disk, and /dev/sdb is the one that was just added. To make it even more obvious, we can see from the output that /dev/sda has a partition of type Linux raid autodetect, while /dev/sdb doesn’t have a partition at all.

So now that we know which disk is the new one, we can add it to our RAID array. First, we need to copy over the partition tables from the first disk to the new one. The following command will do that:

sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb

Essentially, we are cloning the partition table from /dev/sda (the working drive) to /dev/sdb (the one we just replaced). If you run the same fdisk command we ran earlier, you should see that they both have partitions of type Linux raid autodetect now:

sudo fdisk -l

Now that the partition table has been taken care of, we can add the replaced disk to our array with the following command:

sudo mdadm --manage /dev/md0 --add /dev/sdb1

You should see output similar to the following:

mdadm: added /dev/sdb1

With this command, we are essentially adding the /dev/sdb1 partition to the RAID array designated as /dev/md0. With the last part, you want to make sure you’re executing this command against the correct array designation. If you don’t know what that is, you will see it in the output of the fdisk command we executed earlier, or in the output of cat /proc/mdstat.
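
If you’re ever unsure, you can also ask mdadm directly which devices belong to the array (assuming it’s /dev/md0, as in this example):

sudo mdadm --detail /dev/md0

Among other things, the State line in that output will show the array as degraded, and then as recovering once the new partition has been added and the re-sync begins.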

Now, we should verify that the RAID array is rebuilding properly. We can check this with the same command we always use to check RAID status:

cat /proc/mdstat
Checking RAID status after replacing a disk
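
For reference, output during a rebuild looks something like this (illustrative values); the progress bar and estimated completion time appear on the line below the array’s status:

Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      488254144 blocks super 1.2 [2/1] [U_]
      [=====>...............]  recovery = 28.4% (138664192/488254144) finish=45.3min speed=128572K/sec

unused devices: <none>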

In the output of the previous screenshot, you can see that the RAID array is in recovery mode. Recovery can take quite a while to complete, sometimes even overnight, depending on how much data needs to be re-synced. This is why it’s very important to replace a RAID disk as soon as possible. Once the recovery is complete, the RAID array will be marked as healthy again, and you can rest easy.
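
If you’d like to keep an eye on the rebuild without re-running the command by hand, you can wrap it in watch, which re-runs it every few seconds (press Ctrl+C to exit):

watch -n 5 cat /proc/mdstat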
