23.4 Adding a Spare and Rebuilding After Disk Failure

Alright, let’s get our hands dirty. Your array is degraded. A drive has thrown in the towel, and the little [U_] in /proc/mdstat (or the “clean, degraded” state in your mdadm --detail output) is mocking you. Don’t panic. This is exactly what you built this system for. It’s not a disaster; it’s a feature. Think of it like your car’s “check engine” light: annoying, but a sign the system is smart enough to know something’s wrong. Your job now is to be the mechanic.

First, confirm the situation. Don’t just yank drives based on a feeling. The /proc/mdstat file is your best friend here. It gives you the raw, unvarnished truth.

cat /proc/mdstat

You’ll see something gloriously depressing like:

Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[2](F)
      976630336 blocks [2/1] [U_]

There it is. [U_] means one device is up, one is… not. The (F) next to sdb1 confirms the kernel has marked it as failed. The array is still running, limping along in a degraded state, serving your data from the remaining good drive. Now, before you do anything else, check your system logs! dmesg -T | grep -i sd or journalctl --since="1 hour ago" can tell you why the drive failed. Was it a read error? Did it just vanish? This is crucial intel for the next step.
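If you have more than one array, eyeballing /proc/mdstat gets old fast. Here is a minimal sketch of a filter that prints only the degraded ones. It assumes the usual mdstat layout (an "mdN :" line followed by a line carrying the [U_] status string), and degraded_arrays is just a name made up for this example.

```shell
#!/bin/sh
# List md arrays whose member-status string (e.g. [U_]) contains an underscore,
# meaning at least one slot has no working device.
# Reads mdstat-formatted text on stdin so it can be tried on a saved copy first.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember which array is being described
        name != "" && /\[[U_]+\]/ {
            if (match($0, /\[[U_]+\]/)) {
                status = substr($0, RSTART, RLENGTH)
                if (index(status, "_") > 0) print name, status
            }
            name = ""                        # status consumed; wait for the next array
        }
    '
}

# Typical use on a live system:
#   degraded_arrays < /proc/mdstat
```

Fed the output shown above, it should print md0 along with its [U_] status and stay silent for healthy arrays.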

Physically Replacing the Drive

If this is a physical server, now you get to play “hot-swap roulette.” Before you pull anything, detach the failed member from the array (mdadm /dev/md0 --remove /dev/sdb1) so md lets go of the device. Power down if you’re not 100% confident your hardware supports hot-swapping. The drive you pull should be the one with the little fault LED blinking at you. If all the lights are green and you’re just going by the software alert, double and triple-check you’re pulling the correct one, ideally by matching the serial number on the drive label against what the OS reports. I’ve seen more than one junior sysadmin power down a rack to replace a perfectly healthy drive because they got the bay number wrong. It’s a rite of passage, but let’s skip it today.
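Since pulling the wrong drive is the classic way to turn a degraded array into a dead one, it can help to script the retirement step and rehearse it first. The sketch below is one way to do that, assuming the array is /dev/md0 and the failed member is /dev/sdb1 as in the example output. The run and retire_member helpers are made up for illustration, and the script only prints commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Print (default) or execute the commands that retire a failed member.
# DRY_RUN=1 (the default here) only echoes; set DRY_RUN=0 to really run them as root.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

retire_member() {
    array=$1     # e.g. /dev/md0
    member=$2    # e.g. /dev/sdb1
    # Show serial numbers so the label on the physical drive can be matched
    # to the kernel device name before anything is pulled.
    run lsblk -o NAME,SERIAL,MODEL,SIZE
    # Make sure md considers the member failed, then detach it from the array.
    run mdadm "$array" --fail "$member"
    run mdadm "$array" --remove "$member"
}

retire_member /dev/md0 /dev/sdb1
```

If the dead drive has vanished from the system entirely, so there is no device node left to name, mdadm's --remove also accepts the keywords failed and detached in place of a device.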

If you’re in a VM or cloud environment, this usually means detaching the bad virtual disk and attaching a new one. The process is different for every provider, but the outcome is the same: you present a new, blank, same-sized (or larger!) block device to the OS.
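That "same-sized (or larger!)" requirement is worth checking before you go any further, because mdadm will refuse a member that is too small. A minimal sketch of the comparison follows. The fits_as_member name and the device names in the comments are assumptions for illustration, so substitute your own.

```shell
#!/bin/sh
# Succeed only if the replacement is at least as large as the member it replaces.
# Both arguments are sizes in bytes.
fits_as_member() {
    old_bytes=$1
    new_bytes=$2
    [ "$new_bytes" -ge "$old_bytes" ]
}

# On a live system the sizes could come from blockdev (run as root):
#   old=$(blockdev --getsize64 /dev/sda1)
#   new=$(blockdev --getsize64 /dev/sdc1)
#   fits_as_member "$old" "$new" || echo "replacement is too small" >&2
#
# If the new disk is blank, one common way to give it the same partition
# layout as the surviving disk is to dump and replay the table with sfdisk.
# Triple-check which disk is the source and which is the target:
#   sfdisk -d /dev/sda | sfdisk /dev/sdc
```

Comparing partitions rather than whole disks matters here: two drives sold as the same capacity can differ by a few megabytes, and only the member partition has to match.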

Letting the Array Rebuild Automatically

Here’s the beautiful part. If you were smart and pre-attached a spare drive to the array when you built it (mdadm --create ... --spare-devices=1), you can just lean back and watch the magic happen. The md driver will have already noticed the failure, kicked the bad drive out, and promoted the spare into the array, starting a rebuild automatically. Check cat /proc/mdstat again and you’ll see a reassuring [=>...] recovery progress bar. Your job here is just to make popcorn and watch iostat to see the glorious rebuild I/O.
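If you think you have a hot spare but aren't sure, check before you need it. In /proc/mdstat, spare members carry an (S) marker after their slot number. Here is a small sketch that lists them for one array. It assumes that format, and list_spares is a name invented for this example.

```shell
#!/bin/sh
# Print the spare members, marked (S) in mdstat, of the named array.
# Reads mdstat-formatted text on stdin.
list_spares() {
    awk -v want="$1" '
        $1 == want && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(S\)$/) {
                    dev = $i
                    sub(/\[[0-9]+\]\(S\)$/, "", dev)   # strip "[slot](S)"
                    print dev
                }
        }
    '
}

# Typical use on a live system:
#   list_spares md0 < /proc/mdstat
```

No output means no spare, and the next failure will wait for you to add one by hand.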

But let’s be honest, most of us don’t keep expensive drives sitting around as idle spares. You probably have to add one manually.

Adding a New Spare and Kicking Off the Rebuild

You’ve got your new drive installed. The OS sees it as, say, /dev/sdc, and once you’ve partitioned it to match the surviving disk you have /dev/sdc1. Now you need to add that partition to the array as a spare. The mdadm --add command is how you do it.

# Syntax: mdadm --add /dev/mdX /dev/new_device
mdadm --add /dev/md0 /dev/sdc1

Now, because the array is degraded and has a fresh spare, it will immediately start rebuilding. No other command needed. Watch the progress with:

cat /proc/mdstat
# Or for a more persistent watch
watch cat /proc/mdstat

You’ll see the recovery line. It’s a beautiful thing. The kernel is reading every block from the remaining good drive and writing it to the new one. This is a heavy I/O operation. Your system will be slow. This is normal. Don’t try to do performance benchmarking on your database while this is happening.
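If you'd rather log the progress than stare at it, the recovery line is easy to pick apart. The sketch below pulls out the percentage and the kernel's own finish estimate. It assumes the usual "recovery = N% ... finish=..." layout, rebuild_progress is a made-up helper name, and the sample line in the comment is illustrative rather than captured from a real system.

```shell
#!/bin/sh
# Print "<percent> <estimated time left>" for each array that is rebuilding.
# Reads mdstat-formatted text on stdin. An illustrative recovery line:
#   [=>...................]  recovery =  8.5% (83042176/976630336) finish=78.2min speed=190000K/sec
rebuild_progress() {
    awk '
        /(recovery|resync) *=/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) { pct = $i; sub(/^=/, "", pct) }
                if ($i ~ /^finish=/) { eta = $i; sub(/^finish=/, "", eta) }
            }
            print pct, eta
        }
    '
}

# Typical use on a live system:
#   rebuild_progress < /proc/mdstat
```

It prints nothing when no rebuild is running, which makes it easy to use in a cron job or a monitoring check.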

Monitoring and Best Practices

Let it run. A rebuild can take hours for large drives. You can check the details any time with:

mdadm --detail /dev/md0

Look for the Rebuild Status line. Don’t interrupt this process if you can help it. A clean reboot will pause it, and md will normally pick up from its last checkpoint rather than starting over, which is fine; after a power failure it may have to redo more of the work. The real nightmare scenario is if your one remaining good drive fails during the rebuild. Then you’re proper hosed. This is why you have backups. You do have backups, right?
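To put "hours" into rough numbers: the block count in /proc/mdstat is in 1 KiB units, so dividing it by a sustained rebuild speed gives a ballpark duration. The sketch below does that arithmetic for the 976630336-block array from the example. The 100000 KiB/s figure is purely an assumption for illustration, and rebuild_minutes is a hypothetical helper name.

```shell
#!/bin/sh
# Rough rebuild time in whole minutes:
# data to copy (KiB) divided by sustained speed (KiB/s), divided by 60.
rebuild_minutes() {
    kib=$1
    kib_per_sec=$2
    echo $(( kib / kib_per_sec / 60 ))
}

# The example array, at an assumed steady 100000 KiB/s:
rebuild_minutes 976630336 100000    # prints 162
```

Real rebuilds are usually slower than this when the array is busy, because md throttles itself between the limits in /proc/sys/dev/raid/speed_limit_min and speed_limit_max (both in KiB/s). If you need a script to block until the rebuild is done, mdadm --wait /dev/md0 does that.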

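Once the rebuild is complete, --detail will show every member as active sync and the array state as clean. The array is whole again. But you’re not done. Check your config files. mdadm.conf normally identifies an array by its UUID, which doesn’t change when you swap a member, so the existing ARRAY line is often still correct. If yours lists member devices explicitly, or was never kept up to date, refresh it now so the array assembles correctly on the next boot.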

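# Append the current ARRAY definitions (Debian/Ubuntu path).
# >> appends, so remove the stale ARRAY line first to avoid duplicates.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# Or on distros that keep the file directly in /etc (e.g. RHEL, Fedora)
mdadm --detail --scan >> /etc/mdadm.conf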

Pro tip: It’s often cleaner to first remove the old, incorrect definition from the file, then run the --scan command to append a fresh, correct one. Finally, update your initramfs so the early boot system knows about the new layout: update-initramfs -u on Debian and Ubuntu, or dracut -f on dracut-based distros such as RHEL and Fedora. Reboot and verify everything comes up cleanly. Congratulations, you’ve just performed data-saving surgery. Now go buy a spare drive for next time.
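The "remove the old definition, then append a fresh one" routine can be done without hand-editing. The sketch below writes the merged result to standard output so you can review it before touching the real file. The merge_mdadm_conf name and the /tmp paths in the comments are assumptions for illustration.

```shell
#!/bin/sh
# Print a new config: every non-ARRAY line of the old file, followed by
# the fresh ARRAY lines. Nothing is modified in place.
merge_mdadm_conf() {
    old_conf=$1      # existing mdadm.conf
    fresh_scan=$2    # file holding the output of: mdadm --detail --scan
    grep -v '^ARRAY[[:space:]]' "$old_conf"
    cat "$fresh_scan"
}

# Typical use (as root), reviewing the diff before copying into place:
#   mdadm --detail --scan > /tmp/scan.txt
#   merge_mdadm_conf /etc/mdadm/mdadm.conf /tmp/scan.txt > /tmp/mdadm.conf.new
#   diff -u /etc/mdadm/mdadm.conf /tmp/mdadm.conf.new
```

One caveat with this approach: it drops every old ARRAY line, so any array that isn't currently assembled will disappear from the config. Read the diff before you commit to it.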