23.3 /proc/mdstat and mdadm --detail: Monitoring Array Health
Alright, let’s talk about checking the pulse of your RAID array. You didn’t go through all the trouble of building this digital Voltron just to cross your fingers and hope for the best. You need to know its status, and for that, we have two primary tools: the kernel’s status file, /proc/mdstat, and the Swiss Army knife itself, mdadm --detail. One gives you the quick, at-a-glance view; the other gives you the full medical chart. You’ll use both.
The Kernel’s Bulletin Board: /proc/mdstat
Think of /proc/mdstat as the kernel’s real-time, text-only dashboard for all things MD (Multiple Device). It’s not a static file; it’s a dynamic peek into the kernel’s brain. The beauty of this is its universality—it’s always there, even if you chroot into a broken system or mdadm isn’t installed. Cat it out and let’s see what we’re working with.
cat /proc/mdstat
Here’s a realistic example of a healthy, freshly resynced RAID1 array:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sdc1[0]
976630336 blocks super 1.2 [2/2] [UU]
bitmap: 0/8 pages [0KB], 65536KB chunk
Let’s break this down line by line because it’s telling you a story.
- Personalities: This just lists the RAID levels the current kernel supports. Don’t get excited, it doesn’t mean you’re using all of them.
- md0 : active raid1 sdb1[1] sdc1[0]: This is the headline. Your array is named md0, it’s active, it’s a raid1, and it’s composed of sdb1 (which it knows as device [1]) and sdc1 (device [0]). The bracketed numbers are the device numbers md has assigned; the order in which the members are printed means nothing.
- 976630336 blocks super 1.2: The size in 1 KiB blocks and, crucially, the version of the metadata (super 1.2). This is important for recovery scenarios.
- [2/2] [UU]: This is the most important part. The first field, [2/2], means “we expect 2 disks, and we have 2 disks.” The second, [UU], is the status of each slot: U means up and in sync, and _ (underscore) means that slot has no working device. Failed and spare members are not flagged in this field; they are marked with (F) or (S) after the device name on the headline line.
- bitmap: This shows an internal write-intent bitmap is being used. This is a best practice for RAID levels like 1/5/6 that need to resync; it drastically reduces resync time after a crash or unclean shutdown by only syncing the regions that were being written to, instead of the whole disk. The 0/8 pages [0KB] means it’s currently not tracking any dirty regions, which is what you want.
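The [UU] field is easy to check from a script. As a minimal sketch (the function name degraded_arrays is made up for illustration, and it assumes the two-line layout shown above), this shell function prints every array whose status field contains an underscore:

```shell
#!/bin/sh
# Print the name of every md array whose member-status field (e.g. [U_])
# contains an underscore. Reads /proc/mdstat by default; pass another
# path to run it against saved output.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }                # headline: remember the array name
        name != "" && match($0, /\[[U_]+\]/) {    # status line: locate the [UU] field
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
            name = ""                             # done with this array
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument reads the live file; no output means every array it found has all of its slots up.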
Now, let’s look at a degraded array. This is where you’ll see it most often, hopefully because you’re testing your monitoring and not because of a real failure.
md0 : active raid1 sdd1[2] sdb1[1] sdc1[0](F)
976630336 blocks super 1.2 [2/1] [_U]
[=========>...........] recovery = 45.2% (441436912/976630336) finish=27.2min speed=328273K/sec
Oh, drama! sdc1 has an (F) for failed. The [2/1] shows we expect 2 but only have 1 fully active. The [_U] confirms it: first slot is down (_), second is up (U). And because we had a spare disk (sdd1) attached to the array, the kernel’s md driver has automatically started a recovery to rebuild the array onto that spare, which is why sdd1 now appears on the headline line. It’s kindly giving us progress, an ETA, and the current rebuild speed. This is why you have spare disks: so the system fixes itself while you sleep.
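If you want that progress line in a script or a status bar, it can be pulled out with awk. This is a rough sketch rather than a robust parser: the function name rebuild_progress is invented here, and it assumes the field layout shown in the example above.

```shell
#!/bin/sh
# Print "<array> <percent> <eta>" for each array that is currently
# recovering, resyncing, reshaping, or running a check.
# Reads /proc/mdstat by default; pass another path to use saved output.
rebuild_progress() {
    awk '
        /^md[^ ]* :/ { name = $1 }                     # remember which array we are in
        /(recovery|resync|reshape|check) *=/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       pct = $i           # e.g. 45.2%
                if ($i ~ /^finish=/) eta = substr($i, 8) # text after "finish="
            }
            if (pct != "") print name, pct, eta         # skip lines with no percentage
        }
    ' "${1:-/proc/mdstat}"
}
```

Wrapping it in something like watch gives a live view without having to squint at the progress bar.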
The Full Medical Exam: mdadm --detail
While /proc/mdstat is great for a quick glance, mdadm --detail is the comprehensive diagnostic tool. It asks the kernel about the running array and gives you a consolidated, human-friendly report. (To read the superblock stored on an individual member device, use mdadm --examine instead.) You need to run this as root.
sudo mdadm --detail /dev/md0
The output will look something like this:
/dev/md0:
           Version : 1.2
     Creation Time : Thu Oct 26 13:24:17 2023
        Raid Level : raid1
        Array Size : 976630336 (931.39 GiB 1000.07 GB)
     Used Dev Size : 976630336 (931.39 GiB 1000.07 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent
       Update Time : Sun Oct 29 10:15:22 2023
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 1
Consistency Policy : bitmap
    Rebuild Status : 45% complete
              Name : my-server:0  (local to host my-server)
              UUID : f8b00a1d:07c54e81:1d0c3f2a:14d144c2
            Events : 45678
    Number   Major   Minor   RaidDevice State
       2       8       49        0      spare rebuilding   /dev/sdd1
       1       8       17        1      active sync   /dev/sdb1
       0       8       33        -      faulty   /dev/sdc1
This is where you get the gold. Notice it shows Total Devices : 3 and Spare Devices : 1. It gives you a Rebuild Status and, most importantly, a table at the bottom showing the exact state of every single device associated with this array. You can see that /dev/sdd1 is a spare rebuilding into slot 0, which is exactly what we want to see after a failure, while /dev/sdc1 is listed as faulty and no longer holds a slot. The State : clean, degraded, recovering line is a masterpiece of kernel understatement: “Yeah, everything’s fine, but also it’s broken, but don’t worry we’re fixing it.”
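The State line is the part most worth checking automatically. Here is a small sketch, assuming the "State : a, b, c" layout shown above; the function names detail_state_flags and is_degraded are made up for this example. Both read mdadm --detail output on standard input.

```shell
#!/bin/sh
# Split the State line of `mdadm --detail` output (read from stdin)
# into one flag per line, with surrounding spaces trimmed.
detail_state_flags() {
    sed -n 's/^ *State : *//p' | tr ',' '\n' | sed 's/^ *//; s/ *$//'
}

# Succeed (exit status 0) if one of the State flags is exactly "degraded".
is_degraded() {
    detail_state_flags | grep -qx 'degraded'
}
```

A typical use would be sudo mdadm --detail /dev/md0 | is_degraded && echo "md0 is degraded". The column header of the device table also contains the word State, but it has no colon after it, so the sed pattern skips it.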
Common Pitfalls and Best Practices
- The “I see UU, I’m good!” Trap: A [UU] in /proc/mdstat only means the devices are present and the array is synced. It does not mean the underlying disks are healthy. A disk can be silently failing, throwing correctable read errors, and still show U. md only marks a member faulty once I/O to it actually fails (typically a write error, or a read error it cannot repair from the remaining data), and by then the drive may have been deteriorating for a while. This is why you must pair your mdadm monitoring with smartctl -a /dev/sdX to check the SMART health of the individual physical drives.
- Check Your Email: This is not a joke. mdadm in monitor mode is fantastic at sending email alerts when an array degrades, provided a mail address is configured. If you set this up and ignore it, you are doing it wrong. The email names the event and the affected array and includes a snapshot of /proc/mdstat, and it’s your first and best line of defense.
- Test Your Recovery: You think you know what to do when a disk fails. You don’t. Not until you’ve physically unplugged a drive from a live system and walked through the process of marking it failed, removing it, adding a new one, and watching it rebuild. Do this on a test array first. Your future, panicked self will thank you.
- Read the Darn State: The State line in --detail is crucial. clean means no writes are pending and the metadata is up to date, which is why it can sit right next to degraded. active means writes are in flight, so the array is not currently marked clean; that is normal on a busy array or during a reshape. degraded means you’ve lost a disk. recovering or resyncing is good: it means the array is fixing itself. If you see readonly and you didn’t ask for it, panic (a little). An array can be made read-only on purpose (mdadm --readonly), and a newly assembled array may sit in an automatic read-only state until its first write, so it is not by itself proof of damage, but find out why before you trust the array with new data.
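For the SMART half of that pairing, smartctl reports what it found through its exit status, which is a bit mask rather than a simple pass/fail. The sketch below decodes it; the function name smart_exit_flags and the message wording are mine, while the bit meanings follow the exit-status section of the smartctl man page.

```shell
#!/bin/sh
# Decode the bit mask that smartctl returns as its exit status.
# Usage: smart_exit_flags <status>   (prints one line per set bit)
smart_exit_flags() {
    status=$1
    bit=0
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART command failed or returned bad data" \
        "SMART status: DISK FAILING" \
        "prefail attribute at or below threshold" \
        "attribute was at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test bits 0 through 7 in order, one message per bit.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "$msg"
        fi
        bit=$((bit + 1))
    done
}
```

A cron job could run sudo smartctl -a /dev/sdb >/dev/null; smart_exit_flags $? for each member disk and mail you anything it prints. An exit status of 0 prints nothing.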