23.3 /proc/mdstat and mdadm --detail: Monitoring Array Health

Alright, let’s talk about checking the pulse of your RAID array. You didn’t go through all the trouble of building this digital Voltron just to cross your fingers and hope for the best. You need to know its status, and for that, we have two primary tools: the kernel’s status file, /proc/mdstat, and the Swiss Army knife itself, mdadm --detail. One gives you the quick, at-a-glance view; the other gives you the full medical chart. You’ll use both.

The Kernel’s Bulletin Board: /proc/mdstat

Think of /proc/mdstat as the kernel’s real-time, text-only dashboard for all things MD (Multiple Device). It’s not a static file; it’s a dynamic peek into the kernel’s brain. The beauty of this is its universality: as long as the md driver is loaded, it’s there, even if you chroot into a broken system (with /proc mounted) or mdadm isn’t installed. Cat it out and let’s see what we’re working with.

cat /proc/mdstat

Here’s a realistic example of a healthy, freshly resynced RAID1 array:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sdc1[0]
      976630336 blocks super 1.2 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 65536KB chunk

Let’s break this down line by line because it’s telling you a story.

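Personalities: This lists the RAID personalities (level drivers) currently loaded in the running kernel. Don’t get excited, it doesn’t mean you’re using all of them, and a level missing from the list may just mean its module hasn’t been loaded yet.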
md0 : active raid1 sdb1[1] sdc1[0]: This is the headline. Your array is named md0, it’s active, it’s a raid1, and it’s composed of sdb1 (device number [1] in the array) and sdc1 (device number [0]). The bracketed number is the device’s index within the array, not a ranking, and the order in which members are listed on this line carries no meaning.
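976630336 blocks super 1.2: The size in 1 KiB blocks (about 931 GiB here) and, crucially, the version of the metadata (super 1.2). The metadata version matters in recovery scenarios because it determines where on each member the superblock lives.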
[2/2] [UU]: This is the most important part. The first [2/2] means “the array wants 2 devices, and 2 are active.” The second [UU] is the status of each slot, in slot order. U means up and in sync. A _ (underscore) means that slot is missing, failed, or still being rebuilt. Only U and _ ever appear here; the per-device flags live on the headline instead, where a failed member gets (F) after its name and a spare gets (S).
bitmap: This shows an internal write-intent bitmap is being used. This is a best practice for redundant levels like 1/5/6 that need to resync; it drastically reduces resync time after a crash, or after a member briefly drops out and is re-added, by only syncing the regions that were being written to instead of the whole device. The 0/8 pages [0KB] means it’s currently not tracking any dirty regions, which is what you want.
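If you want a script to do that reading for you, here is a minimal sketch. It assumes the only thing you care about is an underscore in the per-slot status string, and the function name degraded_arrays is made up for this example. It takes an optional file argument so you can try it on a saved copy of the output instead of the live /proc/mdstat.

```shell
# List md arrays whose per-slot status string (e.g. [U_]) contains an underscore.
# Reads mdstat-formatted text from the file given as $1 (default: /proc/mdstat).
# degraded_arrays is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        # Headline of an array block, e.g. "md0 : active raid1 ..."
        $1 ~ /^md/ && $2 == ":" { name = $1 }

        # The status string is the only bracket group made purely of U and _
        match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0) print name, status
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument checks the running system. It prints nothing when every slot is U, and one line per degraded array (array name, then status string) otherwise, which makes it easy to drop into a cron job or a health check.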

Now, let’s look at a degraded array. With any luck, the first time you see this will be while you’re testing your monitoring, not during a real failure.

md0 : active raid1 sdd1[2] sdb1[1] sdc1[0](F)
      976630336 blocks super 1.2 [2/1] [_U]
      [=========>...........]  recovery = 45.2% (441436912/976630336) finish=27.2min speed=328273K/sec

Oh, drama! sdc1 has an (F) for failed. The [2/1] shows the array wants 2 devices but only 1 is fully active. The [_U] confirms it: the first slot is down (_), the second is up (U). And because the array had a spare disk (sdd1) standing by, the kernel’s md driver has automatically started a recovery to rebuild onto that spare. It’s kindly giving us a progress bar, the current speed, and an ETA (an estimate that moves with I/O load). This is why you have spare disks: the system fixes itself while you sleep.
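If you’d rather have a script pull out the progress figure than squint at the bar, something like the following works as a rough sketch. The function name rebuild_progress is invented for this example, and it assumes the progress line keeps the “action = NN.N%” shape shown above.

```shell
# Print "<array> <action> <percent>" for each array that is currently
# resyncing, recovering, reshaping or being checked.
# Reads mdstat-formatted text from the file given as $1 (default: /proc/mdstat).
# rebuild_progress is a hypothetical helper name, not part of mdadm.
rebuild_progress() {
    awk '
        # Remember which array the following lines belong to
        $1 ~ /^md/ && $2 == ":" { name = $1 }

        # Progress lines look like: "[====>....]  recovery = 45.2% (...) finish=..."
        match($0, /(resync|recovery|reshape|check) *= *[0-9.]+%/) {
            split(substr($0, RSTART, RLENGTH), kv, / *= */)
            print name, kv[1], kv[2]
        }
    ' "${1:-/proc/mdstat}"
}
```

For an array in the middle of a rebuild this prints a line such as “md0 recovery 45.2%”; for an idle array it prints nothing. Pairing it with watch (for example, watch -n 10 cat /proc/mdstat) gives you the raw view and the condensed view side by side.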

The Full Medical Exam: mdadm --detail

While /proc/mdstat is great for a quick glance, mdadm --detail is the comprehensive diagnostic tool. It asks the kernel about the running array and gives you a consolidated, human-friendly report. (Its sibling, mdadm --examine, is the one that reads the superblock off an individual member device.) You will normally need to run this as root.

sudo mdadm --detail /dev/md0

The output will look something like this:

/dev/md0:
           Version : 1.2
     Creation Time : Thu Oct 26 13:24:17 2023
        Raid Level : raid1
        Array Size : 976630336 (931.39 GiB 1000.07 GB)
     Used Dev Size : 976630336 (931.39 GiB 1000.07 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Sun Oct 29 10:15:22 2023
             State : clean, degraded, recovering
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 45% complete

              Name : my-server:0  (local to host my-server)
              UUID : f8b00a1d:07c54e81:1d0c3f2a:14d144c2
            Events : 45678

    Number   Major   Minor   RaidDevice State
       2       8       49        0      spare rebuilding   /dev/sdd1
       1       8       17        1      active sync   /dev/sdb1

       0       8       33        -      faulty   /dev/sdc1

This is where you get the gold. Notice it shows Total Devices : 3 and Spare Devices : 1. It gives you a precise Rebuild Status and, most importantly, a table at the bottom showing the exact state of every single device associated with this array. You can see one member in active sync and another marked spare rebuilding, which is exactly what we want to see after a failure: the spare is taking over the dead member’s slot. The State : clean, degraded, recovering line is a masterpiece of kernel understatement: “Yeah, everything’s fine, but also it’s broken, but don’t worry we’re fixing it.”
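Because the top half of the report is a list of “Key : value” lines, it is easy to pick out a single field in a script. The helper below is a minimal sketch: detail_field is a made-up name, and it assumes the output keeps the layout shown above. It reads the report on stdin, so it never calls mdadm itself.

```shell
# Print the value of one "Key : value" field from `mdadm --detail` output.
# Usage: mdadm --detail /dev/md0 | detail_field "State"
# detail_field is a hypothetical helper name, not part of mdadm.
detail_field() {
    awk -F' : ' -v key="$1" '
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)      # trim the padding around the key
            if (k == key && !found) {
                sub(/^[^:]*: /, "")             # drop "   Key : ", keep the whole value
                print
                found = 1
            }
        }
    '
}

# Example (needs root and a real array):
#   sudo mdadm --detail /dev/md0 | detail_field "State"
#   sudo mdadm --detail /dev/md0 | detail_field "Failed Devices"
```

If all you need is a yes/no answer, mdadm also documents a --test option for --detail that sets the exit status according to the array’s health; check the man page for the exact codes on your version before relying on them.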

Common Pitfalls and Best Practices

The “I see UU, I’m good!” Trap: A [UU] in /proc/mdstat only means the devices are present and the array is synced. It does not mean the underlying disks are healthy. A disk can be silently failing, throwing correctable read errors, and still show U. This is why you must pair your mdadm monitoring with smartctl -a /dev/sdX to check the SMART health of the individual physical drives. md only kicks a device out when a write to it fails or a read error can’t be repaired from the other members; it has no idea what the drive’s own health counters say.
Check Your Email: This is not a joke. mdadm is fantastic at sending email alerts when an array degrades, provided mdadm --monitor is running (most distros ship a service for it) and MAILADDR is set in mdadm.conf. If you set this up and ignore it, you are doing it wrong. The email names the event and the array and typically includes a snapshot of /proc/mdstat, and it’s your first and best line of defense.
Test Your Recovery: You think you know what to do when a disk fails. You don’t. Not until you’ve physically unplugged a drive from a live system and walked through the process of marking it failed, removing it, adding a new one, and watching it rebuild. Do this on a test array first. Your future, panicked self will thank you.
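Read the Darn State: The state line in --detail is crucial. clean means no writes are pending and the on-disk metadata is consistent. active means the array is in use and has writes in flight, which is normal on a busy system. degraded means you’ve lost a disk. recovering/resyncing is good: it means it’s fixing itself. If you see FAILED, panic (a little): more members are gone than the RAID level can tolerate, and the array can no longer serve your data. A read-only or auto-read-only array, by contrast, is usually nothing to worry about: it was either set that way by an administrator or simply hasn’t been written to since it was assembled.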
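For the “Test Your Recovery” drill, the mdadm side of the process comes down to three manage-mode commands: --fail, --remove, and --add. The sketch below wraps them in a function that only prints the commands unless you explicitly set RUN=1. The function name replace_drill and the device names in the usage line are placeholders for this example, so substitute your own test array and run it for real only on disks you can afford to lose.

```shell
# Print (or, with RUN=1, execute) the mdadm steps for a disk-replacement drill.
# Usage: replace_drill <array> <old-member> <new-member>
# replace_drill is a hypothetical helper name, not part of mdadm.
replace_drill() {
    array=$1
    old=$2
    new=$3

    # Dry run by default: prefix every command with echo so nothing is changed
    run=echo
    if [ "${RUN:-0}" = "1" ]; then
        run=
    fi

    $run mdadm "$array" --fail "$old"      # mark the member as faulty
    $run mdadm "$array" --remove "$old"    # detach it from the array
    $run mdadm "$array" --add "$new"       # add the replacement; the rebuild starts on its own
    $run cat /proc/mdstat                  # watch the recovery line appear
}

# Example (prints the commands only; device names are placeholders):
#   replace_drill /dev/md9 /dev/sdx1 /dev/sdy1
```

Keeping the dry run as the default is deliberate: you can read the exact commands, check the device names twice, and only then rerun with RUN=1 as root.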