
The RAID resync that stalled at 99% on my NAS, and the bad-sector workaround that finished the rebuild

RAID arrays are the bedrock of data protection for so many of us running home servers or NAS devices. Whether you’re archiving family photos, working files, or even media collections, a robust RAID setup offers peace of mind. But what happens when that trust is shaken—right at 99%—with a resync that just refuses to complete? That was the scenario I recently faced. My NAS RAID rebuild stalled at 99% for hours, and what followed was a deep dive into troubleshooting, disk diagnostics, and ultimately, a workaround involving bad-sector handling that saved the day.

TL;DR

My RAID resync stopped at 99% due to unreadable sectors on one of the drives. After some troubleshooting, I discovered the rebuild was stalling over a few bad blocks. By forcing the system to skip unreadable sectors using a manual bad-sector workaround, I successfully completed the rebuild with minimal data loss. This experience highlights the importance of regular S.M.A.R.T. checks and having a recovery plan in place.

The Setup: RAID 5 on a Four-Bay NAS

My configuration was fairly typical: a custom-built NAS running Linux with four 4TB drives in a RAID 5 array using mdadm. The goal was a balance between performance, redundancy, and capacity. For the first few years, the system ran flawlessly: no drive drop-outs, no errors, just solid performance.
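For context, here is a minimal sketch of how an array like that is typically created with mdadm; the device names and mount point are placeholders, not my exact setup:

    # Build a four-drive RAID 5 array (device names are examples only)
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # Record the array so it assembles at boot (config path varies by distro)
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf

    # Put a filesystem on it and mount it
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/storage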

Things changed when one of the drives began exhibiting S.M.A.R.T. warnings. I preemptively replaced it with a brand-new drive of the exact same model. The rebuild process kicked in automatically, and the array started resyncing. Everything seemed fine until the rebuild hit 99%, and then… nothing. No progress for hours. The resync had stalled, and the logs weren’t telling a clear story.
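For anyone doing a preemptive swap by hand, the usual mdadm sequence looks roughly like the sketch below; the device names are placeholders, and some NAS frontends run these steps for you:

    # Mark the ailing member as failed and pull it from the array
    mdadm /dev/md0 --fail /dev/sdd
    mdadm /dev/md0 --remove /dev/sdd

    # After physically swapping the disk, add the new drive;
    # the resync starts automatically
    mdadm /dev/md0 --add /dev/sdd

    # Keep an eye on rebuild progress
    watch -n 60 cat /proc/mdstat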

Digging Into the Problem

When a RAID rebuild stalls without logs reporting any fatal errors, it often points to a subtle hardware or parity mismatch issue. I took several steps to isolate the root cause:

  • Ran smartctl on all drives – One of the remaining old drives had reallocated sectors plus pending and offline-uncorrectable sectors.
  • Checked dmesg and /var/log/syslog – These confirmed I/O errors on specific block addresses.
  • Used mdadm --detail /dev/md0 – This showed the array was in the ‘resyncing’ state but stuck at the same block number. (The exact triage commands are sketched below.)
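For reference, the triage boiled down to a handful of commands along these lines (device and array names are examples; adapt them to your system):

    # SMART health and the attributes that matter for bad sectors
    smartctl -a /dev/sdb | grep -Ei 'reallocated|pending|uncorrect'

    # Kernel I/O errors and the sector addresses they reference
    dmesg | grep -iE 'i/o error|sector'

    # Array state and rebuild progress
    mdadm --detail /dev/md0
    cat /proc/mdstat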

These clues led me to conclude that the rebuild was hitting one or more bad sectors—common on aging drives—and getting stuck trying endlessly to read them.

Understanding RAID’s Sensitivity to Read Errors

RAID 5 requires all drives to participate correctly during a rebuild. If one of the surviving drives can’t read a particular sector (even if that drive isn’t the one being replaced), the kernel can’t reconstruct that stripe and will stall or fail the rebuild. This is a known Achilles’ heel of RAID 5: an unrecoverable read error during a rebuild is as good as data loss for that stripe.

But not all is lost. In cases like mine, the drive may only have a few unreadable sectors, and if we can encourage the OS (Linux, in this case) to skip over or remap those instead of retrying them endlessly, we can avoid total failure.

The Bad Sector Workaround That Saved the Day

After identifying the blocks where the resync was stalling, I used a manual procedure involving dd and hdparm to either recover the data in those sectors or force the drive to remap them.

  1. First, I got the failing sector addresses from the kernel logs.
  2. I then used this command to manually attempt a read and force reallocation:
    dd if=/dev/sdX of=/dev/null bs=512 skip=SECTOR count=1
  3. In cases where dd wouldn’t succeed, I used hdparm --read-sector to read the sector at a low level. The goal was to either retrieve the data or let the drive remap it from its internal reserve.
  4. After successfully coaxing the drive past the bad sectors or marking them as unreadable, the rebuild resumed—and completed.

Note: Be extremely cautious when using these tools. A mistyped command can cause data loss. Always double-check device names and sector numbers before running anything like the sketch below.
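For completeness, the sequence looked roughly like the following. Treat it as an illustration under my assumptions, not a recipe: the device name and sector number are placeholders, overwriting the sector is the commonly documented last resort for forcing a remap rather than necessarily the exact step I needed, and that write destroys the 512 bytes stored in the sector (which is where my “minimal data loss” came from).

    # Placeholders: use the sector number reported in dmesg and the real device
    SECTOR=123456789
    DISK=/dev/sdX

    # 1. Try to read the sector; skip= is in units of bs, so bs must match
    #    the 512-byte sector size the kernel reported
    dd if=$DISK of=/dev/null bs=512 skip=$SECTOR count=1

    # 2. If dd fails, try a low-level read through the drive itself
    hdparm --read-sector $SECTOR $DISK

    # 3. Last resort: overwrite the sector so the drive remaps it from its
    #    spare pool. THIS DESTROYS the data in that one sector.
    hdparm --write-sector $SECTOR --yes-i-know-what-i-am-doing $DISK

    # 4. Confirm the pending-sector count dropped before letting the resync continue
    smartctl -a $DISK | grep -Ei 'pending|reallocated'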

What I Learned From the Experience

This resync stall turned into a weekend-long learning experience. Here are a few key takeaways:

  • RAID is not a backup. If my workaround hadn’t worked, I could have lost data. This was a wake-up call to investigate offsite or cloud backups.
  • Monitor your drives proactively using S.M.A.R.T. tools and schedule regular tests. Early detection makes a huge difference.
  • Don’t ignore single-drive warnings. Even though RAID 5 can tolerate one disk failure, a second marginal drive discovered mid-rebuild puts the whole array at risk.
  • Create a test NAS at home to experiment with failure scenarios in a safe way. Knowing how RAID handles these problems is crucial before they happen in production.

Preventive Strategies for the Future

To reduce the chances of a repeat incident, I’ve applied the following improvements to my storage setup:

  • Scheduled monthly S.M.A.R.T. long tests with email alerts for any anomalies (a sample configuration is sketched after this list).
  • Enabled TLER (Time-Limited Error Recovery) on SATA drives where supported, so the drive gives up on a bad read quickly and lets the RAID layer handle the failure.
  • Replaced aging drives in a staggered fashion rather than waiting for S.M.A.R.T. failures to pop up.
  • Started evaluating ZFS for the next build, since it handles read errors and bad blocks during a rebuild more gracefully than mdadm RAID 5.
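For the first two items, here is a sketch of the kind of configuration involved; the device names, schedule, and email address are examples, and the smartd.conf path varies by distro:

    # /etc/smartd.conf -- monitor all attributes, run a long self-test on the
    # first of every month at 03:00, and email alerts (address is a placeholder)
    /dev/sdb -a -s L/../01/./03 -m admin@example.com
    /dev/sdc -a -s L/../01/./03 -m admin@example.com

    # Set SCT Error Recovery Control (the TLER-style timeout) to 7 seconds so
    # the drive gives up quickly and lets md handle the error. On most drives
    # this resets at power-on, so run it from a boot script or udev rule.
    smartctl -l scterc,70,70 /dev/sdb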

When to Consider Retiring a Drive

Even though I nursed the array back to health, I didn’t fully trust the drive that caused the trouble. After it was clear the issue came down to a handful of weak sectors, I backed up the entire array and permanently removed the drive.

The warning signs to look for (a quick smartctl check is sketched after this list):

  • More than a few reallocated sectors
  • Any increase in pending or uncorrectable sectors
  • Extended read times or kernel I/O errors
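A quick way to check all three on a suspect drive (the device name is a placeholder; 5, 197, and 198 are the standard attribute IDs for reallocated, pending, and offline-uncorrectable sectors):

    # Pull just the three attributes that matter here
    smartctl -A /dev/sdX | grep -E '^ *(5|197|198) '

    # Run a short self-test, then review the result and the drive's error log
    smartctl -t short /dev/sdX
    smartctl -l selftest /dev/sdX
    smartctl -l error /dev/sdX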

Hard drives are mechanical devices and do fail. If your drive starts developing bad sectors—even if they’re remapped successfully—it’s only a matter of time.

Conclusion

This experience underscored how fragile data integrity can be when working with RAID arrays and how vital it is to pair RAID with diagnostics and backup strategies. Had I not investigated the log messages and taken proactive steps, the rebuild would never have completed, putting my data at risk. With a few informed tools and commands, it’s often possible to outmaneuver even stubborn rebuild issues like bad sectors—at least long enough to recover or migrate data safely.

In technology, we often say, “Hope for the best, but prepare for the worst.” RAID gives us a comfortable sense of safety, but it doesn’t guarantee recoverability unless we stay proactive. Monitor, test, and back up. Especially when you’re at 99% and counting.
