
draid: fix cksum errors after rebuild with degraded disks #18414

Draft

andriytk wants to merge 1 commit into openzfs:master from andriytk:fixrebuildcksumerrs

Conversation

@andriytk
Contributor

@andriytk andriytk commented Apr 8, 2026

Currently, when more than nparity disks become faulted during a rebuild, only the first nparity disks go to the faulted state; all the remaining disks go to the degraded state. It should be possible to read from those degraded disks in order to reconstruct the data correctly during the rebuild. However, when a draid spare for a faulted disk happens to point to such a degraded disk, it will be missed by the vdev_draid_missing() function, since that function checks the rebuilding state of the draid spare first and, of course, returns true because we are rebuilding this spare at the moment. This results in fewer columns than needed for correct data reconstruction, which eventually manifests as cksum errors during scrub.

Imagine a draid1 vdev where d1 has been resilvered to s1, faulted d2 is being rebuilt to s2, and d3, in the degraded state, is being rebuilt to s3 at the same time. If at some offset (slice) the s1 hot spare maps to the spare pair d2+s2, and s2 maps to the spare pair d3+s3 (from which we can still read data, because d3 is only degraded), we end up with more than 1 missing column, because vdev_draid_missing(d2+s2) returns true without ever checking d3+s3, to which s2 maps.

Solution: in the vdev_draid_missing() function, if a draid spare points to another spare which is also rebuilding, check that spare first.

This is a follow-up to #18405.

How Has This Been Tested?

  1. Create a pool with draid3:4s vdev and populate it with some data.
  2. Fail 4 devices (zpool offline -f); zed would start an automatic sequential resilver, and 3 disks would be faulted while one disk would be degraded (since nparity is 3).
  3. Wait for resilver and scrub to complete.

There should be no cksum errors after scrub.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Apr 8, 2026
@andriytk andriytk force-pushed the fixrebuildcksumerrs branch from 36e2521 to 26212a3 on April 8, 2026 09:43

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
@andriytk andriytk force-pushed the fixrebuildcksumerrs branch from 26212a3 to d7d463d on April 8, 2026 09:55
@andriytk andriytk marked this pull request as draft April 8, 2026 19:26
@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Apr 8, 2026