draid: fix cksum errors after rebuild with degraded disks#18414
Draft
andriytk wants to merge 1 commit into openzfs:master
behlendorf
reviewed
Apr 8, 2026
36e2521 to
26212a3
Compare
Currently, when more than nparity disks fault during a rebuild, only the first nparity disks go to the faulted state, and all the remaining disks go to the degraded state. It should be possible to read from those degraded disks in order to reconstruct the data correctly during the rebuild. However, when a draid spare with a faulted disk happens to point to such a degraded disk, the vdev_draid_missing() function misses it: the function checks the rebuilding state of the draid spare first and returns true, since that spare is being rebuilt at the moment. This leaves fewer columns than are needed for a correct data rebuild, which eventually shows up as cksum errors during scrub.

Imagine a draid1 vdev where d1 is resilvered to s1, faulted d2 is being rebuilt to s2, and d3, in degraded state, is being rebuilt to s3 at the same time. If at some offset (slice) the s1 hot spare maps to the spare with d2+s2, and s2 maps to the spare with d3+s3, from which we can still read data because d3 is only degraded, we end up with more than one missing column: vdev_draid_missing(d2+s2) returns true without checking d3+s3, to which s2 maps.

Solution: in the vdev_draid_missing() function, if a draid spare points to another spare which is also rebuilding, go and check that spare first.

Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
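The d1/d2/d3 scenario above can be traced with a small toy model. This is only a sketch under stated assumptions, not the OpenZFS code or the actual patch: every `toy_*` name, the struct layout, and the `follow_chain` flag are hypothetical, and the per-offset spare mapping is reduced to a single fixed `target` pointer. It only illustrates why returning "missing" as soon as a rebuilding draid spare is seen loses a column that is still readable through a degraded disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; none of these exist in OpenZFS. */
typedef enum { TOY_LEAF, TOY_SPARE, TOY_DSPARE } toy_kind_t;

typedef struct toy_vdev {
	toy_kind_t kind;
	bool readable;            /* false for a faulted disk, true for a degraded one */
	bool dtl_missing;         /* leaf only: data at this offset is missing */
	bool rebuilding;          /* draid spare only: currently being rebuilt */
	struct toy_vdev *child[2]; /* spare vdev only: old disk + draid spare */
	int nchildren;
	struct toy_vdev *target;  /* draid spare only: vdev it maps to at this offset */
} toy_vdev_t;

/* True if vd is a spare vdev that contains a rebuilding draid spare. */
static bool
toy_has_rebuilding_child(const toy_vdev_t *vd)
{
	if (vd == NULL || vd->kind != TOY_SPARE)
		return (false);
	for (int c = 0; c < vd->nchildren; c++) {
		const toy_vdev_t *cvd = vd->child[c];
		if (cvd->kind == TOY_DSPARE && cvd->rebuilding)
			return (true);
	}
	return (false);
}

/*
 * Returns true if the column behind vd must be treated as missing.
 * follow_chain == false models the behaviour described as buggy;
 * follow_chain == true models the described fix (an assumption about
 * its shape, simplified for illustration).
 */
static bool
toy_missing(const toy_vdev_t *vd, bool follow_chain)
{
	if (vd == NULL)
		return (true);

	switch (vd->kind) {
	case TOY_SPARE:
		assert(vd->nchildren <= 2);
		/* Not missing if any readable child still has the data. */
		for (int c = 0; c < vd->nchildren; c++) {
			const toy_vdev_t *cvd = vd->child[c];
			if (!cvd->readable)
				continue;
			if (!toy_missing(cvd, follow_chain))
				return (false);
		}
		return (true);
	case TOY_DSPARE:
		if (vd->rebuilding) {
			/* The fix: look through to a spare that is also rebuilding. */
			if (follow_chain && toy_has_rebuilding_child(vd->target))
				return (toy_missing(vd->target, follow_chain));
			return (true);
		}
		return (toy_missing(vd->target, follow_chain));
	default:
		return (vd->dtl_missing);
	}
}

/* Builds the scenario from the description and checks the column behind s1. */
bool
toy_scenario(bool follow_chain)
{
	toy_vdev_t d2 = { .kind = TOY_LEAF, .readable = false, .dtl_missing = true };
	toy_vdev_t d3 = { .kind = TOY_LEAF, .readable = true, .dtl_missing = false };
	toy_vdev_t s3 = { .kind = TOY_DSPARE, .readable = true, .rebuilding = true };
	toy_vdev_t sp3 = { .kind = TOY_SPARE, .readable = true,
	    .child = { &d3, &s3 }, .nchildren = 2 };
	toy_vdev_t s2 = { .kind = TOY_DSPARE, .readable = true, .rebuilding = true,
	    .target = &sp3 };
	toy_vdev_t sp2 = { .kind = TOY_SPARE, .readable = true,
	    .child = { &d2, &s2 }, .nchildren = 2 };
	toy_vdev_t s1 = { .kind = TOY_DSPARE, .readable = true, .rebuilding = false,
	    .target = &sp2 };

	return (toy_missing(&s1, follow_chain));
}
```

In this model, `toy_scenario(false)` reports the column as missing because the walk stops at the rebuilding s2, while `toy_scenario(true)` follows s2 to the d3+s3 spare and finds the data readable on degraded d3.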
26212a3 to
d7d463d
Compare
This is a follow up after #18405.
How Has This Been Tested?
There should be no cksum errors after scrub.
Types of changes
Checklist:
Signed-off-by.