ZTS: resilver_restart_001 improvements #18434
Merged
behlendorf merged 1 commit into openzfs:master on Apr 16, 2026
Pull request overview
Improves reliability of the resilver_restart_001 ZTS test by reducing races and making vdev state transitions deterministic, aiming to eliminate intermittent CI failures.
Changes:
- Switch test pool topology to `raidz2` to better tolerate multiple device disruptions during the test.
- Reduce replace/suspend timing races by setting `SCAN_SUSPEND_PROGRESS` before starting `zpool replace`.
- Add explicit waits for vdev state transitions and force TXG syncs (`sync_pool ... true`) to stabilize sequencing; dump `zpool events` during cleanup for debugging.
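The explicit waits for vdev state transitions amount to polling until a device reports the expected state or a timeout expires. A minimal sketch of that pattern in plain shell follows. `wait_for_state` and `get_state` are hypothetical names used for illustration, not the ZTS helpers, and `get_state` is a stub so the example runs without a pool.

```shell
#!/bin/sh
# Hypothetical stub: reports the state of a device. A real test would query
# the pool instead; here the state flips to ONLINE on the third call so the
# polling loop has something to wait for. The call count lives in a file
# because get_state runs inside a command substitution (a subshell).
STATE_FILE="${TMPDIR:-/tmp}/wait_for_state.$$"
echo 0 > "$STATE_FILE"
trap 'rm -f "$STATE_FILE"' EXIT

get_state() {
	calls=$(cat "$STATE_FILE")
	calls=$((calls + 1))
	echo "$calls" > "$STATE_FILE"
	if [ "$calls" -ge 3 ]; then echo "ONLINE"; else echo "OFFLINE"; fi
}

# Poll until the device reaches the expected state or the attempt limit is
# exhausted. Returns 0 on success and 1 on timeout.
wait_for_state() {
	device=$1
	expected=$2
	attempts=$3
	i=0
	while [ "$i" -lt "$attempts" ]; do
		[ "$(get_state "$device")" = "$expected" ] && return 0
		i=$((i + 1))
		sleep 1
	done
	return 1
}

# Usage: allow up to 10 attempts for the (stubbed) device to come online.
wait_for_state disk1 ONLINE 10 && echo "disk1 is ONLINE"
```

Bounding the loop by an attempt count makes a stuck transition surface as a clear failure rather than a hung test.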
The resilver_restart_001 test case has not been entirely reliable when run under the CI. Address several small issues which may be responsible.

- Configure the pool as raidz2 instead of raidz1 since the test offlines two devices. This ensures the second device is marked as OFFLINE instead of DEGRADED.
- Start the zpool replace after setting SCAN_SUSPEND_PROGRESS to close any potential race where the replace finishes too quickly.
- Wait for the offlined/onlined vdevs to fully transition to the expected state during the test.
- Add the true flag to sync_pool to force a TXG sync to happen even if it might not otherwise be required.
- During cleanup, dump the zpool events history to aid debugging if the updated test case is still unreliable in the CI.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
tonyhutter approved these changes Apr 16, 2026
Motivation and Context
Resolve the occasional CI failures for this test.
https://github.com/openzfs/zfs/actions/runs/24217052638/job/70700070007?pr=18387
Description
The resilver_restart_001 test case has not been entirely reliable when run under the CI. Address several small issues which may be responsible.
- Configure the pool as raidz2 instead of raidz1 since the test offlines two devices and raidz1 tolerates the loss of only one. This ensures the second device is marked as OFFLINE instead of DEGRADED.
- Start the zpool replace after setting SCAN_SUSPEND_PROGRESS to close any potential race where the replace finishes too quickly.
- Wait for the offlined/onlined vdevs to fully transition to the expected state during the test.
- Add the true flag to sync_pool to force a TXG sync to happen even if it might not otherwise be required.
- During cleanup, dump the zpool events history to aid debugging if the updated test case is still unreliable in the CI.
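One ordering consistent with the points above can be sketched as follows. This is an illustrative outline under stated assumptions, not the test itself. `set_tunable` and `wait_for_vdev_state` are hypothetical names, `sync_pool` and `zpool` are replaced by stubs that only record each step, and the pool and device names are placeholders.

```shell
#!/bin/sh
# Illustrative stubs only: each records the step instead of touching a pool.
# set_tunable and wait_for_vdev_state are hypothetical names; sync_pool and
# zpool are stand-ins for the real helper and command.
STEPS=""
record() { STEPS="${STEPS}${STEPS:+ > }$*"; }
set_tunable()         { record "set $1=$2"; }
zpool()               { record "zpool $1"; }
sync_pool()           { record "sync_pool force=$2"; }
wait_for_vdev_state() { record "wait $1 $2"; }

POOL=testpool   # placeholder pool name

run_sequence() {
	# 1. Suspend scan progress first so the replace cannot finish early.
	set_tunable SCAN_SUSPEND_PROGRESS 1
	# 2. Only then start the replace.
	zpool replace "$POOL" disk1 spare1
	# 3. Offline a device and wait for the state change to be visible.
	zpool offline "$POOL" disk2
	wait_for_vdev_state disk2 OFFLINE
	# 4. Force a TXG sync even if one would not otherwise be required.
	sync_pool "$POOL" true
}

run_sequence
echo "$STEPS"
```

Recording the steps in a single string makes the ordering easy to inspect: the tunable is set before the replace begins, and the forced sync comes only after the state wait.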
How Has This Been Tested?
Tested locally, but will be verified by the CI.
Types of changes