ZIL: restore some things lost in "ZIL-crash" review (#17398)#17622
Closed
robn wants to merge 5 commits intoopenzfs:masterfrom
Closed
ZIL: restore some things lost in "ZIL-crash" review (#17398)#17622robn wants to merge 5 commits intoopenzfs:masterfrom
robn wants to merge 5 commits intoopenzfs:masterfrom
Conversation
I'm soon about to need another LWB flag, and boolean_t is just so big for only storing a single bit. Changing to a bitfield is far less wasteful. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
If the ZIL crashed, any outstanding LWBs are no longer interesting, so if they return, we need to just clean them up and return, not try to do any work on them. This is true even if they return success, as that may be long after the pool suspended and resumed, depending on when/if the kernel decides to return the IO to us. In particular, we must not try to get the "next" LWB from zl_lwb_list, since they're no longer on that list. So, we put a flag on in-flight LWBs in zil_crash() when we move them from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when they return. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Just making it easier to not get the locking and broadcast wrong. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Make sure we properly inform the nolwb waiters of the error, and don't keep trying. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
amotin
approved these changes
Aug 12, 2025
Member
amotin
left a comment
There was a problem hiding this comment.
I did have feeling we have too many concurrent things going on there...
behlendorf
approved these changes
Aug 12, 2025
behlendorf
pushed a commit
that referenced
this pull request
Aug 12, 2025
If the ZIL crashed, any outstanding LWBs are no longer interesting, so if they return, we need to just clean them up and return, not try to do any work on them. This is true even if they return success, as that may be long after the pool suspended and resumed, depending on when/if the kernel decides to return the IO to us. In particular, we must not try to get the "next" LWB from zl_lwb_list, since they're no longer on that list. So, we put a flag on in-flight LWBs in zil_crash() when we move them from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when they return. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622
behlendorf
pushed a commit
that referenced
this pull request
Aug 12, 2025
Just making it easier to not get the locking and broadcast wrong. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622
behlendorf
pushed a commit
that referenced
this pull request
Aug 12, 2025
Make sure we properly inform the nolwb waiters of the error, and don't keep trying. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622
behlendorf
pushed a commit
that referenced
this pull request
Aug 12, 2025
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622
spauka
pushed a commit
to spauka/zfs
that referenced
this pull request
Aug 30, 2025
I'm soon about to need another LWB flag, and boolean_t is just so big for only storing a single bit. Changing to a bitfield is far less wasteful. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
spauka
pushed a commit
to spauka/zfs
that referenced
this pull request
Aug 30, 2025
If the ZIL crashed, any outstanding LWBs are no longer interesting, so if they return, we need to just clean them up and return, not try to do any work on them. This is true even if they return success, as that may be long after the pool suspended and resumed, depending on when/if the kernel decides to return the IO to us. In particular, we must not try to get the "next" LWB from zl_lwb_list, since they're no longer on that list. So, we put a flag on in-flight LWBs in zil_crash() when we move them from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when they return. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
spauka
pushed a commit
to spauka/zfs
that referenced
this pull request
Aug 30, 2025
Just making it easier to not get the locking and broadcast wrong. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
spauka
pushed a commit
to spauka/zfs
that referenced
this pull request
Aug 30, 2025
Make sure we properly inform the nolwb waiters of the error, and don't keep trying. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
spauka
pushed a commit
to spauka/zfs
that referenced
this pull request
Aug 30, 2025
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jan 30, 2026
I'm soon about to need another LWB flag, and boolean_t is just so big for only storing a single bit. Changing to a bitfield is far less wasteful. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jan 30, 2026
If the ZIL crashed, any outstanding LWBs are no longer interesting, so if they return, we need to just clean them up and return, not try to do any work on them. This is true even if they return success, as that may be long after the pool suspended and resumed, depending on when/if the kernel decides to return the IO to us. In particular, we must not try to get the "next" LWB from zl_lwb_list, since they're no longer on that list. So, we put a flag on in-flight LWBs in zil_crash() when we move them from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when they return. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jan 30, 2026
Just making it easier to not get the locking and broadcast wrong. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jan 30, 2026
Make sure we properly inform the nolwb waiters of the error, and don't keep trying. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jan 30, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
snajpa
pushed a commit
to vpsfreecz/zfs
that referenced
this pull request
Feb 15, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622 Upstream: 531568f Ported-by: codex-5.3-xhigh Porting-notes: clean cherry-pick.
snajpa
pushed a commit
to vpsfreecz/zfs
that referenced
this pull request
Feb 15, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622 Upstream: 531568f Ported-by: codex-5.3-xhigh Porting-notes: clean cherry-pick.
snajpa
pushed a commit
to vpsfreecz/zfs
that referenced
this pull request
Feb 16, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622 Upstream: 531568f Ported-by: codex-5.3-xhigh Porting-notes: clean cherry-pick.
lundman
pushed a commit
to openzfsonwindows/openzfs
that referenced
this pull request
Feb 23, 2026
I'm soon about to need another LWB flag, and boolean_t is just so big for only storing a single bit. Changing to a bitfield is far less wasteful. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonwindows/openzfs
that referenced
this pull request
Feb 23, 2026
If the ZIL crashed, any outstanding LWBs are no longer interesting, so if they return, we need to just clean them up and return, not try to do any work on them. This is true even if they return success, as that may be long after the pool suspended and resumed, depending on when/if the kernel decides to return the IO to us. In particular, we must not try to get the "next" LWB from zl_lwb_list, since they're no longer on that list. So, we put a flag on in-flight LWBs in zil_crash() when we move them from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when they return. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonwindows/openzfs
that referenced
this pull request
Feb 23, 2026
Just making it easier to not get the locking and broadcast wrong. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonwindows/openzfs
that referenced
this pull request
Feb 23, 2026
Make sure we properly inform the nolwb waiters of the error, and don't keep trying. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
lundman
pushed a commit
to openzfsonwindows/openzfs
that referenced
this pull request
Feb 23, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622
snajpa
pushed a commit
to vpsfreecz/zfs
that referenced
this pull request
Mar 1, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622 Upstream: 531568f Ported-by: codex-5.3-xhigh Porting-notes: clean cherry-pick.
snajpa
pushed a commit
to vpsfreecz/zfs
that referenced
this pull request
Apr 28, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622 Upstream: 531568f Ported-by: Pavel Snajdr <snajpa@snajpa.net> Porting-notes: clean cherry-pick.
snajpa
pushed a commit
to vpsfreecz/zfs
that referenced
this pull request
Apr 30, 2026
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#17622 Upstream: 531568f Ported-by: Pavel Snajdr <snajpa@snajpa.net> Porting-notes: clean cherry-pick.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
I was reviewing another PR and noticed that the
lwb_flagsfield wasn't there. Looking closer, and I realised that when I rebased in #17398 (comment), I did it on an older work branch that didn't have all of more recent changes from review. This PR restores those.Description
This was done by reading the diff between ef66c53 (the last position of #17398 before my screw-up) and master and carefully reapplying the changes.
Review comments in #17398 have more context.
Refactor/cleanup:
lwb_flags, makelwb_slogandlwb_slimbe flagszil_commit_waiter_skip()to bezil_commit_waiter_done()and take an error, and use itFunctional:
LWB_FLAG_CRASHEDto in-flight LWBs inzil_crash(), and use it to know to take no further action. This is mainly to avoid callinglist_next(&zl_lwb_list, lwb)whenlwbis onzl_lwb_crash_listvia the same node, but being able to clearly identify an LWB as "dead" is generally useful.zil_process_commit_list(), stop trying, and get an error back to unassigned waitersHow Has This Been Tested?
zilandfailmodetest tags succeeded, though none of them really exercise these fairly extreme edge cases.This stuff was all approved in #17398 though so hopefully its not too onerous (provided I didn't screw it up again).
Types of changes
Checklist:
Signed-off-by.