Anyraid phase 2; raidz, rebalance, and contraction#18406
Open
pcd1193182 wants to merge 34 commits into openzfs:master
Conversation
Primarily augmenting the vdev_anyraid_mapped logic to be txg-aware.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
There are a few changes needed for this, but none is major: genericizing the existing anyraid code, adding a concept of "width" to account for data columns as well as parity, and adding some helper functions. The main work is mostly adding the interface to the vdev_raidz code in vdev_anyraid.c. We need to redo some of the raidz map creation process to correct the offsets and child IDs of the columns in the map. We also need to implement a couple more of the vdev_ops functions that raidz needs that mirrors don't, and handle both cases properly. Finally, we add some anyraidz cases to the tests.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
This patch adds a new command and functionality to the AnyRAID vdev type. Currently, when new devices are added to an AnyRAID vdev, they have no tiles in use. This is not ideal from either a performance perspective (since the new device won't be serving any read traffic) or a space-efficiency perspective (because the new device needs to work with other devices to store stripes of tiles). What we would like to do is move some of the tiles from existing children to the new device, to better balance data across the disks. That is what the rebalance command is designed to do.

Rebalance operates in three phases. First, a plan is generated that details which tiles will be moved to a new location. Second, that plan is executed; each tile is moved in turn. Finally, a scrub is run to verify that the rebalance completed successfully and all data is intact. Until that final scrub completes, no data is actually written to the old tile locations, to ensure that the data can be recovered if something goes wrong.

Plan generation works by considering each vdev and determining whether it is under- or overloaded, based on the number of tiles allocated overall and the size of each vdev. Then we consider the tiles on the overloaded vdevs and see which ones can be moved to the underloaded vdevs. This algorithm is not optimal; we use a greedy algorithm, which works fine in practice. More complex algorithms that could roll back and retry earlier decisions to produce a more optimal outcome could be implemented in the future.

The actual data movement process reuses much of the logic from raidz expansion. An async thread is created that issues the reads and writes, and rangelocks are used to protect the data that is actively being moved from concurrent modification/access. Progress is synced every txg, so the process can easily resume if the system restarts.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
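The three rebalance phases (plan, execute, verify-by-scrub) can be sketched as a simple driver loop. This is an illustrative Python sketch under stated assumptions, not the actual OpenZFS implementation; all names here are hypothetical:

```python
from enum import Enum, auto

class RebalancePhase(Enum):
    PLAN = auto()     # generate the list of tile moves
    EXECUTE = auto()  # move each tile in turn, syncing progress every txg
    SCRUB = auto()    # verify all data before old tile locations are reused
    DONE = auto()

def run_rebalance(plan_fn, move_fn, scrub_fn):
    """Drive the three phases. Old tile locations are only given up after
    the verifying scrub succeeds, mirroring the recovery guarantee above."""
    plan = plan_fn()                  # phase 1: plan generation
    for move in plan:                 # phase 2: execute, one tile at a time
        move_fn(move)
    if not scrub_fn():                # phase 3: verifying scrub
        raise RuntimeError("scrub failed; old tile locations still intact")
    return RebalancePhase.DONE
```

The key design point mirrored here is that a failed scrub aborts before the old locations are released, so recovery remains possible.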
Contraction is the last piece of the puzzle for allowing AnyRAID to be flexible in letting users add and use storage in their vdevs as they wish. Thus far, it has not been possible to shrink the logical size of a vdev in ZFS; mirrors can be detached, but that doesn't actually reduce the available space in the pool, just the amount of parity in the mirror. Device removal removes a whole top-level vdev, but doesn't shrink an individual vdev. Contraction is different; it actually shrinks the top-level vdev but leaves it in place.

This works by taking all the tiles on the leaf vdev being removed and moving them to other devices in the AnyRAID vdev. This does impose some restrictions: if we cannot find a way to move all the tiles to other leaf vdevs, the contraction cannot happen without reducing redundancy. This plan generation step works similarly to rebalance, and could be improved in the future with more advanced algorithms. The second and third phases of contraction are also similar to rebalance, and much of the code is shared between the two.

The final phase is the actual shrinking of the top-level vdev, which only takes a single txg. In order to shrink a top-level vdev we need to ensure that no data is present in the metaslabs that will be removed when the asize is reduced; if there is anything allocated in those regions, the contraction fails, even if that data could be moved. Future work may allow for a device-removal-like remapping of that data before contraction occurs.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
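The final safety check (refusing to shrink while anything is still allocated in the region being removed) can be modeled as a simple tail-overlap test. This is a hypothetical illustration, not the actual metaslab logic; range representation and names are assumptions:

```python
def can_contract(new_asize, allocated_ranges):
    """Return True only if no allocation overlaps the tail region that
    would disappear when the vdev's asize is reduced to new_asize.
    allocated_ranges is a hypothetical list of (start, end) byte offsets,
    end-exclusive; any range ending past new_asize blocks contraction."""
    return all(end <= new_asize for (start, end) in allocated_ranges)
```

In the sketch, a single allocation past the new asize makes the check fail, matching the described behavior of failing the contraction rather than moving that data.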
svistoi reviewed Apr 8, 2026
Note: only the final 4 commits are part of this review; all previous commits are part of #17567, which is a prerequisite for this PR.
Sponsored by: Eshtek, creators of HexOS
Sponsored by: Klara, Inc.
Motivation and Context
AnyRAID Phase 1 introduced the AnyRAID architecture and added support for AnyRAID vdevs with mirror-style parity. Additional work, however, is required to support raidz-style parity as well as other features that the AnyRAID architecture can support, like rebalancing tiles between disks and removing disks from the AnyRAID vdev.
Description
This PR has a few sets of changes.
The first is a set of changes that fix small issues in the AnyRAID code and generally prepare for later patches.
The second is the raidz-style parity option for AnyRAID. This is implemented as a separate set of vdev ops that can be selected with the appropriate command line arguments.
`anyraidz` has one additional parameter compared to `anymirror`; in addition to a parity count, it also has a data width. `anyraidzX:Y` will store data in a similar layout to a `raidzX` vdev with Y child disks; any additional disks in the AnyRAID vdev will be used to balance the distribution of tiles and provide improved space utilization. The `anyraidz` IO code mostly leverages the existing raidz code to construct the raidz map and dispatch the child IOs, but it does have an additional step where the offsets and disk IDs are modified to take into account the indirection of the tile map. In addition to the core changes to enable raidz parity, there are some other changes to the AnyRAID code to take into account that stripe widths are no longer just nparity + 1, and to more easily check elsewhere in the kernel whether a vdev is an AnyRAID vdev of either type. Finally, there are additional unit tests and ztest support for raidz-style parity.

The next commit adds the rebalance functionality to AnyRAID. When a new device is added to an AnyRAID vdev, the new device has no allocated tiles. If there are sufficient free tiles left in the existing vdevs to take advantage of all the new space, that's not a problem (mostly; we still aren't getting to use the IOps/bandwidth of the new device until tiles are allocated on it). If, however, the vdev was very close to full, then the new device will not be able to be fully utilized. Rebalance solves this problem by selecting tiles from existing vdevs and moving them to the newly empty device.
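The tile-map indirection step, where column offsets and disk IDs are remapped before child IOs are dispatched, can be illustrated with a rough sketch. The tile size, map layout, and names here are hypothetical, not the actual on-disk format:

```python
TILE_SIZE = 1 << 30  # hypothetical tile size; not the actual on-disk value

def remap_column(tile_map, logical_offset):
    """Translate a logical column offset into a (child disk id, physical
    offset) pair through the tile map. tile_map is an assumed dict mapping
    a logical tile index to a (disk_id, physical_tile_index) pair."""
    tile_index, offset_in_tile = divmod(logical_offset, TILE_SIZE)
    disk_id, physical_tile = tile_map[tile_index]
    return disk_id, physical_tile * TILE_SIZE + offset_in_tile
```

The point of the sketch is that the raidz map can be built as if the vdev were a plain raidz, with each column then redirected through the tile map to the child and physical offset that actually hold that tile.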
Rebalance borrows some of its architecture from the raidz expansion feature. First, a plan is generated that moves tiles from fuller disks to emptier disks. We repeatedly consider the fullest disk, select an allocated tile on that disk, and find a new place to move it. If there is nowhere to move it that is less full than the source disk, the balancing process is finished. Note that there is some additional logic here to prevent moves that would put multiple tiles from the same stripe on the same physical device. Once the plan is generated, it is persisted into a MOS object. Then the relocate thread is started. The relocate thread is much like the raidz expansion thread; it runs in the background, issuing IO and completing tasks. Each TXG its progress is synced out. Once the tasks have all been completed, the relocation is marked as complete. At that point a scrub is started, to check that all of the data was moved correctly. Until that scrub completes, we don't actually free the tiles that were previously in use, so that we can attempt data recovery if something went badly wrong.
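The greedy plan generation described above can be sketched roughly as follows. This is an illustrative model only: it ignores the stripe-conflict checks and real vdev accounting, and all names are hypothetical:

```python
def plan_rebalance(tile_counts, capacities):
    """Greedy sketch: repeatedly pick the fullest disk (by fraction of
    tiles used) as the source and the emptiest as the destination, and
    move one tile, stopping once a move would no longer leave the
    destination less full than the source. Returns a list of
    (src, dst) moves."""
    counts = dict(tile_counts)
    moves = []
    while True:
        frac = {d: counts[d] / capacities[d] for d in counts}
        src = max(frac, key=frac.get)
        dst = min(frac, key=frac.get)
        if counts[src] == 0:
            break
        # Stop when moving a tile would not make the destination strictly
        # less full than the source; the layout is as balanced as this
        # greedy strategy can make it.
        if (counts[dst] + 1) / capacities[dst] >= counts[src] / capacities[src]:
            break
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))
    return moves
```

Each iteration strictly reduces the imbalance, so the loop terminates; the stop condition is exactly the "nowhere less full than the source" rule described in the text.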
The rebalance commit also includes some tests and ztest support.
The final commit is the contraction feature. Contraction solves the problem of wanting to be able to remove devices from an AnyRAID vdev. In cases where storage capacity is no longer needed, or a device is beginning to show signs of age but replacements are not available, the ability to fully remove a child device from the AnyRAID vdev is valuable.
Contraction has a similar overall architecture to rebalance, with a few caveats. The first is that if a plan cannot be generated to move all of the tiles off of the device being removed, we have to fail the contraction. This is not expected to be a major issue in practice unless the vdev is nearly full. There can be cases where this happens using the implemented algorithm but a different algorithm would be able to find a valid solution; for now, the greedy algorithm does a good enough job that this isn't necessary to fix, but this could be an area of future work. Second, because contraction actually reduces the allocatable size of a vdev, there are restrictions around that. The vdev cannot be contracted if there is data allocated in the tail area of the vdev that would eventually not be present after contraction would complete, since that data needs to be moved first. In the future, a mini-device-removal process could allow that data to be moved if sufficient free space exists in earlier tiles. Finally, because contraction requires an actual change to the vdev configuration, it cannot be performed if there is a checkpoint.
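The fail-if-no-plan behavior described above can be modeled with a small greedy sketch: every tile on the removed disk must find a home on the remaining disks, or planning fails outright. This is purely illustrative and ignores the stripe-placement constraints; all names are hypothetical:

```python
def plan_contraction(remove_disk, tile_counts, capacities):
    """Try to place every tile from remove_disk onto the remaining disks,
    greedily picking the emptiest destination each time. Returns a list
    of (src, dst) moves, or None if any tile has no room, in which case
    the contraction must fail rather than reduce redundancy."""
    counts = {d: c for d, c in tile_counts.items() if d != remove_disk}
    moves = []
    for _ in range(tile_counts[remove_disk]):
        dst = min(counts, key=lambda d: counts[d] / capacities[d])
        if counts[dst] >= capacities[dst]:
            return None  # no valid plan exists under this greedy strategy
        counts[dst] += 1
        moves.append((remove_disk, dst))
    return moves
```

As the text notes, a greedy planner like this can return None in cases where a smarter algorithm would find a valid placement; that gap is the area of future work mentioned above.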
The contraction patch contains the plan generation logic for contraction, as well as some other changes around starting and managing a contraction. It also contains changes to the vdev logic necessary to allow for shrinking a top-level vdev, something that was not previously possible. Finally, zfs-test and ztest support are added for contraction.
How Has This Been Tested?
There are a variety of new zfs-test suite tests that were written for this project. In addition, ztest support was added for many of the new features. Both were run repeatedly and extensively to find and resolve issues.
Types of changes
Checklist:
Signed-off-by.