Anyraid phase 2; raidz, rebalance, and contraction #18406

Open
pcd1193182 wants to merge 34 commits into openzfs:master from KlaraSystems:anyraid_phase2

Conversation

@pcd1193182
Contributor

pcd1193182 commented Apr 7, 2026

Note: only the final 4 commits are part of this review; all previous commits are part of #17567, which is a prerequisite for this PR.

Sponsored by: Eshtek, creators of HexOS
Sponsored by: Klara, Inc.

Motivation and Context

AnyRAID Phase 1 introduced the AnyRAID architecture and added support for AnyRAID vdevs with mirror-style parity. Additional work, however, is required to support raidz-style parity as well as other features that the AnyRAID architecture can support, like rebalancing tiles between disks and removing disks from the AnyRAID vdev.

Description

This PR has a few sets of changes.

The first is a set of changes that fix small issues in the AnyRAID code and generally prepare for later patches.

The second is the raidz-style parity option for AnyRAID. This is implemented as a separate set of vdev ops that can be selected with the appropriate command line arguments. anyraidz has an additional parameter to anymirror; in addition to a parity count, it also has a data width. anyraidzX:Y will store data in a similar layout to a raidzX vdev with Y child disks; any additional disks in the AnyRAID vdev will be use to balance the distribution of tiles and provide improved space utilization. The anyraidz IO code mostly leverages the existing raidz code to construct the raidz map and dispatch the child IOs, but it does have an additional step where the offsets and disk IDs are modified to take into account the indirection of the tile map. In addition to the core changes to enable raidz parity, there are some other changes to the AnyRAID code to take into account that stripe widths are no longer just nparity + 1, and to more easily check if a vdev is an AnyRAID vdev of either type elsewhere in the kernel. Finally, there are additional unit tests and ztest support for raidz-style parity.

The next commit adds the rebalance functionality to AnyRAID. When a new device is added to an AnyRAID vdev, the new device has no allocated tiles. If there are sufficient free tiles left in the existing vdevs to take advantage of all the new space, that's mostly not a problem (though we still don't get the IOps/bandwidth of the new device until tiles are allocated on it). If, however, the vdev was very close to full, the new device cannot be fully utilized. Rebalance solves this problem by selecting tiles from existing vdevs and moving them to the newly added, empty device.

Rebalance borrows some of its architecture from the raidz expansion feature. First, a plan is generated that moves tiles from fuller disks to emptier disks. We repeatedly consider the fullest disk, select an allocated tile on that disk, and find a new place to move it. If there is nowhere to move it that is less full than the source disk, the balancing process is finished. There is some additional logic here to avoid moves that would place multiple tiles from the same stripe on the same physical device. Once the plan is generated, it is persisted into a MOS object. Then the relocate thread is started. The relocate thread is much like the raidz expansion thread; it runs in the background, issuing IO and completing tasks, and its progress is synced out each txg. Once all the tasks have been completed, the relocation is marked as complete. At that point a scrub is started to check that all of the data was moved correctly. Until that scrub completes, we don't actually free the tiles that were previously in use, so that we can attempt data recovery if something went badly wrong.
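The plan-generation loop described above can be sketched like so. This is illustrative Python, not the real C planner: it assumes equal-weight tiles and omits the same-stripe constraint the real code enforces, and all names are hypothetical:

```python
def make_rebalance_plan(tile_counts, capacities):
    """Greedy plan sketch: repeatedly take a tile from the fullest disk
    and give it to the emptiest disk, stopping when no destination would
    remain strictly less full than the source.  Fullness is measured as
    allocated tiles / capacity in tiles."""
    counts = list(tile_counts)
    plan = []
    while True:
        fullness = [c / cap for c, cap in zip(counts, capacities)]
        src = max(range(len(counts)), key=lambda i: fullness[i])
        dst = min(range(len(counts)), key=lambda i: fullness[i])
        # Stop if moving one tile would make the destination at least as
        # full as the source; there is nowhere useful left to move to.
        if (counts[dst] + 1) / capacities[dst] >= counts[src] / capacities[src]:
            break
        counts[src] -= 1
        counts[dst] += 1
        plan.append((src, dst))
    return plan, counts

# Three equal disks, the third newly added and empty: the plan drains
# tiles from the two full disks onto the new one until they even out.
plan, counts = make_rebalance_plan([10, 10, 0], [16, 16, 16])
```

The real planner works tile-by-tile against on-disk state and must also reject destinations that already hold a tile from the same stripe.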

The rebalance commit also includes some tests and ztest support.

The final commit is the contraction feature. Contraction addresses the desire to remove devices from an AnyRAID vdev. When storage capacity is no longer needed, or a device is beginning to show signs of age but replacements are not available, the ability to fully remove a child device from the AnyRAID vdev is valuable.

Contraction has a similar overall architecture to rebalance, with a few caveats. The first is that if a plan cannot be generated to move all of the tiles off of the device being removed, we have to fail the contraction. This is not expected to be a major issue in practice unless the vdev is nearly full. There can be cases where the implemented algorithm fails but a different algorithm would find a valid solution; for now, the greedy algorithm does a good enough job that this isn't necessary to fix, but it could be an area of future work. Second, because contraction actually reduces the allocatable size of a vdev, there are restrictions around that. The vdev cannot be contracted if there is data allocated in the tail area of the vdev that would no longer be present once contraction completes; that data would need to be moved first. In the future, a mini-device-removal process could allow that data to be moved if sufficient free space exists in earlier tiles. Finally, because contraction requires an actual change to the vdev configuration, it cannot be performed if there is a checkpoint.
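A minimal sketch of the contraction planning step, including the fail-if-no-valid-placement behavior, might look like the following. This is illustrative Python with entirely hypothetical names; the stripe constraint is modeled simply as "no two tiles of one stripe on one disk":

```python
def plan_contraction(remove, tiles, capacities):
    """Sketch: every tile on the disk being removed must find a home on
    another disk that (a) has a free tile slot and (b) does not already
    hold a tile from the same stripe.  Returns the move list, or None if
    the greedy pass fails -- mirroring how the real feature fails the
    contraction rather than reduce redundancy.
    `tiles` maps disk id -> set of stripe ids stored on that disk."""
    free = {d: capacities[d] - len(tiles[d]) for d in tiles}
    moves = []
    for stripe in sorted(tiles[remove]):
        # Greedily prefer the destination with the most free slots.
        candidates = [d for d in tiles
                      if d != remove and free[d] > 0 and stripe not in tiles[d]]
        if not candidates:
            return None  # no valid placement; contraction must fail
        dst = max(candidates, key=lambda d: free[d])
        tiles[dst].add(stripe)
        free[dst] -= 1
        moves.append((stripe, remove, dst))
    tiles[remove].clear()
    return moves

# Removing disk 0: both of its tiles can land on the empty disk 3.
tiles = {0: {1, 2}, 1: {1, 3}, 2: {2, 3}, 3: set()}
moves = plan_contraction(0, tiles, {0: 4, 1: 4, 2: 4, 3: 4})
```

As noted above, a smarter planner that backtracks could succeed in some configurations where a greedy pass like this returns None.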

The contraction patch contains the plan generation logic for contraction, as well as some other changes around starting and managing a contraction. It also contains changes to the vdev logic necessary to allow for shrinking a top-level vdev, something that was not previously possible. Finally, zfs-test and ztest support are added for contraction.

How Has This Been Tested?

There are a variety of new zfs-test suite tests that were written for this project. In addition, ztest support was added for many of the new features. Both were run repeatedly and extensively to find and resolve issues.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Primarily augmenting the vdev_anyraid_mapped logic to be txg-aware.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
There are a few changes needed for this, but none is major: genericizing
the existing anyraid code, adding a concept of "width" to account for
data columns as well as parity, and adding some helper functions. The
main work is adding the interface to the vdev_raidz code in
vdev_anyraid.c. We need to redo some of the raidz map creation process
to correct the offsets and child IDs of the columns in the map. We also
need to implement a couple more of the vdev_ops functions that raidz
needs but mirrors don't, and handle both cases properly. Finally, we
add some anyraidz cases to the tests.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
This patch adds a new command and functionality to the AnyRAID vdev
type. Currently, when new devices are added to an AnyRAID vdev, they
have no tiles in use. This is not ideal from either a performance
perspective (since the new device won't be serving any read traffic) or
a space efficiency perspective (because the new device needs to work
with other devices to store stripes of tiles). What we would like to do
is be able to move some of the tiles from existing children to the new
device, to better balance data across the disks. That is what the
rebalance command is designed to do.

Rebalance operates in three phases. First, a plan is generated that
details which tiles will be moved to a new location. Second, that plan
is executed; each tile is moved in turn. Finally, a scrub is run to
verify that the rebalance completed successfully and all data is
intact. Until that final scrub occurs, no data is actually written to
the old tile locations, to ensure that the data can be recovered if
something goes wrong.

Plan generation works by considering each vdev and determining whether
it is under- or overloaded, based on the number of tiles allocated
overall and the size of each vdev. Then we consider the tiles on the
overloaded vdevs and see which ones can be moved to the underloaded
vdevs. This algorithm is not optimal; we use a greedy approach, which
works fine in practice. More complex algorithms that roll back and
retry earlier decisions to produce a better outcome could be
implemented in the future.

The actual data movement process reuses much of the logic from raidz
expansion. An async thread is created that issues the reads and writes,
and rangelocks are used to protect the data that is actively being moved
from concurrent modification/access. Progress is synced every txg, so
the process can easily resume if the system restarts.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
Contraction is the last piece of the puzzle in making AnyRAID flexible,
letting users add and remove storage from their vdevs as they wish.
Thus far, it has not been possible to shrink the
logical size of a vdev in ZFS; mirrors can be detached, but that doesn't
actually reduce the available space in the pool, just the amount of
parity in the mirror. Device removal removes a whole top-level vdev, but
doesn't shrink an individual vdev. Contraction is different; it actually
shrinks the top-level vdev but leaves it in place.

This works by taking all the tiles on the leaf vdev being removed and
moving them to other devices in the anyraid vdev. This does impose some
restrictions, since if we cannot find a way to move all the tiles to
other leaf vdevs, the contraction cannot happen without reducing
redundancy. This plan generation step works similarly to rebalance, and
could be improved in the future with more advanced algorithms.

The second and third phases of contraction are also similar to rebalance,
and much of the code is shared between the two. The final phase is the
actual shrinking of the top-level vdev, which only takes a single
txg. In order to shrink a top-level vdev we need to ensure that no data
is present in the metaslabs that will be removed when the asize is
reduced; if there is anything allocated in those regions we fail to
contract, even if that data could be moved. Future work may allow for a
device-removal-like remapping of that data before contraction occurs.
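The tail restriction amounts to a simple overlap check against the region being cut off. A minimal sketch, in illustrative Python with hypothetical names rather than the actual metaslab code:

```python
def can_shrink(new_asize, allocated_extents):
    """Sketch of the tail check: shrinking the top-level vdev to
    new_asize is only allowed if no allocated extent reaches into the
    region being removed.  Extents are (offset, length) pairs."""
    return all(off + length <= new_asize for off, length in allocated_extents)
```

If any extent crosses the new asize boundary, the contraction fails, even though that data could in principle be relocated; that relocation is the future work mentioned above.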

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>