Anyraid phase 2; raidz, rebalance, and contraction #18406

Open
pcd1193182 wants to merge 34 commits into openzfs:master from KlaraSystems:anyraid_phase2

Conversation

@pcd1193182
Contributor

pcd1193182 commented Apr 7, 2026

Note: only the final 4 commits are part of this review; all previous commits are part of #17567, which is a prerequisite for this PR.

Sponsored by: Eshtek, creators of HexOS
Sponsored by: Klara, Inc.

Motivation and Context

AnyRAID Phase 1 introduced the AnyRAID architecture and added support for AnyRAID vdevs with mirror-style parity. Additional work, however, is required to support raidz-style parity as well as other features that the AnyRAID architecture can support, like rebalancing tiles between disks and removing disks from the AnyRAID vdev.

Description

This PR has a few sets of changes.

The first is a set of changes that fix small issues in the AnyRAID code and generally prepare for later patches.

The second is the raidz-style parity option for AnyRAID. This is implemented as a separate set of vdev ops that can be selected with the appropriate command line arguments. anyraidz has an additional parameter to anymirror; in addition to a parity count, it also has a data width. anyraidzX:Y will store data in a similar layout to a raidzX vdev with Y child disks; any additional disks in the AnyRAID vdev will be use to balance the distribution of tiles and provide improved space utilization. The anyraidz IO code mostly leverages the existing raidz code to construct the raidz map and dispatch the child IOs, but it does have an additional step where the offsets and disk IDs are modified to take into account the indirection of the tile map. In addition to the core changes to enable raidz parity, there are some other changes to the AnyRAID code to take into account that stripe widths are no longer just nparity + 1, and to more easily check if a vdev is an AnyRAID vdev of either type elsewhere in the kernel. Finally, there are additional unit tests and ztest support for raidz-style parity.

The next commit adds the rebalance functionality to AnyRAID. When a new device is added to an AnyRAID vdev, the new device has no allocated tiles. If there are sufficient free tiles left in the existing vdevs to take advantage of all the new space, that's mostly not a problem (though we still don't get the IOps/bandwidth of the new device until tiles are allocated on it). If, however, the vdev was very close to full, the new device cannot be fully utilized. Rebalance solves this problem by selecting tiles from existing vdevs and moving them to the newly added, empty device.

Rebalance borrows some of its architecture from the raidz expansion feature. First, a plan is generated that moves tiles from fuller disks to emptier disks. We repeatedly consider the fullest disk, select an allocated tile on that disk, and find a new place to move it. If there is nowhere to move it that is less full than the source disk, the balancing process is finished. There is some additional logic here to avoid moves that would place multiple tiles from the same stripe on the same physical device. Once the plan is generated, it is persisted into a MOS object. Then the relocate thread is started. The relocate thread is much like the raidz expansion thread; it runs in the background, issuing IO and completing tasks, and its progress is synced out each txg. Once all the tasks have been completed, the relocation is marked as complete. At that point a scrub is started to check that all of the data was moved correctly. Until that scrub completes, we don't actually free the tiles that were previously in use, so that we can attempt data recovery if something went badly wrong.
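The plan-generation loop described above can be sketched like so. This is illustrative Python, not the real C planner: it assumes equal-weight tiles and omits the same-stripe constraint the real code enforces, and all names are hypothetical:

```python
def make_rebalance_plan(tile_counts, capacities):
    """Greedy plan sketch: repeatedly take a tile from the fullest disk
    and give it to the emptiest disk, stopping when no destination would
    remain strictly less full than the source.  Fullness is measured as
    allocated tiles / capacity in tiles."""
    counts = list(tile_counts)
    plan = []
    while True:
        fullness = [c / cap for c, cap in zip(counts, capacities)]
        src = max(range(len(counts)), key=lambda i: fullness[i])
        dst = min(range(len(counts)), key=lambda i: fullness[i])
        # Stop if moving one tile would make the destination at least as
        # full as the source; there is nowhere useful left to move to.
        if (counts[dst] + 1) / capacities[dst] >= counts[src] / capacities[src]:
            break
        counts[src] -= 1
        counts[dst] += 1
        plan.append((src, dst))
    return plan, counts

# Three equal disks, the third newly added and empty: the plan drains
# tiles from the two full disks onto the new one until they even out.
plan, counts = make_rebalance_plan([10, 10, 0], [16, 16, 16])
```

The real planner works tile-by-tile against on-disk state and must also reject destinations that already hold a tile from the same stripe.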

The rebalance commit also includes some tests and ztest support.

The final commit is the contraction feature. Contraction addresses the desire to remove devices from an AnyRAID vdev. When storage capacity is no longer needed, or a device is beginning to show signs of age but replacements are not available, the ability to fully remove a child device from the AnyRAID vdev is valuable.

Contraction has a similar overall architecture to rebalance, with a few caveats. The first is that if a plan cannot be generated to move all of the tiles off of the device being removed, we have to fail the contraction. This is not expected to be a major issue in practice unless the vdev is nearly full. There can be cases where the implemented algorithm fails but a different algorithm would find a valid solution; for now, the greedy algorithm does a good enough job that this isn't necessary to fix, but it could be an area of future work. Second, because contraction actually reduces the allocatable size of a vdev, there are restrictions around that. The vdev cannot be contracted if there is data allocated in the tail area of the vdev that would no longer be present once contraction completes; that data would need to be moved first. In the future, a mini-device-removal process could allow that data to be moved if sufficient free space exists in earlier tiles. Finally, because contraction requires an actual change to the vdev configuration, it cannot be performed if there is a checkpoint.
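A minimal sketch of the contraction planning step, including the fail-if-no-valid-placement behavior, might look like the following. This is illustrative Python with entirely hypothetical names; the stripe constraint is modeled simply as "no two tiles of one stripe on one disk":

```python
def plan_contraction(remove, tiles, capacities):
    """Sketch: every tile on the disk being removed must find a home on
    another disk that (a) has a free tile slot and (b) does not already
    hold a tile from the same stripe.  Returns the move list, or None if
    the greedy pass fails -- mirroring how the real feature fails the
    contraction rather than reduce redundancy.
    `tiles` maps disk id -> set of stripe ids stored on that disk."""
    free = {d: capacities[d] - len(tiles[d]) for d in tiles}
    moves = []
    for stripe in sorted(tiles[remove]):
        # Greedily prefer the destination with the most free slots.
        candidates = [d for d in tiles
                      if d != remove and free[d] > 0 and stripe not in tiles[d]]
        if not candidates:
            return None  # no valid placement; contraction must fail
        dst = max(candidates, key=lambda d: free[d])
        tiles[dst].add(stripe)
        free[dst] -= 1
        moves.append((stripe, remove, dst))
    tiles[remove].clear()
    return moves

# Removing disk 0: both of its tiles can land on the empty disk 3.
tiles = {0: {1, 2}, 1: {1, 3}, 2: {2, 3}, 3: set()}
moves = plan_contraction(0, tiles, {0: 4, 1: 4, 2: 4, 3: 4})
```

As noted above, a smarter planner that backtracks could succeed in some configurations where a greedy pass like this returns None.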

The contraction patch contains the plan generation logic for contraction, as well as some other changes around starting and managing a contraction. It also contains changes to the vdev logic necessary to allow for shrinking a top-level vdev, something that was not previously possible. Finally, zfs-test and ztest support are added for contraction.

How Has This Been Tested?

There are a variety of new zfs-test suite tests that were written for this project. In addition, ztest support was added for many of the new features. Both were run repeatedly and extensively to find and resolve issues.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Primarily augmenting the vdev_anyraid_mapped logic to be txg-aware.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
There are a few changes needed for this, but none is major: genericizing
the existing anyraid code, adding a concept of "width" to account for
data columns as well as parity, and adding some helper functions. The
main work is adding the interface to the vdev_raidz code in
vdev_anyraid.c. We need to redo some of the raidz map creation process
to correct the offsets and child IDs of the columns in the map. We also
need to implement a couple more of the vdev_ops functions that raidz
needs but mirrors don't, and handle both cases properly. Finally, we
add some anyraidz cases to the tests.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
This patch adds a new command and functionality to the AnyRAID vdev
type. Currently, when new devices are added to an AnyRAID vdev, they
have no tiles in use. This is not ideal from either a performance
perspective (since the new device won't be serving any read traffic) or
a space efficiency perspective (because the new device needs to work
with other devices to store stripes of tiles). What we would like to do
is be able to move some of the tiles from existing children to the new
device, to better balance data across the disks. That is what the
rebalance command is designed to do.

Rebalance operates in three phases. First, a plan is generated that
details which tiles will be moved to a new location. Second, that plan
is executed; each tile is moved in turn. Finally, a scrub is run to
verify that the rebalance completed successfully and all data is
intact. Until that final scrub occurs, no data is actually written to
the old tile locations, to ensure that the data can be recovered if
something goes wrong.

Plan generation works by considering each vdev and determining whether
it is under- or overloaded, based on the number of tiles allocated
overall and the size of each vdev. Then we consider the tiles on the
overloaded vdevs and see which ones can be moved to the underloaded
vdevs. This algorithm is not optimal; we use a greedy approach, which
works fine in practice. More complex algorithms that roll back and
retry earlier decisions to produce a better outcome could be
implemented in the future.

The actual data movement process reuses much of the logic from raidz
expansion. An async thread is created that issues the reads and writes,
and rangelocks are used to protect the data that is actively being moved
from concurrent modification/access. Progress is synced every txg, so
the process can easily resume if the system restarts.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>
Contraction is the last piece of the puzzle in making AnyRAID flexible,
letting users add and remove storage from their vdevs as they wish.
Thus far, it has not been possible to shrink the
logical size of a vdev in ZFS; mirrors can be detached, but that doesn't
actually reduce the available space in the pool, just the amount of
parity in the mirror. Device removal removes a whole top-level vdev, but
doesn't shrink an individual vdev. Contraction is different; it actually
shrinks the top-level vdev but leaves it in place.

This works by taking all the tiles on the leaf vdev being removed and
moving them to other devices in the anyraid vdev. This does impose some
restrictions, since if we cannot find a way to move all the tiles to
other leaf vdevs, the contraction cannot happen without reducing
redundancy. This plan generation step works similarly to rebalance, and
could be improved in the future with more advanced algorithms.

The second and third phases of contraction are also similar to rebalance,
and much of the code is shared between the two. The final phase is the
actual shrinking of the top-level vdev, which only takes a single
txg. In order to shrink a top-level vdev we need to ensure that no data
is present in the metaslabs that will be removed when the asize is
reduced; if there is anything allocated in those regions we fail to
contract, even if that data could be moved. Future work may allow for a
device-removal-like remapping of that data before contraction occurs.
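The tail restriction amounts to a simple overlap check against the region being cut off. A minimal sketch, in illustrative Python with hypothetical names rather than the actual metaslab code:

```python
def can_shrink(new_asize, allocated_extents):
    """Sketch of the tail check: shrinking the top-level vdev to
    new_asize is only allowed if no allocated extent reaches into the
    region being removed.  Extents are (offset, length) pairs."""
    return all(off + length <= new_asize for off, length in allocated_extents)
```

If any extent crosses the new asize boundary, the contraction fails, even though that data could in principle be relocated; that relocation is the future work mentioned above.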

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Contributions-by: JT Pennington <jt.pennington@klarasystems.com>