Detect a slow raidz child during reads #17227
Because there will be a little bouncing between two PRs, and because there are two different authors involved, I'll be pushing fixup commits to this branch. Once everyone is happy with review, I will squash them down for merge. I'll close out the remaining review comments on #16900, and would like it if new comments could be added here. Thanks all for your patience; I know it's a bit fiddly (it'd be nicer if GitHub would allow a PR to change branches, alas).
I haven't looked into it, but I see all the FreeBSD runners have:
@robn this fixed the FreeBSD CI errors for me: tonyhutter@a491397. Try squashing that in and rebasing.
behlendorf
left a comment
Thanks for picking up this work! It'd be great if we can refine this a bit more and get it integrated.
I've updated this branch to fix a few things.

First, I've significantly reduced (and hopefully almost entirely eliminated) the unnecessary sit-outs Brian was reporting. We reproduced them internally during our performance testing, and the updated version doesn't display them at all. The changes here use the latency histogram stats instead of the EWMA as a better source of data. We also dramatically decrease the check frequency to reduce noise, decrease the required number of outliers to compensate (along with adding a facility for extreme events to increase their outlier count more rapidly), significantly increase the fence value, and add a decay mechanism to prevent random noise from eventually causing a sit-out of healthy disks.

Second, I've added an [...]

Third, I've made the sitout property writeable. This allows individual vdevs to be sat out from userland. In conjunction with the autosit property, this lets the user choose between no disk sit-outs, the kernel's automatic sit-outs, or something more complex. Using zpool iostat latency data, SMART stats, or any other data source they can think of, they could now create a userland daemon that monitors disk health and sits out disks that it considers unhealthy. Providing this capability in userland has a number of advantages: easier access to high-level languages and their rich libraries, safer and more rapid iteration on complex logic, and the ability to improve the logic using new developments and advanced approaches without requiring a kernel upgrade or downtime. The kernel functionality is left in place as a simple plug-and-play approach.
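To make the detection changes described in this comment easier to follow, here is a minimal Python sketch of a fence-based outlier counter with faster accrual for extreme events and a decay step. It is not the kernel implementation: the class name `SlowChildDetector`, the quartile-based fence, and every default value (`fence_mult`, `extreme_mult`, `outliers_needed`, `decay`) are assumptions chosen for illustration, and each disk's latency is assumed to have already been reduced to one number per check from its latency histogram.

```python
from statistics import quantiles


class SlowChildDetector:
    """Toy model of slow-disk detection (illustrative only, not the ZFS code)."""

    def __init__(self, ndisks, fence_mult=3.0, extreme_mult=10.0,
                 outliers_needed=4, decay=1):
        # One outlier counter per child disk; all defaults are hypothetical.
        self.counts = [0] * ndisks
        self.fence_mult = fence_mult
        self.extreme_mult = extreme_mult
        self.outliers_needed = outliers_needed
        self.decay = decay

    def check(self, latencies):
        """Run one periodic check; return the disk index to sit out, or None."""
        q1, _, q3 = quantiles(latencies, n=4, method="inclusive")
        iqr = q3 - q1
        fence = q3 + self.fence_mult * iqr      # ordinary outlier threshold
        extreme = q3 + self.extreme_mult * iqr  # "extreme event" threshold
        for i, lat in enumerate(latencies):
            if lat > extreme:
                self.counts[i] += 2             # extreme events accrue faster
            elif lat > fence:
                self.counts[i] += 1
            else:
                # Decay keeps occasional noise from accumulating forever.
                self.counts[i] = max(0, self.counts[i] - self.decay)
        worst = max(range(len(self.counts)), key=self.counts.__getitem__)
        if self.counts[worst] >= self.outliers_needed:
            self.counts[worst] = 0
            return worst
        return None
```

In this model a disk that is only occasionally above the fence sees its counter decay back to zero, while a persistently slow disk reaches the threshold after a few checks. A userland daemon of the kind mentioned above could apply similar logic to whatever data source it prefers.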
amotin
left a comment
I agree that histograms should indeed be a better source of data compared to EWMA, though, as I mention below, we may want to take a closer look at when we update the previous state. At the same time, I wonder if histograms may actually give us even more statistical information about the distribution curves, so that we could better estimate the confidence interval on a small number of disks.
behlendorf
left a comment
Given the amount of churn in this PR it'd be nice to squash the commits the next time it's updated.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
A single slow-responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios` count is incremented and a zevent of class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
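The commit message relies on parity being able to stand in for the column that the sitting-out disk would have served. As a rough illustration, the Python sketch below models only the single-parity (raidz1) case, where parity is the XOR of the data columns; double and triple parity use additional math that is not shown. The function names `xor_blocks` and `read_with_sit_out` are hypothetical and do not correspond to functions in the ZFS source.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def read_with_sit_out(data_cols, parity, sit_out):
    """Return every data column without reading the one at index sit_out.

    The skipped column is rebuilt from the surviving data columns plus parity,
    which works because XOR-ing the parity with all other columns cancels them.
    """
    survivors = [col for i, col in enumerate(data_cols) if i != sit_out]
    rebuilt = xor_blocks(survivors + [parity])
    result = list(data_cols)
    result[sit_out] = rebuilt  # overwrite with the reconstructed bytes
    return result
```

The cost of a sit-out in this model is one extra parity read plus the XOR work, which is the trade the commit message makes in exchange for not waiting on the slow disk.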
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes openzfs#17227
@pcd1193182 thanks for picking up this often-dropped PR and getting it over the line. Thanks all for the feedback!
Not sure anyone will see this comment, but this capability seems like it could be adapted to my (admittedly unusual, and against some recommendations) setup. I have a raidz1 with just 2 disks. I did this because I want to be able to expand it later to more disks, but for the moment it is what I have. I think that the code in this PR could help me get near-mirror read performance from this setup, since the parity disk holds exactly all the data, and CPU is relatively cheap. I wonder whether the people here who are more knowledgeable than me would agree with this assessment, and whether they think it would be worthwhile to pursue. It seems like it wouldn't be very useful when there are more than 2 disks involved, and I don't know how common this configuration is (though since 2-disk raidz1 became possible, I imagine other people might have been tempted to do what I did).
@stevekstevek The sit out functionality requires at least 5 ( |
It's worth noting that in the abstract, your read performance already shouldn't be that different from a mirror. With the raidz1 quirk where the parity and data swap every megabyte, it should already be issuing reads to both devices. This would matter more with 3-disk raidz or 4-disk raidz, where that doesn't happen, but it also doesn't require this functionality; the raidz io start logic could switch to mirror-style dispatching for single sector reads, at least in theory.
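The effect of the parity/data swap on a 2-disk raidz1 can be shown with a small Python model. This is a deliberate simplification, not the actual raidz mapping code: it assumes every stripe starts with parity on child 0 and data on child 1, and that the two swap whenever the 1 MiB bit of the block offset is set. The names `MEGABYTE_BIT`, `raidz1_two_disk_layout`, and `data_read_distribution` are made up for this sketch.

```python
from collections import Counter

MEGABYTE_BIT = 1 << 20  # assumed swap granularity, per "every megabyte" above


def raidz1_two_disk_layout(offset):
    """Return (parity_child, data_child) for a single-sector block at offset.

    Simplified model: parity starts on child 0 and data on child 1, and the
    two are swapped whenever the 1 MiB bit of the offset is set.
    """
    parity_child, data_child = 0, 1
    if offset & MEGABYTE_BIT:
        parity_child, data_child = data_child, parity_child
    return parity_child, data_child


def data_read_distribution(offsets):
    """Count how many data reads each child would serve for the given offsets."""
    return Counter(raidz1_two_disk_layout(off)[1] for off in offsets)
```

Under these assumptions, reads spread over a range much larger than 1 MiB are split evenly between the two children, which is why read performance is expected to already be close to that of a mirror.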
Motivation and Context
Replacing #16900, which was almost finished with review updates but has stalled. I've been asked to take it over.
See original PR for details.