-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always validate checksums for Direct I/O reads #16598
base: master
Are you sure you want to change the base?
Conversation
This fixes an oversight in the Direct I/O PR. There is nothing stops a process from manipulating the contents of a buffer for a Direct I/O read while the I/O is in flight. This can lead checksum verify failures. However, the disk contents are stil correct, but this would lead false reporting of checksum validations. To remedy this, all Direct I/O reads that have a checksum verification failure are treated as suspicious. In the event a checksum validation failure occurs for a Direct I/O read, then the I/O request will be reissued though the ARC. This allows for actual validation to happen and removes any possibility of the buffer being manipulated after the I/O has been issued. Just as with Direct I/O write checksum validation failures, Direct I/O read checksum validation failures are reported though zpool status -d in the DIO columnm. Also the zevent has been updated to have both: 1. dio_verify_wr -> Checksum verification failure for writes 2. dio_verify_rd -> Checksum verification failure for reads. This allows for determining what I/O operation was the culprit for the checksum verification failure. All DIO errors are reported only on the top-level VDEV. Even though FreeBSD can write protect pages (stable pages) it still has the same issue as Linux in department. This commit updates the following: 1. Propogates checksum failures for reads all the way up to the top-level VDEV. 2. Reports errors through zpool status -d as DIO. 3. Has two zevents for checksum verify errors with Direct I/O. One for read and one for write. 4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and handle ABD buffer contents validation the same as Linux. 5. Moves the declartion of nbytes in zfs_read() to the top of the function and outside of the while loop. This was needed due to a compliation failure in FreeBSD. 6. Updated manipulate_user_buffer.c to also manipulate a buffer while a Direct I/O read is taking place. 7. Adds a new test case dio_read_verify that stress tests the new code. This issue was first observed when installing a Windows 11 VM on a ZFS dataset with the dataset property direct set to always. The zpool devices would report checksum failures, but running a subsequent zpool scrub would not repair any data and report no errors. Signed-off-by: Brian Atkinson <[email protected]>
* for nay ABD's that contain user pages from Direct I/O. See | ||
* the comment above about Direct I/O read buffers possibly | ||
* being manipulated. In order to handle this, we jsut interate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"for any ABDs"
"we just iterate"
@@ -1008,7 +1008,9 @@ abd_borrow_buf_copy(abd_t *abd, size_t n) | |||
* borrowed. We can not ASSERT that the contents of the buffer have not changed | |||
* if it is composed of user pages because the pages can not be placed under | |||
* write protection and the user could have possibly changed the contents in | |||
* the pages at any time. | |||
* the pages at any time. This is also an issue for Direct I/O reads. Checksum | |||
* verifications in the ZIO pipeline check fo rthis issue and handle it by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"check for this issue"
# STRATEGY: | ||
# 1. Create a zpool from each vdev type. | ||
# 2. Start a Direct I/O read workload while manipulating the user buffer | ||
# without using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"without using.."?
Note: I'm not seeing the DIO errors in the JSON output ( |
This should give you the JSON: diff --git a/cmd/zpool/zpool_main.c b/cmd/zpool/zpool_main.c
index aa7da68aa..90f4225e1 100644
--- a/cmd/zpool/zpool_main.c
+++ b/cmd/zpool/zpool_main.c
@@ -9222,6 +9222,11 @@ vdev_stats_nvlist(zpool_handle_t *zhp, status_cbdata_t *cb, nvlist_t *nv,
fnvlist_add_string(vds, "power_state", "-");
}
}
+ if (cb->cb_print_dio_verify) {
+ nice_num_str_nvlist(vds, "dio_verify_errors",
+ vs->vs_dio_verify_errors, cb->cb_literal,
+ cb->cb_json_as_int, ZFS_NICENUM_1024);
+ }
}
if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The verification and error propagation looks good. Thanks for running down this very strange but entirely possible case. Most of the comments are for minor nits.
@@ -1832,6 +1833,19 @@ vdev_indirect_io_done(zio_t *zio) | |||
|
|||
zio_bad_cksum_t zbc; | |||
int ret = zio_checksum_error(zio, &zbc); | |||
/* | |||
* Any Direct I/O read with that has a checksum error must be treated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Any Direct I/O read with that has a checksum error must be treated | |
* Any Direct I/O read that has a checksum error must be treated |
* suspicious. The contents of the buffer could | ||
* have beeen manipulated while the I/O was in | ||
* flight. In this case, the remainder of I/O | ||
* request will just be just be reissued through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* request will just be just be reissued through | |
* request will just be reissued through the ARC. |
uio->uio_extflg &= ~UIO_DIRECT; | ||
n += dio_remaining_resid; | ||
dio_remaining_resid = 0; | ||
goto top; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goto
looks like it can be replaced with continue
here and the "top" label removed.
@@ -472,6 +493,10 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr) | |||
out: | |||
zfs_rangelock_exit(lr); | |||
|
|||
if (dio_checksum_failure == B_TRUE) { | |||
uio->uio_extflg |= UIO_DIRECT; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: parenthesis can be removed.
* checksum verify failure is removed from the parent. If it | ||
* occurs again on the ditto block, then the parent will | ||
* require this flag for reporting the erorr to the top-level | ||
* Mirror VDEV in vdev_mirror_io_done(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Mirror VDEV in vdev_mirror_io_done(). | |
* mirror VDEV in vdev_mirror_io_done(). |
* Any Direct I/O read that has a checksum | ||
* error must be treated as suspicous as the | ||
* contents of the buffer could be getting | ||
* manipulated whiel the I/O is taking place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* manipulated whiel the I/O is taking place. | |
* manipulated while the I/O is taking place. |
* The checksum verify error will only be | ||
* reported here for disk and file VDEV's and | ||
* will be reported on those that the failure | ||
* occurred on. Oter types of VDEV's report the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* occurred on. Oter types of VDEV's report the | |
* occurred on. Other types of VDEV's report the |
void | ||
zio_dio_chksum_verify_error_report(zio_t *zio) | ||
{ | ||
ASSERT(zio->io_flags & ZIO_FLAG_DIO_CHKSUM_ERR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd add an empty line after the ASSET.
mutex_exit(&zio->io_vd->vdev_stat_lock); | ||
if (zio->io_type == ZIO_TYPE_WRITE) { | ||
/* | ||
* Convert checksum error error for writes into EIO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Convert checksum error error for writes into EIO. | |
* Convert checksum error for writes into EIO. |
|
||
log_must truncate -s $MINVDEVSIZE $DIO_VDEVS | ||
|
||
# We will verfiy that there is no checksum errors for every Direct I/O read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# We will verfiy that there is no checksum errors for every Direct I/O read | |
# We will verify that there are no checksum errors for every Direct I/O read |
This fixes an oversight in the Direct I/O PR. There is nothing stops a process from manipulating the contents of a buffer for a Direct I/O read while the I/O is in flight. This can lead checksum verify failures. However, the disk contents are stil correct, but this would lead false reporting of checksum validations.
To remedy this, all Direct I/O reads that have a checksum verification failure are treated as suspicious. In the event a checksum validation failure occurs for a Direct I/O read, then the I/O request will be reissued though the ARC. This allows for actual validation to happen and removes any possibility of the buffer being manipulated after the I/O has been issued.
Just as with Direct I/O write checksum validation failures, Direct I/O read checksum validation failures are reported though zpool status -d in the DIO columnm. Also the zevent has been updated to have both:
Even though FreeBSD can write protect pages (stable pages) it still has the same issue as Linux in department.
This commit updates the following:
This issue was first observed when installing a Windows 11 VM on a ZFS dataset with the dataset property direct set to always. The zpool devices would report checksum failures, but running a subsequent zpool scrub would not repair any data and report no errors.
Make sure that checksum verification failures with Direct I/O reads are reported and always reissued through the ARC.
Motivation and Context
Without this, there are false positive checksum verification failures reported when a Direct I/O read is issued and a process is manipulating the buffer contents.
Description
See above.
How Has This Been Tested?
Added new test case. Tested on Linux: 6.5.12-100.fc37.x86_64 kernel. On that same Linux kernel I used Virtual Manger to install Windows 11, Ubuntu 22.04, and Fedora 38 on a dataset with the direct dataset property set to always. Ran ZTS direct test cases as well.
On FreeBSD tested ZTS direct tests on FreeBSD 13.3 -RELEASE and FreeBSD 14-CURRENT VM's.
Types of changes
Checklist:
Signed-off-by
.