Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always validate checksums for Direct I/O reads #16598

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bwatkinson
Copy link
Contributor

@bwatkinson bwatkinson commented Oct 3, 2024

This fixes an oversight in the Direct I/O PR. There is nothing stops a process from manipulating the contents of a buffer for a Direct I/O read while the I/O is in flight. This can lead checksum verify failures. However, the disk contents are stil correct, but this would lead false reporting of checksum validations.

To remedy this, all Direct I/O reads that have a checksum verification failure are treated as suspicious. In the event a checksum validation failure occurs for a Direct I/O read, then the I/O request will be reissued though the ARC. This allows for actual validation to happen and removes any possibility of the buffer being manipulated after the I/O has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O read checksum validation failures are reported though zpool status -d in the DIO columnm. Also the zevent has been updated to have both:

  1. dio_verify_wr -> Checksum verification failure for writes
  2. dio_verify_rd -> Checksum verification failure for reads. This allows for determining what I/O operation was the culprit for the checksum verification failure. All DIO errors are reported only on the top-level VDEV.

Even though FreeBSD can write protect pages (stable pages) it still has the same issue as Linux in department.

This commit updates the following:

  1. Propogates checksum failures for reads all the way up to the top-level VDEV.
  2. Reports errors through zpool status -d as DIO.
  3. Has two zevents for checksum verify errors with Direct I/O. One for read and one for write.
  4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and handle ABD buffer contents validation the same as Linux.
  5. Moves the declartion of nbytes in zfs_read() to the top of the function and outside of the while loop. This was needed due to a compliation failure in FreeBSD.
  6. Updated manipulate_user_buffer.c to also manipulate a buffer while a Direct I/O read is taking place.
  7. Adds a new test case dio_read_verify that stress tests the new code.

This issue was first observed when installing a Windows 11 VM on a ZFS dataset with the dataset property direct set to always. The zpool devices would report checksum failures, but running a subsequent zpool scrub would not repair any data and report no errors.

Make sure that checksum verification failures with Direct I/O reads are reported and always reissued through the ARC.

Motivation and Context

Without this, there are false positive checksum verification failures reported when a Direct I/O read is issued and a process is manipulating the buffer contents.

Description

See above.

How Has This Been Tested?

Added new test case. Tested on Linux: 6.5.12-100.fc37.x86_64 kernel. On that same Linux kernel I used Virtual Manger to install Windows 11, Ubuntu 22.04, and Fedora 38 on a dataset with the direct dataset property set to always. Ran ZTS direct test cases as well.

On FreeBSD tested ZTS direct tests on FreeBSD 13.3 -RELEASE and FreeBSD 14-CURRENT VM's.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

This fixes an oversight in the Direct I/O PR. There is nothing stops a
process from manipulating the contents of a buffer for a Direct I/O read
while the I/O is in flight. This can lead checksum verify failures.
However, the disk contents are stil correct, but this would lead false
reporting of checksum validations.

To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, then the I/O request will be
reissued though the ARC. This allows for actual validation to happen and
removes any possibility of the buffer being manipulated after the I/O
has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported though zpool status -d in
the DIO columnm. Also the zevent has been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes
2. dio_verify_rd -> Checksum verification failure for reads.
This allows for determining what I/O operation was the culprit for the
checksum verification failure. All DIO errors are reported only on the
top-level VDEV.

Even though FreeBSD can write protect pages (stable pages) it still has
the same issue as Linux in department.

This commit updates the following:
1. Propogates checksum failures for reads all the way up to the
   top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for
   read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
   handle ABD buffer contents validation the same as Linux.
5. Moves the declartion of nbytes in zfs_read() to the top of the
   function and outside of the while loop. This was needed due to a
   compliation failure in FreeBSD.
6. Updated manipulate_user_buffer.c to also manipulate a buffer while a
   Direct I/O read is taking place.
7. Adds a new test case dio_read_verify that stress tests the new code.

This issue was first observed when installing a Windows 11 VM on a ZFS
dataset with the dataset property direct set to always. The zpool
devices would report checksum failures, but running a subsequent zpool
scrub would not repair any data and report no errors.

Signed-off-by: Brian Atkinson <[email protected]>
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 3, 2024
Comment on lines +651 to +653
* for nay ABD's that contain user pages from Direct I/O. See
* the comment above about Direct I/O read buffers possibly
* being manipulated. In order to handle this, we jsut interate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"for any ABDs"
"we just iterate"

@@ -1008,7 +1008,9 @@ abd_borrow_buf_copy(abd_t *abd, size_t n)
* borrowed. We can not ASSERT that the contents of the buffer have not changed
* if it is composed of user pages because the pages can not be placed under
* write protection and the user could have possibly changed the contents in
* the pages at any time.
* the pages at any time. This is also an issue for Direct I/O reads. Checksum
* verifications in the ZIO pipeline check fo rthis issue and handle it by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"check for this issue"

# STRATEGY:
# 1. Create a zpool from each vdev type.
# 2. Start a Direct I/O read workload while manipulating the user buffer
# without using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"without using.."?

@tonyhutter
Copy link
Contributor

Note: I'm not seeing the DIO errors in the JSON output (zpool status -d -j | jq). I dunno if you want to add that in this PR or another PR, but thought it was worth mentioning.

@tonyhutter
Copy link
Contributor

This should give you the JSON:

diff --git a/cmd/zpool/zpool_main.c b/cmd/zpool/zpool_main.c
index aa7da68aa..90f4225e1 100644
--- a/cmd/zpool/zpool_main.c
+++ b/cmd/zpool/zpool_main.c
@@ -9222,6 +9222,11 @@ vdev_stats_nvlist(zpool_handle_t *zhp, status_cbdata_t *cb, nvlist_t *nv,
                                fnvlist_add_string(vds, "power_state", "-");
                        }
                }
+               if (cb->cb_print_dio_verify) {
+                       nice_num_str_nvlist(vds, "dio_verify_errors",
+                           vs->vs_dio_verify_errors, cb->cb_literal,
+                           cb->cb_json_as_int, ZFS_NICENUM_1024);
+               }
        }
 
        if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT,

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verification and error propagation looks good. Thanks for running down this very strange but entirely possible case. Most of the comments are for minor nits.

@@ -1832,6 +1833,19 @@ vdev_indirect_io_done(zio_t *zio)

zio_bad_cksum_t zbc;
int ret = zio_checksum_error(zio, &zbc);
/*
* Any Direct I/O read with that has a checksum error must be treated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Any Direct I/O read with that has a checksum error must be treated
* Any Direct I/O read that has a checksum error must be treated

* suspicious. The contents of the buffer could
* have beeen manipulated while the I/O was in
* flight. In this case, the remainder of I/O
* request will just be just be reissued through
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* request will just be just be reissued through
* request will just be reissued through the ARC.

uio->uio_extflg &= ~UIO_DIRECT;
n += dio_remaining_resid;
dio_remaining_resid = 0;
goto top;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goto looks like it can be replaced with continue here and the "top" label removed.

@@ -472,6 +493,10 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)
out:
zfs_rangelock_exit(lr);

if (dio_checksum_failure == B_TRUE) {
uio->uio_extflg |= UIO_DIRECT;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: parenthesis can be removed.

* checksum verify failure is removed from the parent. If it
* occurs again on the ditto block, then the parent will
* require this flag for reporting the erorr to the top-level
* Mirror VDEV in vdev_mirror_io_done().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Mirror VDEV in vdev_mirror_io_done().
* mirror VDEV in vdev_mirror_io_done().

* Any Direct I/O read that has a checksum
* error must be treated as suspicous as the
* contents of the buffer could be getting
* manipulated whiel the I/O is taking place.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* manipulated whiel the I/O is taking place.
* manipulated while the I/O is taking place.

* The checksum verify error will only be
* reported here for disk and file VDEV's and
* will be reported on those that the failure
* occurred on. Oter types of VDEV's report the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* occurred on. Oter types of VDEV's report the
* occurred on. Other types of VDEV's report the

void
zio_dio_chksum_verify_error_report(zio_t *zio)
{
ASSERT(zio->io_flags & ZIO_FLAG_DIO_CHKSUM_ERR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd add an empty line after the ASSET.

mutex_exit(&zio->io_vd->vdev_stat_lock);
if (zio->io_type == ZIO_TYPE_WRITE) {
/*
* Convert checksum error error for writes into EIO.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Convert checksum error error for writes into EIO.
* Convert checksum error for writes into EIO.


log_must truncate -s $MINVDEVSIZE $DIO_VDEVS

# We will verfiy that there is no checksum errors for every Direct I/O read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# We will verfiy that there is no checksum errors for every Direct I/O read
# We will verify that there are no checksum errors for every Direct I/O read

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants