Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZIO: add "vdev tracing" facility; use it for ZIL flushing #16375

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Commits on Aug 17, 2024

  1. zts: test for correct fsync() response to ZIL flush failure

    If fsync() (zil_commit()) writes successfully, but then the flush fails,
    fsync() should not return success, but instead should fall into a full
    transaction wait.
    
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    Signed-off-by: Rob Norris <[email protected]>
    robn committed Aug 17, 2024
    Configuration menu
    Copy the full SHA
    6ad07c0 View commit details
    Browse the repository at this point in the history
  2. zio_flush: propagate flush errors to the ZIL

    Since the beginning, ZFS' "flush" operation has always ignored
    errors[1]. Write errors are captured and dealt with, but if a write
    succeeds but the subsequent flush fails, the operation as a whole will
    appear to succeed[2].
    
    In the end-of-transaction uberblock+label write+flush ceremony, it's
    very difficult for this situation to occur. Since all devices are
    written to, typically the first write will succeed, the first flush will
    fail unobserved, but then the second write will fail, and the entire
    transaction is aborted. It's difficult to imagine a real-world scenario
    where all the writes in that sequence could succeed even as the flushes
    are failing (understanding that the OS is still seeing hardware problems
    and taking devices offline).
    
    In the ZIL however, it's another story. Since only the write response is
    checked, if that write succeeds but the flush then fails, the ZIL will
    believe that it succeeds, and zil_commit() (and thus fsync()) will
    return success rather than the "correct" behaviour of falling back into
    txg_wait_synced()[3].
    
    This commit fixes this by adding a simple flag to zio_flush() to
    indicate whether or not the caller wants to receive flush errors. This
    flag is enabled for ZIL calls. The existing zio chaining inside the ZIL
    and the flush handler zil_lwb_flush_vdevs_done() already has all the
    necessary support to properly handle a flush failure and fail the entire
    zio chain. This causes zil_commit() to correct fall back to
    txg_wait_synced() rather than returning success prematurely.
    
    1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support
       for flushing devices with write caches with the DKIOCFLUSHWRITECACHE
       ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from
       from the time shows the thinking:
    
       /*
        * Wait for all the flushes to complete.  Not all devices actually
        * support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
        */
    
    2. It's not entirely clear from the code history why this was acceptable
       for devices that _do_ have write caches. Our best guess is that this
       was an oversight: between the combination of hardware, pool topology
       and application behaviour required to hit this, it basically didn't
       come up.
    
    3. Somewhat frustratingly, zil.c contains comments describing this exact
       behaviour, and further discussion in openzfs#12443 (September 2021). It
       appears that those involved saw the potential, but were looking at a
       different problem and so didn't have the context to recognise it for
       what it was.
    
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    Signed-off-by: Rob Norris <[email protected]>
    robn committed Aug 17, 2024
    Configuration menu
    Copy the full SHA
    e7966e5 View commit details
    Browse the repository at this point in the history
  3. flush: don't report flush error when disabling flush support

    The first time a device returns ENOTSUP in repsonse to a flush request,
    we set vdev_nowritecache so we don't issue flushes in the future and
    instead just pretend the succeeded. However, we still return an error
    for the initial flush, even though we just decided such errors are
    meaningless!
    
    So, when setting vdev_nowritecache in response to a flush error, also
    reset the error code to assume success.
    
    Along the way, it seems there's no good reason for vdev_disk & vdev_geom
    to explicitly detect no support for flush and set vdev_nowritecache;
    just letting the error through to zio_vdev_io_assess() will cause it all
    to fall out nicely. So remove those checks.
    
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    Signed-off-by: Rob Norris <[email protected]>
    robn committed Aug 17, 2024
    Configuration menu
    Copy the full SHA
    2331d19 View commit details
    Browse the repository at this point in the history
  4. ZTS: ZIL txg sync fallback test

    If the pool is degraded, ZIL writes should still succeed without falling
    back to a full txg sync.
    
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    Signed-off-by: Rob Norris <[email protected]>
    robn committed Aug 17, 2024
    Configuration menu
    Copy the full SHA
    8a09462 View commit details
    Browse the repository at this point in the history
  5. ZIO: add "vdev tracing" facility; use it for ZIL flushing

    A problem with zio_flush() is that it issues a flush ZIO to a top-level
    vdev, which then recursively issues child flush ZIOs until the real leaf
    devices are flushed. As usual, an error in a child ZIO results in the
    parent ZIO erroring too, so if a leaf device has failed, it's flush ZIO
    will fail, and so will the entire flush operation.
    
    This didn't matter when we used to ignore flush errors, but now that we
    propagate them, the flush error propagates into the ZIL write ZIO. This
    causes the ZIL to believe its write failed, and fall back to a full txg
    wait. This still provides correct behaviour for zil_commit() callers (eg
    fsync()) but it ruins performance.
    
    We cannot simply skip flushing failed vdevs, because the associated
    write may have succeeded before the vdev failed, which would give the
    appearance the write is fully flushed when it is not. Neither can we
    issue a "syncing write" to the device (eg SCSI FUA), as this also
    degrades performance.
    
    The answer is that we must bind writes and flushes together in a way
    such that we only flush the physical devices that we wrote to.
    
    This adds a "vdev tracing" facility to ZIOs, zio_vdev_trace. When
    enabled on a ZIO with ZIO_FLAG_VDEV_TRACE, then upon successful
    completion (in the _done handler), zio->io_vdev_trace_tree will have a
    list of zio_vdev_trace_t objects that each describe a vdev that was
    involved in the successful completion of the ZIO.
    
    A companion function, zio_vdev_trace_flush(), is included, that issues a
    flush ZIO to the child vdevs on the given trace tree.
    zil_lwb_write_done() is updated to use this to bind ZIL writes and
    flushes together.
    
    The tracing facility is similar in many ways to the "deferred flushing"
    facility inside the ZIL, to the point where it can replace it. Now, if
    the flush should be deferred, the trace records from the writing ZIO are
    captured and combined with any captured from previous writes. When its
    finally time to issue the flush, we issue it to the entire accumulated
    set of traced vdevs.
    
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    Signed-off-by: Rob Norris <[email protected]>
    robn committed Aug 17, 2024
    Configuration menu
    Copy the full SHA
    e35088c View commit details
    Browse the repository at this point in the history