Add vdev property to bypass vdev queue #16591

Open · wants to merge 1 commit into master

Conversation

@MigeljanImeri (Contributor) opened this pull request:

Added a vdev property to bypass the vdev queue when reading from or writing to a vdev. The intention behind this property is to improve performance when using O_DIRECT.

Motivation and Context

Bypassing the vdev queue yields significant performance improvements when doing O_DIRECT random reads on NVMe SSDs. We noticed a contention point on vq->vq_lock in a flamegraph and wondered whether this contention could be reduced by bypassing the vdev queue.

Description

An additional check is needed in vdev_queue_io_done to see whether the ZIO was actually added to the queue, since the property may have been toggled between vdev_queue_io and vdev_queue_io_done.
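For illustration, a simplified sketch of what that completion-path check could look like (not the exact diff; the ZIO_QS_* constant names are assumed here):

void
vdev_queue_io_done(zio_t *zio)
{
        vdev_queue_t *vq = &zio->io_vd->vdev_queue;

        /*
         * If the property was toggled after vdev_queue_io() bypassed the
         * queue, this ZIO was never enqueued; taking vq_lock and trying
         * to remove it would trip an assertion.
         */
        if (zio->io_queue_state == ZIO_QS_NONE)
                return;

        mutex_enter(&vq->vq_lock);
        /* ... remove the ZIO from the active list, issue more I/O ... */
        mutex_exit(&vq->vq_lock);
}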

Using a gen 5 NVMe SSD, we were able to get roughly 3.5x the IOPS (136k -> 469k) by bypassing the vdev queue when doing O_DIRECT random reads.

How Has This Been Tested?

Added a test case to check that toggling the bypassqueue property does not cause a kernel panic. While creating this vdev property, I ran into a situation where, if the property was toggled while a ZIO was going through the pipeline, the kernel would panic because the ZIO was expected to be in the queue when it had never been added. We now check whether the ZIO has been added to the vdev queue before removing it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin (Member) left a comment:

I guess it might make sense for some very fast NVMes with "unlimited" queue depth and I/O rate, since others may suffer from the loss of I/O aggregation as a side effect.

Also it seems that you are still obtaining the queue lock on completion, aren't you?

Also I don't see what would initialize the io_queue_state you check on completion if the queue is bypassed.

Also the property seems to have a negative meaning. It is preferable to use positive ones to avoid double negation.

@MigeljanImeri (Contributor, Author) replied:

> I guess it might make sense for some very fast NVMes with "unlimited" queue depth and I/O rate, since others may suffer from the loss of I/O aggregation as a side effect.

> Also it seems that you are still obtaining the queue lock on completion, aren't you?

Changed to avoid obtaining the queue lock if nothing gets removed from the queue.

> Also I don't see what would initialize the io_queue_state you check on completion if the queue is bypassed.

Changed to set the io_queue_state if not queueing the zio.

> Also the property seems to have a negative meaning. It is preferable to use positive ones to avoid double negation.

Changed the property to use a positive meaning.
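For illustration, a rough sketch of how the submission side could set this up (the vdev_queue_io field on the vdev and the ZIO_QS_* names are assumptions for this sketch, not necessarily the actual patch):

zio_t *
vdev_queue_io(zio_t *zio)
{
        vdev_queue_t *vq = &zio->io_vd->vdev_queue;

        /*
         * Hypothetical cached property on the vdev. When queueing is
         * disabled, mark the ZIO as never queued so the completion path
         * can skip vq_lock, and hand it straight back to the caller for
         * immediate issue.
         */
        if (!zio->io_vd->vdev_queue_io) {
                zio->io_queue_state = ZIO_QS_NONE;
                return (zio);
        }

        mutex_enter(&vq->vq_lock);
        zio->io_queue_state = ZIO_QS_QUEUED;
        /* ... insert into the per-priority queue; issued later ... */
        mutex_exit(&vq->vq_lock);
        return (NULL);
}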

@@ -156,6 +156,8 @@ If this device should perform new allocations, used to disable a device
when it is scheduled for later removal.
See
.Xr zpool-remove 8 .
.It Sy queue_io
Add io to the vdev queue when reading or writing to this vdev.
Contributor left a comment:

Might want to give a hint to the user about when they would want to change this prop:

Add IO to the vdev queue when reading or writing to this vdev.
Disabling this property can sometimes improve performance for direct IOs.

Feel free to re-word this ⬆️

@MigeljanImeri (Contributor, Author) replied:

Added.

Comment on lines 70 to 83
log_must eval "fio --filename=$mntpnt/foobar --name=write-file \
--rw=write --size=$MINVDEVSIZE --bs=128k --numjobs=1 --direct=1 \
--ioengine=sync --runtime=10"

Contributor left a comment:

Would it make sense to also do the toggle_queue_io test while doing the write?

@MigeljanImeri (Contributor, Author) replied:

Sure, makes sense to me.
Added.

verify_runnable "global"

command -v fio > /dev/null || log_unsupported "fio missing"

Contributor left a comment:

I think you need to save/set/restore the zfs_dio_enabled module param to use Direct IO:

log_must save_tunable DIO_ENABLED
log_must set_tunable32 DIO_ENABLED 1
...
log_must restore_tunable DIO_ENABLED

@MigeljanImeri (Contributor, Author) replied:

Added.

@robn (Member) commented on Oct 7, 2024:

I am uncomfortable with this change.

I agree that the vq_lock runs white-hot when the disks underneath are very, very fast. I've spent many months trying to reorganise this lock, with mixed success, mostly because in all my testing it just moves the contention to other locks. That was all pre-Direct-IO, though, so things may be different now; my own workloads are not O_DIRECT.

So the main thing I know this will "break" is that the IO is no longer tracked for the "slow IO" checks to find (vdev_deadman()). That's actually the main reason this lock runs hot: removing the IO from the list is the only reason we take it on the IO return path.
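(For context, a simplified sketch of that slow-I/O scan; field and helper names are approximate. A bypassed ZIO would never appear on the tracked list, so the deadman cannot see it, and its return path no longer needs vq_lock.)

static void
vdev_deadman_sketch(vdev_t *vd)
{
        vdev_queue_t *vq = &vd->vdev_queue;

        mutex_enter(&vq->vq_lock);
        if (vq->vq_active > 0) {
                /* Oldest I/O the queue knows about. */
                zio_t *fio = list_head(&vq->vq_active_list);
                uint64_t delta = gethrtime() - fio->io_timestamp;

                if (delta > spa_deadman_synctime(vd->vdev_spa))
                        zio_deadman(fio, FTAG);
        }
        mutex_exit(&vq->vq_lock);
}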

The other part I'm just nervous about is that I don't know how it will interact with IO that is on the queue. Seems like it could starve other high-priority IO. This sort of stuff is why I've spent so long on it without a conclusive path forward.

I would weakly support this being done as a module parameter so it's at least available in the field for experimentation on specific customer workloads, though it's a little awkward with module parameters being scoped far wider than a single vdev instance. Then again, maybe that's fine to start with for an experimental feature; the other vdev_queue_* parameters have the same issue.
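(A hypothetical module-parameter version could be as small as the following sketch; the zfs_vdev_queue_bypass name is illustrative only, and it would be consulted in vdev_queue_io() instead of a per-vdev property.)

/* Hypothetical global toggle; 0 keeps the normal queueing behaviour. */
static int zfs_vdev_queue_bypass = 0;

ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, queue_bypass, INT, ZMOD_RW,
        "Bypass the vdev queue and issue I/O directly to leaf vdevs");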

For now though I'd be against having as a vdev property, because that is both more visible and more permanent, and I don't have the confidence to support it yet.

If I were overruled here, I would at least make the scope far narrower: maybe limit it to Direct IO only, and definitely make the name reflect that narrower scope. I would suggest going back to something like your original name, "queue bypass" or similar (e.g. "directio queue bypass" if it were limited to those).

(I appreciate the point about avoiding double negatives, but I think it's possible to claim that "bypass" is a specific, positive action. At least for me, "queue io" is way too vague, and might cause me to say "why wouldn't I turn that off? Queues sound slow!". But this is only a tiny part of my objection to this whole thing, so if it does stay in its current form, I'm unlikely to worry too much about naming.)

Added vdev property to toggle queueing io to vdev queue when
reading / writing from / to vdev. The intention behind this
property is to improve performance when using o_direct.

Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com>
Labels: Status: Code Review Needed (Ready for review and testing)
Participants: 5