Message-ID: <aSpLN3xPwCqToYrZ@fedora>
Date: Sat, 29 Nov 2025 09:24:07 +0800
From: Ming Lei <ming.lei@...hat.com>
To: Caleb Sander Mateos <csander@...estorage.com>
Cc: Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
Uday Shankar <ushankar@...estorage.com>,
Stefani Seibold <stefani@...bold.net>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO
On Fri, Nov 28, 2025 at 11:07:17AM -0800, Caleb Sander Mateos wrote:
> On Fri, Nov 28, 2025 at 8:19 AM Jens Axboe <axboe@...nel.dk> wrote:
> >
> > On 11/28/25 4:59 AM, Ming Lei wrote:
> > > On Fri, Nov 21, 2025 at 09:58:22AM +0800, Ming Lei wrote:
> > >> Hello,
> > >>
> > >> This patchset adds UBLK_F_BATCH_IO feature for communicating between kernel and ublk
> > >> server in batching way:
> > >>
> > >> - Per-queue vs Per-I/O: Commands operate on queues rather than individual I/Os
> > >>
> > >> - Batch processing: Multiple I/Os are handled in a single operation
> > >>
> > >> - Multishot commands: Use io_uring multishot commands to reduce submission overhead
> > >>
> > >> - Flexible task assignment: Any task can handle any I/O (no per-I/O daemons)
> > >>
> > >> - Better load balancing: Tasks can adjust their workload dynamically
> > >>
> > >> - Helps future optimizations:
> > >>   - batched blk-mq tag freeing
> > >>   - io-poll support
> > >>   - per-task batching to avoid the per-io lock
> > >>   - fetch command priority
> > >>
> > >> - Simplify the command cancel process with a per-queue lock
> > >>
> > >> Selftests are provided.
> > >>
> > >>
> > >> Performance test results (IOPS) on V3:
> > >>
> > >> - page copy
> > >>
> > >> tools/testing/selftests/ublk/kublk add -t null -q 16 [-b]
> > >>
> > >> - zero copy (--auto_zc)
> > >> tools/testing/selftests/ublk/kublk add -t null -q 16 --auto_zc [-b]
> > >>
> > >> - IO test
> > >> taskset -c 0-31 fio/t/io_uring -p0 -n $JOBS -r 30 /dev/ublkb0
> > >>
> > >> 1) 16 jobs IO
> > >> - page copy: 37.77M vs. 42.40M(BATCH_IO), +12%
> > >> - zero copy(--auto_zc): 42.83M vs. 44.43M(BATCH_IO), +3.7%
> > >>
> > >>
> > >> 2) single job IO
> > >> - page copy: 2.54M vs. 2.6M(BATCH_IO), +2.3%
> > >> - zero copy(--auto_zc): 3.13M vs. 3.35M(BATCH_IO), +7%
> > >>
> > >>
> > >> V4:
> > >> - fix handling when running out of the mshot buffer: the request has to
> > >>   be un-prepared for zero copy
> > >> - don't expose unused tags to userspace
> > >> - replace the fixed buffer with a plain user buffer for
> > >>   UBLK_U_IO_PREP_IO_CMDS and UBLK_U_IO_COMMIT_IO_CMDS
> > >> - replace the iov iterator with plain copy_from_user() in
> > >>   ublk_walk_cmd_buf(); this simplifies the code and improves performance
> > >> - don't touch sqe->len for UBLK_U_IO_PREP_IO_CMDS and
> > >>   UBLK_U_IO_COMMIT_IO_CMDS (Caleb Sander Mateos)
> > >> - use READ_ONCE() for accessing sqe->addr (Caleb Sander Mateos)
> > >> - various patch style fixes (Caleb Sander Mateos)
> > >> - inline __kfifo_alloc() (Caleb Sander Mateos)
> > >
> > > Hi Caleb Sander Mateos and Jens,
> > >
> > > Caleb has reviewed patches 1-8, but driver patches 9-18 have not been
> > > reviewed yet.
> > >
> > > I'd like to hear your thoughts on how to move forward. So far, there
> > > seem to be several options:
> > >
> > > 1) merge patches 1-6 into v6.19 first, which can serve as prep patches for BATCH_IO
> > >
> > > 2) delay the whole patchset to the v6.20 cycle
> > >
> > > 3) merge the whole patchset into v6.19
> > >
> > > I am fine with any of these; which one do you prefer?
> > >
> > > BTW, V4 passes all built-in functional and stress tests, and there is just
> > > one small bug fix not posted yet, which can be a follow-up. The new feature
> > > takes a standalone code path, so the regression risk is pretty small.
> >
> > I'm fine taking the whole thing for 6.19. Caleb, let me know if you
> > disagree. I'll queue 1..6 for now, then can follow up later today with
> > the rest as needed.
>
> Sorry I haven't gotten around to reviewing the rest of the series yet.
> I will try to take a look at them all this weekend. I'm not sure the
> batching feature would make sense for our ublk application use case,
> but I have no objection to it as long as it doesn't regress the
> non-batched ublk behavior/performance.
> No problem with queueing up patches 1-6 now (though patch 1 may need
> an ack from a kfifo maintainer?).
BTW, there are many good things about the BATCH_IO feature:
- batch blk-mq completion: the page copy IO mode has shown a >12% IOPS
  improvement, and there is a chance to apply it to zero copy too in the future
- io poll becomes much easier to support: it can be used to poll an nvme
  char/block device to get better IOPS
- the io cancel code path becomes less fragile and easier to debug: in a
  typical implementation there are only one or two per-queue FETCH (multishot)
  commands; the others are just sync one-shot commands
- more chances to improve performance: a lot of generic uring_cmd code path
  cost is saved, such as security_uring_cmd()
- a `perf bug fix` for UBLK_F_PER_IO_DAEMON, plus robust load balance
  support: IOPS is improved by 4X-5X in `fio/t/io_uring -p0 /dev/ublkbN` between:
      ./kublk add -t null --nthreads 8 -q 4 --per_io_tasks
  and
      ./kublk add -t null --nthreads 8 -q 4 -b
- with the per-io lock, the fast io path becomes more robust, and the lock
  can still be bypassed in the future in the per-io-daemon case
The cost is some complexity in the ublk server implementation for maintaining
one or two per-queue FETCH buffers and one or two per-queue COMMIT buffers.
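To make the trade-off above concrete, here is a minimal userspace model of a
per-queue commit buffer. This is only an illustrative sketch, not the ublk
API: struct commit_buf, commit_tag(), commit_flush(), and BATCH_DEPTH are
hypothetical stand-ins for the per-queue COMMIT buffer a real server would
hand to UBLK_U_IO_COMMIT_IO_CMDS; the point is just that N completions cost
roughly N/BATCH_DEPTH commit commands instead of N per-io commands.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_DEPTH 16	/* hypothetical per-queue commit buffer size */

struct commit_buf {
	unsigned short tags[BATCH_DEPTH];	/* completed tags waiting to be committed */
	size_t nr;				/* tags currently buffered */
	size_t flushes;				/* stand-in for COMMIT_IO_CMDS issued */
};

/* Pretend to issue one commit command covering all buffered tags. */
static void commit_flush(struct commit_buf *buf)
{
	if (buf->nr) {
		buf->flushes++;
		buf->nr = 0;
	}
}

/* Buffer one completed tag; flush automatically when the buffer fills. */
static void commit_tag(struct commit_buf *buf, unsigned short tag)
{
	buf->tags[buf->nr++] = tag;
	if (buf->nr == BATCH_DEPTH)
		commit_flush(buf);
}
```

With this model, completing 100 I/Os costs 7 flushes (6 full batches plus one
partial) rather than 100 per-io commands, which is the batching win the cover
letter measures.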
Thanks,
Ming