Message-ID: <aABZl4Yxdf3yew4q@fedora>
Date: Thu, 17 Apr 2025 09:29:59 +0800
From: Ming Lei <ming.lei@...hat.com>
To: Uday Shankar <ushankar@...estorage.com>
Cc: Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Caleb Sander Mateos <csander@...estorage.com>
Subject: Re: [PATCH v3 2/2] ublk: require unique task per io instead of
 unique task per hctx

On Tue, Apr 15, 2025 at 06:12:09PM -0600, Uday Shankar wrote:
> On Fri, Apr 11, 2025 at 04:53:19PM +0800, Ming Lei wrote:
> > On Thu, Apr 10, 2025 at 06:17:51PM -0600, Uday Shankar wrote:
> > > Currently, ublk_drv associates to each hardware queue (hctx) a unique
> > > task (called the queue's ubq_daemon) which is allowed to issue
> > > COMMIT_AND_FETCH commands against the hctx. If any other task attempts
> > > to do so, the command fails immediately with EINVAL. When considered
> > > together with the block layer architecture, the result is that for each
> > > CPU C on the system, there is a unique ublk server thread which is
> > > allowed to handle I/O submitted on CPU C. This can lead to suboptimal
> > > performance under imbalanced load generation. For an extreme example,
> > > suppose all the load is generated on CPUs mapping to a single ublk
> > > server thread. Then that thread may be fully utilized and become the
> > > bottleneck in the system, while other ublk server threads are totally
> > > idle.
> > > 
> > > This issue can also be addressed directly in the ublk server without
> > > kernel support by having threads dequeue I/Os and pass them around to
> > > ensure even load. But this solution requires inter-thread communication
> > > at least twice for each I/O (submission and completion), which is
> > > generally a bad pattern for performance. The problem gets even worse
> > > with zero copy, as more inter-thread communication would be required
> > > to have the buffer register/unregister calls come from the correct
> > > thread.
> > 
> > Agree.
> > 
> > The limit actually originates from the current implementation; both
> > REGISTER_IO_BUF and UNREGISTER_IO_BUF should be fine to run from
> > another pthread because the request buffer 'meta' is actually
> > read-only.
> > 
> > > 
> > > Therefore, address this issue in ublk_drv by requiring a unique task per
> > > I/O instead of per queue/hctx. Imbalanced load can then be balanced
> > > across all ublk server threads by having threads issue FETCH_REQs in a
> > > round-robin manner. As a small toy example, consider a system with a
> > > single ublk device having 2 queues, each of queue depth 4. A ublk server
> > > having 4 threads could issue its FETCH_REQs against this device as
> > > follows (where each entry is the qid,tag pair that the FETCH_REQ
> > > targets):
> > > 
> > > poller thread:	T0	T1	T2	T3
> > > 		0,0	0,1	0,2	0,3
> > > 		1,3	1,0	1,1	1,2
> > > 
> > > Since tags appear to be allocated in sequential chunks, this setup
> > > provides a rough approximation to distributing I/Os round-robin across
> > > all ublk server threads, while letting I/Os stay fully thread-local.
> > 
> > BLK_MQ_F_TAG_RR can be set to help here, so is it possible to make
> > this a feature flag, and set BLK_MQ_F_TAG_RR when the feature is
> > enabled?
> 
> Yes, it would be easy enough to add. However, we have been testing with

That is why I suggest adding it as a feature, such as PER_IO_TASK; then
you can apply future optimizations to this feature only.

There are other differences for this feature, such as how to set each io
task's affinity, how to partition the tag space in an optimized way, and
so on.

BTW, recently I found it helps performance to select just one cpu as the
queue thread's sched affinity.
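
For example, a minimal sketch of what I mean, assuming the server
threads are pthreads; pin_to_cpu() is just an illustrative helper name,
not existing code:

	/* Illustrative sketch: pin the calling thread to one CPU. */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>

	static int pin_to_cpu(unsigned int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		/* returns 0 on success, an error number on failure */
		return pthread_setaffinity_np(pthread_self(),
					      sizeof(set), &set);
	}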

A feature flag also has a documentation benefit.

Also, `Documentation/block/ublk.rst` needs to be updated for this
change/feature.

Fortunately the cancel code path is already generic enough to cover
PER_IO_TASK.

> the v1 patch [1] for a while now, and have seen pretty even load
> balancing even without BLK_MQ_F_TAG_RR. So I am not sure it is worth
> it, or whether we would use the flag, especially considering that it is
> documented as reducing performance.

Per-io task actually depends on IO being balanced across each
partitioned tag space, which relies heavily on the tag allocation
algorithm.
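
For reference, the example layout quoted above amounts to a mapping like
the one below; this is only a sketch, the helper name is made up, and it
assumes tags are dense in [0, queue_depth):

	/*
	 * Map (qid, tag) to a server thread index, matching the
	 * round-robin FETCH_REQ layout in the quoted example:
	 * queue 0 starts at thread 0, queue 1 at thread 1, and so on.
	 */
	static inline unsigned int io_thread_idx(unsigned int qid,
						 unsigned int tag,
						 unsigned int nthreads)
	{
		return (qid + tag) % nthreads;
	}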

> 
> [1] https://lore.kernel.org/all/20241002224437.3088981-1-ushankar@purestorage.com/
> 
> > Also can you share what the preferred implementation is for ublk server?
> > 
> > I think a per-io pthread may not be good; maybe partition the tag
> > space into a fixed range per pthread?
> 
> By "unique task per io" I mean that each io can have its own task
> (including two ios in the same queue can have different tasks), but two
> ios can have the same task.
> 
> That's roughly what we're doing: we have a handful of threads (around
> 8-16) and we split up the I/Os between them. With this patch we lift the
> restriction that each thread corresponds 1:1 with a ublk_queue/hctx.

OK, care to add a command line option (such as queue_tasks) to enable it
in the ublk kernel selftests? Then it can serve to:

- cover the added code in the selftests

- avoid breaking this feature with future changes

- provide an example showing how to use this feature

- allow performance evaluation with different target settings

The main change should be in ublk_io_handler_fn() & ublk_queue_init(),
allocating one io_uring array for each queue. For the target code, we
already have ublk_queue_alloc_sqes(), in which the ring selection can be
done centrally & transparently.
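
Something along these lines, just as a rough sketch - the struct and
helper below are made-up illustrations, not the actual selftest code:

	#include <liburing.h>

	/*
	 * Hypothetical per-queue state: one io_uring per server task,
	 * e.g. sized by a new queue_tasks command line option.
	 */
	struct per_task_rings {
		unsigned int nr_tasks;
		struct io_uring *rings;	/* array of nr_tasks rings */
	};

	/*
	 * Select the ring for a tag centrally, so target code built on
	 * ublk_queue_alloc_sqes() needs no per-thread knowledge.
	 */
	static struct io_uring *ring_for_tag(struct per_task_rings *r,
					     unsigned int tag)
	{
		return &r->rings[tag % r->nr_tasks];
	}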

> 
> > The `ublk_queue' reference is basically read-only in the IO code
> > path; I think it needs to be declared explicitly as a 'const' pointer
> > in the IO/uring code path first. Otherwise, it is easy to trigger a
> > data race with per-io task since it is lockless.
> 
> That is a good suggestion.

Great to see you have started on it.

Maybe it can be done as preparation patches, which can be merged first.
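
Just to illustrate the idea (ublk_do_io() here is a made-up name, not a
real function in the driver):

	/* before: writes through ubq compile silently */
	static void ublk_do_io(struct ublk_queue *ubq, int tag);

	/*
	 * after: the compiler rejects accidental writes through ubq,
	 * which matters in the lockless per-io task path
	 */
	static void ublk_do_io(const struct ublk_queue *ubq, int tag);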



Thanks,
Ming

