Message-ID: <CAJnrk1bdkWVLDBrPKFVa7oPNqAw5BCbNo1N393ESp-zQOT0w5A@mail.gmail.com>
Date: Tue, 16 Dec 2025 15:47:31 +0800
From: Joanne Koong <joannelkoong@...il.com>
To: Caleb Sander Mateos <csander@...estorage.com>
Cc: Jens Axboe <axboe@...nel.dk>, io-uring@...r.kernel.org, linux-kernel@...r.kernel.org, 
	syzbot@...kaller.appspotmail.com
Subject: Re: [PATCH v5 6/6] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER

On Tue, Dec 16, 2025 at 2:24 PM Caleb Sander Mateos
<csander@...estorage.com> wrote:
>
> On Mon, Dec 15, 2025 at 8:46 PM Joanne Koong <joannelkoong@...il.com> wrote:
> >
> > On Tue, Dec 16, 2025 at 4:10 AM Caleb Sander Mateos
> > <csander@...estorage.com> wrote:
> > >
> > > io_ring_ctx's mutex uring_lock can be quite expensive in high-IOPS
> > > workloads. Even when only one thread pinned to a single CPU is accessing
> > > the io_ring_ctx, the atomic CASes required to lock and unlock the mutex
> > > are very hot instructions. The mutex's primary purpose is to prevent
> > > concurrent io_uring system calls on the same io_ring_ctx. However, there
> > > is already a flag IORING_SETUP_SINGLE_ISSUER that promises only one
> > > task will make io_uring_enter() and io_uring_register() system calls on
> > > the io_ring_ctx once it's enabled.
> > > So if the io_ring_ctx is set up with IORING_SETUP_SINGLE_ISSUER, skip the
> > > uring_lock mutex_lock() and mutex_unlock() on the submitter_task. When
> > > other tasks acquire the ctx uring lock, use a task work item to
> > > suspend the submitter_task for the critical section.
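
A minimal sketch of that scheme, for concreteness (io_ctx_lock(),
io_suspend_submitter(), and the completion-based park/resume pair
below are illustrative names and structure, not the patch's actual
code):

	/* Illustrative sketch only, not the actual patch. */
	static void io_ctx_lock(struct io_ring_ctx *ctx)
	{
		if ((ctx->flags & IORING_SETUP_SINGLE_ISSUER) &&
		    current == ctx->submitter_task)
			return;	/* fast path: no atomic CAS for the single issuer */

		mutex_lock(&ctx->uring_lock);
		/*
		 * The submitter no longer takes the mutex, so a non-submitter
		 * locker must also park the submitter for the duration of the
		 * critical section, e.g. by queueing the work below with
		 * task_work_add(ctx->submitter_task, &w->cb, TWA_SIGNAL) and
		 * waiting for w->parked before proceeding.
		 */
		io_suspend_submitter(ctx);	/* hypothetical helper */
	}

	struct io_suspend_work {
		struct callback_head	cb;	/* init_task_work(&cb, io_suspend_tw) */
		struct completion	parked;	/* submitter has entered the work */
		struct completion	resume;	/* completed when the lock is dropped */
	};

	static void io_suspend_tw(struct callback_head *cb)
	{
		struct io_suspend_work *w =
			container_of(cb, struct io_suspend_work, cb);

		complete(&w->parked);			/* tell the locker we're parked */
		wait_for_completion(&w->resume);	/* block until the lock is dropped */
	}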
> >
> > Does this open the pathway to various data corruption issues since the
> > submitter task can be suspended while it's in the middle of executing
> > a section of logic that was previously protected by the mutex? With
>
> I don't think so. The submitter task is suspended by having it run a
> task work item that blocks it until the uring lock is released by the
> other task. Any section where the uring lock is held should either be
> on kernel threads, contained within an io_uring syscall, or contained
> within a task work item, none of which run other task work items. So
> whenever the submitter task runs the suspend task work, it shouldn't
> be in a uring-lock-protected section.
>
> > this patch (if I'm understanding it correctly), there's now no
> > guarantee that the logic inside the mutexed section for
> > IORING_SETUP_SINGLE_ISSUER submitter tasks is "atomically bundled", so
> > if it gets suspended between two state changes that need to be atomic
> > / bundled together, then I think the task that does the suspend would
> > now see corrupt state.
>
> Yes, I suppose there's nothing that prevents code from holding the
> uring lock across syscalls or task work items, but that would already
> be problematic. If a task holds the uring lock on return from a
> syscall or task work and then runs another task work item that tries
> to acquire the uring lock, it would deadlock.
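
To make that concrete, the hypothetical (and already buggy) sequence
would be:

	/* Task returns from a syscall still holding the lock: */
	mutex_lock(&ctx->uring_lock);
	/* ... exits the syscall without unlocking ... */
	/* On the way back to userspace, pending task work runs on this task: */
	/*   some_tw_cb() -> mutex_lock(&ctx->uring_lock)  => blocks on itself */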
>
> >
> > I did a quick grep and I think one example of this race shows up in
> > io_uring/rsrc.c for buffer cloning where if the src_ctx has
> > IORING_SETUP_SINGLE_ISSUER set and the cloning happens at the same
> > time the submitter task is unregistering the buffers, then this chain
> > of events happens:
> > * submitter task is executing the logic in io_sqe_buffers_unregister()
> > -> io_rsrc_data_free(), and frees data->nodes but data->nr is not yet
> > updated
> > * submitter task gets suspended through io_register_clone_buffers() ->
> > lock_two_rings() -> mutex_lock_nested(&ctx2->uring_lock, ...)
>
> I think what this is missing is that the submitter task can't get
> suspended at arbitrary points. It gets suspended in task work, and
> task work only runs when returning from the kernel to userspace. At

Ahh, I see, thanks for the explanation. The documentation for
TWA_SIGNAL in task_work_add() says "@TWA_SIGNAL works like signals, in
that it will interrupt the targeted task and run the task_work,
regardless of whether the task is currently running in the kernel or
userspace", so I had assumed this preempts the kernel.

Thanks,
Joanne

> which point "nothing" should be running on the task in userspace or
> the kernel and it should be safe to run arbitrary task work items on
> the task. Though Ming recently found an interesting deadlock caused by
> acquiring a mutex in task work that runs on an unlucky ublk server
> thread[1].
>
> [1] https://lore.kernel.org/linux-block/20251212143415.485359-1-ming.lei@redhat.com/
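
For reference, that exit path is roughly the following (a heavily
abridged paraphrase of exit_to_user_mode_loop() in
kernel/entry/common.c; both the TWA_SIGNAL and TWA_RESUME notification
modes funnel into task_work_run() from here, on the way back to
userspace, not at arbitrary kernel preemption points):

	static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
						    unsigned long ti_work)
	{
		while (ti_work & EXIT_TO_USER_MODE_WORK) {
			if (ti_work & _TIF_NEED_RESCHED)
				schedule();

			/* TWA_SIGNAL: the signal path runs pending task work */
			if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
				arch_do_signal_or_restart(regs);

			/* TWA_RESUME: resume_user_mode_work() -> task_work_run() */
			if (ti_work & _TIF_NOTIFY_RESUME)
				resume_user_mode_work(regs);

			ti_work = read_thread_flags();
		}
		return ti_work;
	}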
>
> Best,
> Caleb
>
> > * after suspending the src ctx, io_clone_buffers() runs, which will
> > get the incorrect "nbufs = src_ctx->buf_table.nr;" value
> > * io_clone_buffers() calls io_rsrc_node_lookup() which will
> > dereference a NULL pointer
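
Spelling that chain out as a timeline (field and function names as in
io_uring/rsrc.c above; whether the suspension can actually land at the
marked point is exactly the question):

	/* submitter task, io_sqe_buffers_unregister() -> io_rsrc_data_free(): */
	kvfree(data->nodes);
	data->nodes = NULL;
	/* <-- suppose the submitter were parked here: data->nr still non-zero */

	/* cloning task, after lock_two_rings() has suspended the submitter: */
	nbufs = src_ctx->buf_table.nr;		/* stale, still non-zero */
	node = io_rsrc_node_lookup(&src_ctx->buf_table, i);
						/* indexes the freed/NULL nodes array */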
> >
> > Thanks,
> > Joanne
> >
> > > If the io_ring_ctx is IORING_SETUP_R_DISABLED (possible during
> > > io_uring_setup(), io_uring_register(), or io_uring exit), submitter_task
> > > may be set concurrently, so acquire the uring_lock before checking it.
> > > If submitter_task isn't set yet, the uring_lock suffices to provide
> > > mutual exclusion.
> > >
> > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > > Tested-by: syzbot@...kaller.appspotmail.com
> > > ---
> > >  io_uring/io_uring.c |  12 +++++
> > >  io_uring/io_uring.h | 114 ++++++++++++++++++++++++++++++++++++++++++--
> > >  2 files changed, 123 insertions(+), 3 deletions(-)
> > >
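
The ordering for that R_DISABLED case might look like this (same
illustrative io_ctx_lock() and io_suspend_submitter() as in the sketch
near the top of the thread, not the patch's actual code):

	static void io_ctx_lock(struct io_ring_ctx *ctx)
	{
		if (ctx->flags & IORING_SETUP_SINGLE_ISSUER) {
			/* Fast path only once the ring is enabled. */
			if (!(ctx->flags & IORING_SETUP_R_DISABLED) &&
			    current == ctx->submitter_task)
				return;

			mutex_lock(&ctx->uring_lock);
			/*
			 * While R_DISABLED, submitter_task may be set
			 * concurrently, so only check it under the mutex.
			 * If it is still unset, the mutex alone provides
			 * mutual exclusion.
			 */
			if (ctx->submitter_task && current != ctx->submitter_task)
				io_suspend_submitter(ctx);	/* hypothetical */
			return;
		}
		mutex_lock(&ctx->uring_lock);
	}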
