Message-ID: <CADUfDZqZce=8LGtjZquxyQDfciOYu4fgtPFqwfkirWS5f6ALow@mail.gmail.com>
Date: Tue, 16 Dec 2025 07:49:34 -0800
From: Caleb Sander Mateos <csander@...estorage.com>
To: Joanne Koong <joannelkoong@...il.com>
Cc: Jens Axboe <axboe@...nel.dk>, io-uring@...r.kernel.org, linux-kernel@...r.kernel.org,
syzbot@...kaller.appspotmail.com
Subject: Re: [PATCH v5 6/6] io_uring: avoid uring_lock for IORING_SETUP_SINGLE_ISSUER
On Mon, Dec 15, 2025 at 11:47 PM Joanne Koong <joannelkoong@...il.com> wrote:
>
> On Tue, Dec 16, 2025 at 2:24 PM Caleb Sander Mateos
> <csander@...estorage.com> wrote:
> >
> > On Mon, Dec 15, 2025 at 8:46 PM Joanne Koong <joannelkoong@...il.com> wrote:
> > >
> > > On Tue, Dec 16, 2025 at 4:10 AM Caleb Sander Mateos
> > > <csander@...estorage.com> wrote:
> > > >
> > > > io_ring_ctx's mutex uring_lock can be quite expensive in high-IOPS
> > > > workloads. Even when only one thread pinned to a single CPU is accessing
> > > > the io_ring_ctx, the atomic CASes required to lock and unlock the mutex
> > > > are very hot instructions. The mutex's primary purpose is to prevent
> > > > concurrent io_uring system calls on the same io_ring_ctx. However, there
> > > > is already a flag IORING_SETUP_SINGLE_ISSUER that promises only one
> > > > task will make io_uring_enter() and io_uring_register() system calls on
> > > > the io_ring_ctx once it's enabled.
> > > > So if the io_ring_ctx is set up with IORING_SETUP_SINGLE_ISSUER, skip the
> > > > uring_lock mutex_lock() and mutex_unlock() on the submitter_task. On
> > > > other tasks acquiring the ctx uring lock, use a task work item to
> > > > suspend the submitter_task for the critical section.
> > >
> > > Does this open the pathway to various data corruption issues since the
> > > submitter task can be suspended while it's in the middle of executing
> > > a section of logic that was previously protected by the mutex? With
> >
> > I don't think so. The submitter task is suspended by having it run a
> > task work item that blocks it until the uring lock is released by the
> > other task. Any section where the uring lock is held should either be
> > on kernel threads, contained within an io_uring syscall, or contained
> > within a task work item, none of which run other task work items. So
> > whenever the submitter task runs the suspend task work, it shouldn't
> > be in a uring-lock-protected section.
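(To sketch roughly what I mean by suspending the submitter via task work --
completely illustrative, with made-up names and structure, not the actual
patch code -- the other task queues a task work item on the submitter, and
that item parks the submitter until the lock holder is done:

#include <linux/completion.h>
#include <linux/container_of.h>
#include <linux/sched.h>
#include <linux/task_work.h>

struct ctx_lock_waiter {
	struct callback_head	cb;
	struct completion	parked;		/* submitter has stopped running */
	struct completion	resume;		/* lock holder is finished */
};

/* Runs in the submitter task's own context, i.e. outside any locked section. */
static void suspend_submitter_tw(struct callback_head *head)
{
	struct ctx_lock_waiter *w = container_of(head, struct ctx_lock_waiter, cb);

	complete(&w->parked);
	wait_for_completion(&w->resume);	/* block until the other task is done */
}

/* Called by another task that wants exclusive access to the ctx. */
static int suspend_submitter(struct task_struct *submitter,
			     struct ctx_lock_waiter *w)
{
	int ret;

	init_completion(&w->parked);
	init_completion(&w->resume);
	init_task_work(&w->cb, suspend_submitter_tw);
	ret = task_work_add(submitter, &w->cb, TWA_SIGNAL);
	if (!ret)
		wait_for_completion(&w->parked);	/* wait until it has parked */
	return ret;
}

static void resume_submitter(struct ctx_lock_waiter *w)
{
	complete(&w->resume);
}

Since the callback only runs when the submitter processes task work, the
submitter can't be parked in the middle of a uring-lock-protected section.)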
> >
> > > this patch (if I'm understanding it correctly), there's now no
> > > guarantee that the logic inside the mutexed section for
> > > IORING_SETUP_SINGLE_ISSUER submitter tasks is "atomically bundled", so
> > > if it gets suspended between two state changes that need to be atomic
> > > / bundled together, then I think the task that does the suspend would
> > > now see corrupt state.
> >
> > Yes, I suppose there's nothing that prevents code from holding the
> > uring lock across syscalls or task work items, but that would already
> > be problematic. If a task holds the uring lock on return from a
> > syscall or task work and then runs another task work item that tries
> > to acquire the uring lock, it would deadlock.
> >
> > >
> > > I did a quick grep and I think one example of this race shows up in
> > > io_uring/rsrc.c for buffer cloning where if the src_ctx has
> > > IORING_SETUP_SINGLE_ISSUER set and the cloning happens at the same
> > > time the submitter task is unregistering the buffers, then this chain
> > > of events happens:
> > > * submitter task is executing the logic in io_sqe_buffers_unregister()
> > > -> io_rsrc_data_free(), and frees data->nodes but data->nr is not yet
> > > updated
> > > * submitter task gets suspended through io_register_clone_buffers() ->
> > > lock_two_rings() -> mutex_lock_nested(&ctx2->uring_lock, ...)
> >
> > I think what this is missing is that the submitter task can't get
> > suspended at arbitrary points. It gets suspended in task work, and
> > task work only runs when returning from the kernel to userspace. At
>
> Ahh I see, thanks for the explanation. The documentation for
> TWA_SIGNAL in task_work_add() says "@TWA_SIGNAL works like signals, in
> that the it will interrupt the targeted task and run the task_work,
> regardless of whether the task is currently running in the kernel or
> userspace" so i had assumed this preempts the kernel.
Yeah, that documentation seems a bit misleading. Task work doesn't run
in interrupt context; otherwise it wouldn't be safe to take mutexes
like the uring lock. I think the comment is trying to say that
TWA_SIGNAL immediately kicks the task into the kernel, interrupting
any *userspace work*. But if the task is already in the kernel, it
won't run task work until returning to userspace. Though I could also
be misunderstanding how task work works.
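For what it's worth, here is a minimal sketch of how I read the API
(illustrative only; the callback and queueing helper names are made up):

#include <linux/sched.h>
#include <linux/task_work.h>

/*
 * The callback runs in @task's own process context via task_work_run(),
 * typically on the way back out to userspace, so it may sleep and take
 * mutexes; it is not run from interrupt context.
 */
static void example_tw_cb(struct callback_head *head)
{
}

static int queue_example_tw(struct task_struct *task, struct callback_head *cb)
{
	init_task_work(cb, example_tw_cb);
	/*
	 * TWA_SIGNAL kicks @task out of userspace (like a signal), but if
	 * @task is already executing in the kernel, the callback doesn't
	 * run until it heads back to userspace (or runs task work
	 * explicitly, as io_uring's wait paths do).
	 */
	return task_work_add(task, cb, TWA_SIGNAL);
}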
Best,
Caleb
>
> Thanks,
> Joanne
>
> > which point "nothing" should be running on the task in userspace or
> > the kernel and it should be safe to run arbitrary task work items on
> > the task. Though Ming recently found an interesting deadlock caused by
> > acquiring a mutex in task work that runs on an unlucky ublk server
> > thread[1].
> >
> > [1] https://lore.kernel.org/linux-block/20251212143415.485359-1-ming.lei@redhat.com/
> >
> > Best,
> > Caleb
> >
> > > * after suspending the src ctx, -> io_clone_buffers() runs, which will
> > > get the incorrect "nbufs = src_ctx->buf_table.nr;" value
> > > * io_clone_buffers() calls io_rsrc_node_lookup() which will
> > > dereference a NULL pointer
> > >
> > > Thanks,
> > > Joanne
> > >
> > > > If the io_ring_ctx is IORING_SETUP_R_DISABLED (possible during
> > > > io_uring_setup(), io_uring_register(), or io_uring exit), submitter_task
> > > > may be set concurrently, so acquire the uring_lock before checking it.
> > > > If submitter_task isn't set yet, the uring_lock suffices to provide
> > > > mutual exclusion.
> > > >
> > > > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > > > Tested-by: syzbot@...kaller.appspotmail.com
> > > > ---
> > > > io_uring/io_uring.c | 12 +++++
> > > > io_uring/io_uring.h | 114 ++++++++++++++++++++++++++++++++++++++++++--
> > > > 2 files changed, 123 insertions(+), 3 deletions(-)
> > > >