linux-kernel - io_uring: incorrect assumption about mutex behavior on unlock?

Message-ID: <CAG48ez3xSoYb+45f1RLtktROJrpiDQ1otNvdR+YLQf7m+Krj5Q@mail.gmail.com>
Date:   Fri, 1 Dec 2023 17:41:13 +0100
From:   Jann Horn <jannh@...gle.com>
To:     Jens Axboe <axboe@...nel.dk>,
        Pavel Begunkov <asml.silence@...il.com>,
        io-uring <io-uring@...r.kernel.org>
Cc:     kernel list <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>
Subject: io_uring: incorrect assumption about mutex behavior on unlock?

mutex_unlock() has a different API contract compared to spin_unlock().
spin_unlock() can be used to release ownership of an object, so that
as soon as the spinlock is unlocked, another task is allowed to free
the object containing the spinlock.
mutex_unlock() does not support this kind of usage: The caller of
mutex_unlock() must ensure that the mutex stays alive until
mutex_unlock() has returned.
(See the thread
<https://lore.kernel.org/all/20231130204817.2031407-1-jannh@google.com/>
which discusses adding documentation about this.)
(POSIX userspace mutexes differ from kernel mutexes here: in
userspace, this pattern is allowed.)
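
For illustration, here is a minimal userspace sketch of the
"unlock releases ownership" pattern, written against POSIX threads,
where it is permitted. The names struct obj, obj_put() and
run_two_putters() are made up for this example. Whichever thread drops
the last reference destroys and frees the object immediately after its
own unlock, possibly while the other thread has not yet returned from
pthread_mutex_unlock(). The same shape with a kernel struct mutex would
violate the contract described above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object; the mutex protects refcnt. */
struct obj {
    pthread_mutex_t lock;
    int refcnt;
};

static atomic_int g_frees;  /* how many times the object was freed */

/* Drop one reference; the thread that drops the last one frees obj. */
static void obj_put(struct obj *o)
{
    pthread_mutex_lock(&o->lock);
    int remaining = --o->refcnt;
    pthread_mutex_unlock(&o->lock);  /* last access to *o unless remaining == 0 */

    if (remaining == 0) {
        /* The other thread may still be inside its own unlock call.
         * POSIX allows destroying the mutex here; a kernel mutex does not. */
        pthread_mutex_destroy(&o->lock);
        free(o);
        atomic_fetch_add(&g_frees, 1);
    }
}

static void *put_thread(void *arg)
{
    obj_put(arg);
    return NULL;
}

/* Returns the number of frees observed (expected: 1), or -1 on setup error. */
static int run_two_putters(void)
{
    pthread_t a, b;
    struct obj *o = malloc(sizeof(*o));

    if (!o)
        return -1;
    pthread_mutex_init(&o->lock, NULL);
    o->refcnt = 2;
    atomic_store(&g_frees, 0);

    if (pthread_create(&a, NULL, put_thread, o) != 0) {
        pthread_mutex_destroy(&o->lock);
        free(o);
        return -1;
    }
    if (pthread_create(&b, NULL, put_thread, o) != 0)
        obj_put(o);                  /* drop the second reference ourselves */
    else
        pthread_join(b, NULL);
    pthread_join(a, NULL);

    return atomic_load(&g_frees);
}
```

Calling run_two_putters() from a test harness should report exactly
one free, regardless of which thread ends up dropping the last
reference.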

io_ring_exit_work() has a comment that seems to assume that the
uring_lock (which is a mutex) can be used as if the spinlock-style API
contract applied:

    /*
    * Some may use context even when all refs and requests have been put,
    * and they are free to do so while still holding uring_lock or
    * completion_lock, see io_req_task_submit(). Apart from other work,
    * this lock/unlock section also waits them to finish.
    */
    mutex_lock(&ctx->uring_lock);

I couldn't find any way in which io_req_task_submit() actually still
relies on this. I think io_fallback_req_func() now relies on it,
though I'm not sure whether that's intentional. ctx->fallback_work is
flushed in io_ring_ctx_wait_and_kill(), but I think it can probably be
restarted later on via:

io_ring_exit_work -> io_move_task_work_from_local ->
io_req_normal_work_add -> io_fallback_tw(sync=false) ->
schedule_delayed_work

I think it is probably guaranteed that ctx->refs is non-zero when we
enter io_fallback_req_func, since I think we can't enter
io_fallback_req_func with an empty ctx->fallback_llist, and the
requests queued up on ctx->fallback_llist have to hold refcounted
references to the ctx. But by the time we reach the mutex_unlock(), I
think we're not guaranteed to hold any references on the ctx anymore,
and so the ctx could theoretically be freed in the middle of the
mutex_unlock() call?

I think that to make this code properly correct, it might be necessary
to either add another flush_delayed_work() call after ctx->refs has
dropped to zero and we know that the fallback work can't be restarted
anymore, or create an extra ctx->refs reference that is dropped in
io_fallback_req_func() after the mutex_unlock(). (Though I guess it's
probably unlikely that this goes wrong in practice.)
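
As a rough illustration of the second option (an extra reference that
is dropped only after the unlock has returned), here is a
single-threaded userspace model. It is not kernel code: struct
ctx_model, ctx_put(), fallback_work_model() and run_fallback_model()
are hypothetical names, a plain int stands in for ctx->refs, and a
pthread mutex stands in for uring_lock. The model does not reproduce
the race; it only shows the intended ordering, namely that the final
put (and therefore the free) can happen only after the unlock call has
returned.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Userspace model only; every name here is hypothetical. */
struct ctx_model {
    pthread_mutex_t uring_lock;  /* stands in for ctx->uring_lock */
    int refs;                    /* stands in for ctx->refs */
    int queued;                  /* queued fallback requests, one ref each */
    bool unlock_returned;        /* set once the unlock call has returned */
};

static int g_free_count;           /* number of times a ctx was freed */
static bool g_freed_after_unlock;  /* recorded at the moment of the free */

/* Drop one reference; free the ctx when the count reaches zero. */
static void ctx_put(struct ctx_model *ctx)
{
    if (--ctx->refs > 0)
        return;
    g_freed_after_unlock = ctx->unlock_returned;
    g_free_count++;
    pthread_mutex_destroy(&ctx->uring_lock);
    free(ctx);
}

/* Models the fallback worker with the proposed extra reference. */
static void fallback_work_model(struct ctx_model *ctx)
{
    ctx->refs++;                 /* extra reference held across the unlock */

    pthread_mutex_lock(&ctx->uring_lock);
    while (ctx->queued > 0) {
        ctx->queued--;
        ctx_put(ctx);            /* request completes and drops its reference */
    }
    pthread_mutex_unlock(&ctx->uring_lock);
    ctx->unlock_returned = true; /* safe: the extra reference keeps ctx alive */

    ctx_put(ctx);                /* drop the extra reference; may free ctx */
}

/* Returns 1 if the ctx was freed exactly once and only after the unlock. */
static int run_fallback_model(int nreq)
{
    struct ctx_model *ctx = malloc(sizeof(*ctx));

    if (!ctx)
        return 0;
    pthread_mutex_init(&ctx->uring_lock, NULL);
    ctx->refs = nreq;            /* only the queued requests hold references */
    ctx->queued = nreq;
    ctx->unlock_returned = false;
    g_free_count = 0;
    g_freed_after_unlock = false;

    fallback_work_model(ctx);
    return g_free_count == 1 && g_freed_after_unlock;
}
```

In this model, run_fallback_model(3) queues three requests and checks
that the only free happens after the unlock. Without the extra
reference, the last request's put inside the locked section would be
the final one, which corresponds to the situation the paragraph above
is worried about.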