linux-kernel - Re: [syzbot] KCSAN: data-race in __io_uring_cancel / io_uring_try_cancel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5cf2250a-c580-4dbf-5997-e987c7b71086@gmail.com>
Date:   Wed, 26 May 2021 21:31:20 +0100
From:   Pavel Begunkov <asml.silence@...il.com>
To:     Marco Elver <elver@...gle.com>
Cc:     Jens Axboe <axboe@...nel.dk>,
        syzbot <syzbot+73554e2258b7b8bf0bbf@...kaller.appspotmail.com>,
        io-uring@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [syzbot] KCSAN: data-race in __io_uring_cancel /
 io_uring_try_cancel_requests

On 5/26/21 5:36 PM, Marco Elver wrote:
> On Wed, 26 May 2021 at 18:29, Pavel Begunkov <asml.silence@...il.com> wrote:
>> On 5/26/21 4:52 PM, Marco Elver wrote:
>>> Due to some moving around of code, the patch lost the actual fix (using
>>> atomically read io_wq) -- so here it is again ... hopefully as intended.
>>> :-)
>>
>> "fortify" damn it... It was synchronised with &ctx->uring_lock
>> before, see io_uring_try_cancel_iowq() and io_uring_del_tctx_node(),
>> so should not clear before *del_tctx_node()
> 
> Ah, so if I understand right, the property stated by the comment in
> io_uring_try_cancel_iowq() was broken, and your patch below would fix
> that, right?

"io_uring: fortify tctx/io_wq cleanup" broke it and the diff
should fix it.

>> The fix should just move it after this sync point. Will you send
>> it out as a patch?
> 
> Do you mean your move of write to io_wq goes on top of the patch I
> proposed? (If so, please also leave your Signed-of-by so I can squash
> it.)

No, only my diff, but you hinted on what has happened, so I would
prefer you to take care of patching. If you want of course.

To be entirely fair, assuming that aligned ptr
reads can't be torn, I don't see any _real_ problem. But surely
the report is very helpful and the current state is too wonky, so
should be patched.

TL;DR;
The synchronisation goes as this: it's usually used by the owner
task, and the owner task deletes it, so is mostly naturally
synchronised. An exception is a worker (not only) that accesses
it for cancellation purpose, but it uses it only under ->uring_lock,
so if removal is also taking the lock it should be fine. see
io_uring_del_tctx_node() locking.

> 
> So if I understand right, we do in fact have 2 problems:
> 1. the data race as I noted in my patch, and

Yes, and it deals with it

> 2. the fact that io_wq does not live long enough.

Nope, io_wq outlives them fine. 

> Did I get it right?
> 
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 7db6aaf31080..b76ba26b4c6c 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -9075,11 +9075,12 @@ static void io_uring_clean_tctx(struct io_uring_task *tctx)
>>         struct io_tctx_node *node;
>>         unsigned long index;
>>
>> -       tctx->io_wq = NULL;
>>         xa_for_each(&tctx->xa, index, node)
>>                 io_uring_del_tctx_node(index);
>> -       if (wq)
>> +       if (wq) {
>> +               tctx->io_wq = NULL;
>>                 io_wq_put_and_exit(wq);
>> +       }
>>  }
>>
>>  static s64 tctx_inflight(struct io_uring_task *tctx, bool tracked)

-- 
Pavel Begunkov