linux-kernel - Re: [syzbot] KCSAN: data-race in __io_uring_cancel / io_uring_try_cancel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <3dac0367-5a42-fdc5-f103-b508a692d130@gmail.com>
Date:   Thu, 27 May 2021 11:05:42 +0100
From:   Pavel Begunkov <asml.silence@...il.com>
To:     Marco Elver <elver@...gle.com>
Cc:     Jens Axboe <axboe@...nel.dk>,
        syzbot <syzbot+73554e2258b7b8bf0bbf@...kaller.appspotmail.com>,
        io-uring@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [syzbot] KCSAN: data-race in __io_uring_cancel /
 io_uring_try_cancel_requests

On 5/27/21 10:32 AM, Marco Elver wrote:
> On Wed, May 26, 2021 at 09:31PM +0100, Pavel Begunkov wrote:
>> On 5/26/21 5:36 PM, Marco Elver wrote:
>>> On Wed, 26 May 2021 at 18:29, Pavel Begunkov <asml.silence@...il.com> wrote:
>>>> On 5/26/21 4:52 PM, Marco Elver wrote:
>>>>> Due to some moving around of code, the patch lost the actual fix (using
>>>>> atomically read io_wq) -- so here it is again ... hopefully as intended.
>>>>> :-)
>>>>
>>>> "fortify" damn it... It was synchronised with &ctx->uring_lock
>>>> before, see io_uring_try_cancel_iowq() and io_uring_del_tctx_node(),
>>>> so should not clear before *del_tctx_node()
>>>
>>> Ah, so if I understand right, the property stated by the comment in
>>> io_uring_try_cancel_iowq() was broken, and your patch below would fix
>>> that, right?
>>
>> "io_uring: fortify tctx/io_wq cleanup" broke it and the diff
>> should fix it.
>>
>>>> The fix should just move it after this sync point. Will you send
>>>> it out as a patch?
>>>
>>> Do you mean your move of write to io_wq goes on top of the patch I
>>> proposed? (If so, please also leave your Signed-of-by so I can squash
>>> it.)
>>
>> No, only my diff, but you hinted on what has happened, so I would
>> prefer you to take care of patching. If you want of course.
>>
>> To be entirely fair, assuming that aligned ptr
>> reads can't be torn, I don't see any _real_ problem. But surely
>> the report is very helpful and the current state is too wonky, so
>> should be patched.
> 
> In the current version, it is a problem if we end up with a double-read,
> as it is in the current C code. The compiler might of course optimize
> it into 1 read into a register.

Absolutely agree on that

> Tangent: I avoid reasoning in terms of compiler optimizations where
> I can. :-) It's is a slippery slope if the code in question isn't
> tolerant to data races by design (examples are stats counting, or other
> heuristics -- in the case here that's certainly not the case).
> Therefore, my wish is that we really ought to resolve as many data races
> as we can (+ mark intentional ones appropriately). Also, so that we're
> left with only the interesting cases like in the case here.  (More
> background if you're interested: https://lwn.net/Articles/816850/)
> 
> The problem here, however, has a nicer resolution as you suggested.
> 
>> TL;DR;
>> The synchronisation goes as this: it's usually used by the owner
>> task, and the owner task deletes it, so is mostly naturally
>> synchronised. An exception is a worker (not only) that accesses
>> it for cancellation purpose, but it uses it only under ->uring_lock,
>> so if removal is also taking the lock it should be fine. see
>> io_uring_del_tctx_node() locking.
> 
> Did you mean io_uring_del_task_file()? There is no
> io_uring_del_tctx_node().

Ah, yes, that's from patches I sent for next.

-- 
Pavel Begunkov