lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 10 Aug 2021 22:32:59 +0100
From:   Pavel Begunkov <asml.silence@...il.com>
To:     Nadav Amit <nadav.amit@...il.com>,
        Olivier Langlois <olivier@...llion01.com>
Cc:     Jens Axboe <axboe@...nel.dk>, io-uring@...r.kernel.org,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task
 work

On 8/10/21 9:28 AM, Nadav Amit wrote:
>> On Aug 9, 2021, at 2:48 PM, Olivier Langlois <olivier@...llion01.com> wrote:
>> On Sat, 2021-08-07 at 17:13 -0700, Nadav Amit wrote:
>>> From: Nadav Amit <namit@...are.com>
>>>
>>> When using SQPOLL, the submission queue polling thread calls
>>> task_work_run() to run queued work. However, when work is added with
>>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>>> set afterwards and is never cleared.
>>>
>>> Consequently, when the submission queue polling thread checks whether
>>> signal_pending(), it may always find a pending signal, if
>>> task_work_add() was ever called before.
>>>
>>> The impact of this bug might be different on different kernel versions.
>>> It appears that on 5.14 it would only cause unnecessary calculation and
>>> prevent the polling thread from sleeping. On 5.13, where the bug was
>>> found, it stops the polling thread from finding newly submitted work.
>>>
>>> Instead of task_work_run(), use tracehook_notify_signal() that clears
>>> TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
>>> current->task_works to avoid a race in which task_works is cleared but
>>> the TIF_NOTIFY_SIGNAL is set.
>>
>> thx a lot for this patch!
>>
>> This explains what I am seeing here:
>> https://lore.kernel.org/io-uring/4d93d0600e4a9590a48d320c5a7dd4c54d66f095.camel@trillion01.com/
>>
>> I was under the impression that task_work_run() was clearing
>> TIF_NOTIFY_SIGNAL.
>>
>> your patch made me realize that it does not…
> 
> Happy it could help.
> 
> Unfortunately, there seems to be yet another issue (unless my code
> somehow caused it). It seems that when SQPOLL is used, there are cases
> in which we get stuck in io_uring_cancel_sqpoll() when tctx_inflight()
> never goes down to zero.
> 
> Debugging... (while also trying to make some progress with my code)

It's most likely because a request has been lost (mis-refcounted).
Let us know if you need any help. Would be great to solve it for 5.14.
quick tips: 

1) if not already, try out Jens' 5.14 branch
git://git.kernel.dk/linux-block io_uring-5.14

2) try to characterise the io_uring use pattern. Poll requests?
Read/write requests? Send/recv? Filesystem vs bdev vs sockets?

If easily reproducible, you can match io_alloc_req() with it
getting into io_dismantle_req();

-- 
Pavel Begunkov

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ