[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <A4DC14BA-74CA-41DB-BE08-D7B693C11AE0@gmail.com>
Date: Tue, 10 Aug 2021 01:28:01 -0700
From: Nadav Amit <nadav.amit@...il.com>
To: Olivier Langlois <olivier@...llion01.com>
Cc: Jens Axboe <axboe@...nel.dk>, io-uring@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Pavel Begunkov <asml.silence@...il.com>
Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task
work
> On Aug 9, 2021, at 2:48 PM, Olivier Langlois <olivier@...llion01.com> wrote:
>
> On Sat, 2021-08-07 at 17:13 -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@...are.com>
>>
>> When using SQPOLL, the submission queue polling thread calls
>> task_work_run() to run queued work. However, when work is added with
>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>> set afterwards and is never cleared.
>>
>> Consequently, when the submission queue polling thread checks whether
>> signal_pending(), it may always find a pending signal, if
>> task_work_add() was ever called before.
>>
>> The impact of this bug might be different on different kernel versions.
>> It appears that on 5.14 it would only cause unnecessary calculation and
>> prevent the polling thread from sleeping. On 5.13, where the bug was
>> found, it stops the polling thread from finding newly submitted work.
>>
>> Instead of task_work_run(), use tracehook_notify_signal() that clears
>> TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
>> current->task_works to avoid a race in which task_works is cleared but
>> the TIF_NOTIFY_SIGNAL is set.
>
> thx a lot for this patch!
>
> This explains what I am seeing here:
> https://lore.kernel.org/io-uring/4d93d0600e4a9590a48d320c5a7dd4c54d66f095.camel@trillion01.com/
>
> I was under the impression that task_work_run() was clearing
> TIF_NOTIFY_SIGNAL.
>
> your patch made me realize that it does not…
Happy it could help.
Unfortunately, there seems to be yet another issue (unless my code
somehow caused it). It seems that when SQPOLL is used, there are cases
in which we get stuck in io_uring_cancel_sqpoll() when tctx_inflight()
never goes down to zero.
Debugging... (while also trying to make some progress with my code)
Powered by blists - more mailing lists