[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <245fba32-76cc-c4e1-6007-0b1f8a22a86b@kernel.dk>
Date: Fri, 8 Jan 2021 08:13:25 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: linux-fsdevel <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Oleg Nesterov <oleg@...hat.com>,
Song Liu <songliubraving@...com>
Subject: Re: [PATCH] fs: process fput task_work with TWA_SIGNAL
On 1/7/21 11:46 PM, Al Viro wrote:
> On Fri, Jan 08, 2021 at 05:26:51AM +0000, Al Viro wrote:
>> On Tue, Jan 05, 2021 at 11:29:11AM -0700, Jens Axboe wrote:
>>> Song reported a boot regression in a kvm image with 5.11-rc, and bisected
>>> it down to the below patch. Debugging this issue, turns out that the boot
>>> stalled when a task is waiting on a pipe being released. As we no longer
>>> run task_work from get_signal() unless it's queued with TWA_SIGNAL, the
>>> task goes idle without running the task_work. This prevents ->release()
>>> from being called on the pipe, which another boot task is waiting on.
>>>
>>> Use TWA_SIGNAL for the file fput work to ensure it's run before the task
>>> goes idle.
>>>
>>> Fixes: 98b89b649fce ("signal: kill JOBCTL_TASK_WORK")
>>> Reported-by: Song Liu <songliubraving@...com>
>>> Signed-off-by: Jens Axboe <axboe@...nel.dk>
>>>
>>> ---
>>>
>>> The other alternative here is obviously to re-instate the:
>>>
>>> if (unlikely(current->task_works))
>>> task_work_run();
>>>
>>> in get_signal() that we had before this change. Might be safer in case
>>> there are other cases that need to ensure the work is run in a timely
>>> fashion, though I do think it's cleaner to long term to correctly mark
>>> task_work with the needed notification type. Comments welcome...
>>
>> Interesting... I think I've missed the discussion of that thing; could
>> you forward the relevant thread my way or give an archive link to it?
The initial report from Song was off list, and I just worked from that
to get to understanding the issue. Most of it is in the commit message,
but the debugging basically involved figuring out what the stuck task
was doing (it was in idle), and that it still had pending task_work. The
task_work was 5 entries of ____fput, with 4 being ext4 files, and 1
being a pipe. So that lead to the theory of the pipe not being released,
and hence why we were stuck.
> Actually, why do we need TWA_RESUME at all? OK, a while ago you've added
> a way for task_work_add() to do wake_up_signal(). Fine, so if the sucker
> had been asleep in get_signal(), it gets woken up and the work gets run
> fast. Irrelevant for those who did task_work_add() for themselves.
> With that commit, though, you've suddenly changed the default behaviour -
> now if you do that task_work_add() for current *and* get asleep in
> get_signal(), task_work_add() gets delayed - potentially for a very
> long time.
Right, this is why I brought up that we can re-instate the get_signal()
running task_work unconditionally as another way of fixing it, because
that change was inadvertently done as part of the commit that killed off
JOBCTL_TASK_WORK.
> Now the default (TWA_RESUME) has changed semantics; matter of fact,
> TWA_SIGNAL seems to be a lot closer than what we used to have. I'm
> too sleepy right now to check if there are valid usecases for your new
> TWA_RESUME behaviour, but I very much doubt that old callers (before
> the TWA_RESUME/TWA_SIGNAL split) want that.
>
> In particular, for mntput_no_expire() we definitely do *not* want
> that, same as with fput(). Same, AFAICS, for YAMA report_access().
> And for binder_deferred_fd_close(). And task_tick_numa() looks that
> way as well...
>
> Anyway, bedtime for me; right now it looks like at least for task ==
> current we always want TWA_SIGNAL. I'll look into that more tomorrow
> when I get up, but so far it smells like switching everything to
> TWA_SIGNAL would be the right thing to do, if not going back to bool
> notify for task_work_add()...
Before the change, the fact that we ran task_work off get_signal() and
thus processed even non-notify work in that path was a bit of a mess,
imho. If you have work that needs processing now, in the same manner as
signals, then you really should be using TWA_SIGNAL. For this pipe case,
and I'd need to setup and reproduce it again, the task must have a
signal pending and that would have previously caused the task_work to
run, and now it does not. TWA_RESUME technically didn't change its
behavior, it's still the same notification type, we just don't run
task_work unconditionally (regardless of notification type) from
get_signal().
I think the main question here is if we want to re-instate the behavior
of running task_work off get_signal(). I'm leaning towards not doing
that and ensuring that callers that DO need that are using TWA_SIGNAL.
--
Jens Axboe
Powered by blists - more mailing lists