lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 8 Jan 2021 08:13:25 -0700
From:   Jens Axboe <axboe@...nel.dk>
To:     Al Viro <viro@...iv.linux.org.uk>
Cc:     linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Song Liu <songliubraving@...com>
Subject: Re: [PATCH] fs: process fput task_work with TWA_SIGNAL

On 1/7/21 11:46 PM, Al Viro wrote:
> On Fri, Jan 08, 2021 at 05:26:51AM +0000, Al Viro wrote:
>> On Tue, Jan 05, 2021 at 11:29:11AM -0700, Jens Axboe wrote:
>>> Song reported a boot regression in a kvm image with 5.11-rc, and bisected
>>> it down to the below patch. Debugging this issue, turns out that the boot
>>> stalled when a task is waiting on a pipe being released. As we no longer
>>> run task_work from get_signal() unless it's queued with TWA_SIGNAL, the
>>> task goes idle without running the task_work. This prevents ->release()
>>> from being called on the pipe, which another boot task is waiting on.
>>>
>>> Use TWA_SIGNAL for the file fput work to ensure it's run before the task
>>> goes idle.
>>>
>>> Fixes: 98b89b649fce ("signal: kill JOBCTL_TASK_WORK")
>>> Reported-by: Song Liu <songliubraving@...com>
>>> Signed-off-by: Jens Axboe <axboe@...nel.dk>
>>>
>>> ---
>>>
>>> The other alternative here is obviously to re-instate the:
>>>
>>> if (unlikely(current->task_works))
>>> 	task_work_run();
>>>
>>> in get_signal() that we had before this change. Might be safer in case
>>> there are other cases that need to ensure the work is run in a timely
>>> fashion, though I do think it's cleaner to long term to correctly mark
>>> task_work with the needed notification type. Comments welcome...
>>
>> Interesting...  I think I've missed the discussion of that thing; could
>> you forward the relevant thread my way or give an archive link to it?

The initial report from Song was off list, and I just worked from that
to get to understanding the issue. Most of it is in the commit message,
but the debugging basically involved figuring out what the stuck task
was doing (it was in idle), and that it still had pending task_work. The
task_work was 5 entries of ____fput, with 4 being ext4 files, and 1
being a pipe. So that lead to the theory of the pipe not being released,
and hence why we were stuck.

> Actually, why do we need TWA_RESUME at all?  OK, a while ago you've added
> a way for task_work_add() to do wake_up_signal().  Fine, so if the sucker
> had been asleep in get_signal(), it gets woken up and the work gets run
> fast.  Irrelevant for those who did task_work_add() for themselves.
> With that commit, though, you've suddenly changed the default behaviour -
> now if you do that task_work_add() for current *and* get asleep in
> get_signal(), task_work_add() gets delayed - potentially for a very
> long time.

Right, this is why I brought up that we can re-instate the get_signal()
running task_work unconditionally as another way of fixing it, because
that change was inadvertently done as part of the commit that killed off
JOBCTL_TASK_WORK.

> Now the default (TWA_RESUME) has changed semantics; matter of fact,
> TWA_SIGNAL seems to be a lot closer than what we used to have.  I'm
> too sleepy right now to check if there are valid usecases for your new
> TWA_RESUME behaviour, but I very much doubt that old callers (before
> the TWA_RESUME/TWA_SIGNAL split) want that.
> 
> In particular, for mntput_no_expire() we definitely do *not* want
> that, same as with fput().  Same, AFAICS, for YAMA report_access().
> And for binder_deferred_fd_close().  And task_tick_numa() looks that
> way as well...
> 
> Anyway, bedtime for me; right now it looks like at least for task ==
> current we always want TWA_SIGNAL.  I'll look into that more tomorrow
> when I get up, but so far it smells like switching everything to
> TWA_SIGNAL would be the right thing to do, if not going back to bool
> notify for task_work_add()...

Before the change, the fact that we ran task_work off get_signal() and
thus processed even non-notify work in that path was a bit of a mess,
imho. If you have work that needs processing now, in the same manner as
signals, then you really should be using TWA_SIGNAL. For this pipe case,
and I'd need to setup and reproduce it again, the task must have a
signal pending and that would have previously caused the task_work to
run, and now it does not. TWA_RESUME technically didn't change its
behavior, it's still the same notification type, we just don't run
task_work unconditionally (regardless of notification type) from
get_signal().

I think the main question here is if we want to re-instate the behavior
of running task_work off get_signal(). I'm leaning towards not doing
that and ensuring that callers that DO need that are using TWA_SIGNAL.

-- 
Jens Axboe

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ