[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whhcoeTOZB_de-Nh28Fy4iNTu2DXzKXEPOubzL36+ME=A@mail.gmail.com>
Date: Thu, 6 Jan 2022 14:59:36 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Tejun Heo <tj@...nel.org>
Cc: Eric Biggers <ebiggers@...nel.org>,
Zefan Li <lizefan.x@...edance.com>,
Johannes Weiner <hannes@...xchg.org>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ingo Molnar <mingo@...hat.com>,
Hillf Danton <hdanton@...a.com>,
syzbot <syzbot+cdb5dd11c97cc532efad@...kaller.appspotmail.com>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
Linux-MM <linux-mm@...ck.org>
Subject: Re: psi_trigger_poll() is completely broken
On Thu, Jan 6, 2022 at 2:05 PM Tejun Heo <tj@...nel.org> wrote:
>
> On Wed, Jan 05, 2022 at 11:13:30AM -0800, Linus Torvalds wrote:
> >
> > What are the users? Can we make the rule for -EBUSY simply be that you
> > can _install_ a trigger, but you can't replace an existing one (except
> > with NULL, when you close it).
>
> I don't have enough context here and Johannes seems offline today. Let's
> wait for him to chime in.
Context here:
https://lore.kernel.org/all/YdW3WfHURBXRmn%2F6@sol.localdomain/
> IIRC, the rationale for the shared trigger at the time was around the
> complexities of preventing it from devolving into O(N) trigger checks on
> every pressure update. If the overriding behavior is something that can be
> changed, I'd prefer going for per-opener triggers even if that involves
> adding complexities (maybe a rbtree w/ prev/next links for faster sweeps?).
So here's a COMPLETELY UNTESTED patch to try to fix the lifetime and locking.
The locking was completely broken, in that psi_trigger_replace()
expected that the caller would hold some exclusive lock so that it
would release the correct previous trigger. The cgroup code doesn't
seem to have any such exclusion.
This (UNTESTED!) patch fixes that breakage by just using a cmpxchg loop.
And the lifetime was completely broken (and that's Eric's email)
because psi_trigger_replace() would drop the refcount to the old
trigger - assuming it got the right one - even though the old trigger
could still have active waiters on the waitqueue due to poll() or
select().
This (UNTESTED!) patch fixes _that_ breakage by making
psi_trigger_replace() instead just put the previous trigger on the
"stale_trigger" linked list, and never release it at all.
It now gets released by "psi_trigger_release()" instead, which walks
the list at file release time. Doing "psi_trigger_replace(.., NULL)"
is not valid any more.
And because the reference cannot go away, we now can throw away all
the incorrect temporary kref_get/put games from psi_trigger_poll(),
which didn't actually fix the race at all, only limited it to the poll
waitqueue.
That also means we can remove the "synchronize_rcu()" from
psi_trigger_destroy(), since that was trying to hide all the problems
with the "take rcu lock and then do kref_get()" thing not having
locking. The locking still doesn't exist, but since we don't release
the old one when replacing it, the issue is moot.
NOTE NOTE NOTE! Not only is this patch entirely untested, there are
optimizations you could do if there was some sane synchronization
between psi_trigger_poll() and psi_trigger_replace(). I put comments
about it in the code, but right now the code just assumes that
replacing a trigger is fairly rare (and since it requires write
permissions, it's not something random users can do).
I'm not proud of this patch, but I think it might fix the fundamental
bugs in the code for now.
It's not lovely, it has room for improvement, and I wish we didn't
need this kind of thing, but it looks superficially sane as a fix to
me.
Comments?
And once again: this is UNTESTED. I've compiled-tested it, it looks
kind of sane to me, but honestly, I don't know the code very well.
Also, I'm not super-happy with how that 'psi_disabled' static branch
works. If somebody switches it off after it has been on, that will
also disable the freeing code, so now you'll be leaking memory.
I couldn't find it in myself to care.
Eric - you have the test-case, and the eagle-eyes that found this
problem in the first place. As such, your opinion and comments count
more than most...
Linus
View attachment "patch.diff" of type "text/x-patch" (5084 bytes)
Powered by blists - more mailing lists