[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whRxcmjvGNBKi9_x59cAedh8SO8wsNDNrEQbAQfM5A8CQ@mail.gmail.com>
Date: Mon, 29 Jan 2024 17:50:38 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org, Masami Hiramatsu <mhiramat@...nel.org>,
Mark Rutland <mark.rutland@....com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Christian Brauner <brauner@...nel.org>, Al Viro <viro@...iv.linux.org.uk>,
Ajay Kaher <ajay.kaher@...adcom.com>, linux-trace-kernel@...r.kernel.org
Subject: Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 at 16:35, Steven Rostedt <rostedt@...dmis.org> wrote:
>
> # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
> # ls -l events/kprobes/
> ls: cannot access 'events/kprobes/': No such file or directory
>
> Where it should now exist but doesn't. But the lookup code never triggered.
>
> If the lookup fails, does it cache the result?
I think you end up still having negative dentries around.
The old code then tried to compensate for that by trying to remember
the old dentry with 'ei->dentry' and 'ei->d_children[]', and would at
lookup time try to use the *old* dentry instead of the new one.
And because dentries are just caches and can go away, it then had that
odd dance with '.d_iput', so that when a dentry was removed, it would
be removed from the 'ei->dentry' and 'ei->d_children[]' array too.
Except that d_iput() of an old dentry isn't actually serialized with
->d_lookup() in any way, so you end up with the whole race that I
already talked about earlier, where you could still have an
'ei->dentry' that pointed to something that had already been unhashed,
but d_iput() hadn't been called *yet*, so d_lookup() is called with a
new dentry, but the tracefs code then desperately tries to use the old
dentry pointer that just isn't _valid_ any more, but it doesn't know
that because d_iput() hasn't been called yet...
And as I *also* pointed out when I described that originally, you'll
practically never hit this race, because you just need to be *very*
unlucky with the whole "dentry is freed due to memory pressure".
But basically, this is why I absolutely *HATE* that "ei->dentry"
backpointer. It's truly fundamentally broken.
You can't reference-count it, since the whole point of your current
tracefs scheme is to *not* keep dentries and inodes around forever,
and doing a "dget()" on that 'ei->dentry' would thus fundamentally
screw that up.
But you also cannot keep it in sync with dentries being released due
to memory pressure, because of the above thing.
See why I've tried to tell you that the back-pointer is basically a
100% sign of a bug.
The *only* time you can have a valid dentry pointer is when you have
also taken a ref to it with dget(), and you can't do that.
So then you have all that completely broken code that _tries_ to
maintain consistency with ->d_children[] etc, and it works 99.9% in
practice, because the race is just so hard to hit because dentries
only normally get evicted either synchronously (which you do under the
eventfs_mutex) or under memory pressure (which is basically never
going to be something you can test).
And yes, my lookup patch removed all the band-aids for "if I have an
ei->dentry, I'll reuse it". So I think it ends up exposing all the
previous bugs that the old "let's reuse the old dentry" code tried to
hide.
But, as mentioned, that ei->dentry pointer really REALLY is broken.
NBow, having looked at this a lot, I think I have a way forward.
Because there is actually *one* case where you actually *do* do the
whole "dget()" to get a stable dentry pointer. And that's exactly the
"events" directory creation (ie eventfs_create_events_dir()).
So what I propose is that
- ei->dentry and ei->d_children[] need to die. Really. They are
buggy. There is no way to save them. There never was.
- but we *can* introduce a new 'ei->events_dir' pointer that is
*only* set by eventfs_create_events_dir(), and which is stable exactly
because that function also does a dget() on it, so now the dentry will
actually continue to exist reliably
I think that works. The only thing that actually *needs* the existing
'ei->dentry' is literally the eventfs_remove_events_dir() that gets
rid of the stable events directory. It's undoing
eventfs_create_events_dir(), and it will do the final dput() too.
I will try to make a patch for this. I do think it means that every
time we do that
dentry->d_fsdata = ei;
we need to also do proper reference counting of said 'ei'. Because we
can't release 'ei' early when we have dentries that point to it.
Let me see how painful this will be.
Linus
Powered by blists - more mailing lists