Message-ID: <aHGeF7ko_4uXHUgl@tardis-2.local>
Date: Fri, 11 Jul 2025 16:28:23 -0700
From: Boqun Feng <boqun.feng@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Waiman Long <longman@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
linux-kernel@...r.kernel.org, Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH] locking/mutex: Add debug code to help catching violation
of mutex lifetime rule
On Fri, Jul 11, 2025 at 03:30:05PM -0700, Linus Torvalds wrote:
> On Fri, 11 Jul 2025 at 15:20, Boqun Feng <boqun.feng@...il.com> wrote:
> >
> > Meta question: are we able to construct a case that shows this can help
> > detect the issue?
>
> Well, the thing that triggered this was hopefully fixed by
> 8c2e52ebbe88 ("eventpoll: don't decrement ep refcount while still
> holding the ep mutex"), but I think Jann figured that one out by code
> inspection.
>
> I doubt it can be triggered in real life without something like
> Waiman's patch, but *with* Waiman's patch, and commit 8c2e52ebbe88
> reverted (and obviously with CONFIG_KASAN and CONFIG_DEBUG_MUTEXES
> enabled), doing lots of concurrent epoll closes would hopefully then
> trigger the warning.
>
> Of course, to then find *other* potential bugs would be the whole
> point, and some of these kinds of bugs are definitely of the kind
> where the race condition doesn't actually trigger in any real load,
> because it's unlikely that real loads end up doing that kind of
> "release all these objects concurrently".
>
> But it might be interesting to try that "can you even recreate the bug
> fixed by 8c2e52ebbe88" with this. Because if that one *known* bug
> can't be found by this, then it's obviously unlikely to help find
> others.
>
Yeah, I guess I asked the question because there is no clear link from
the bug scenario to an extra context switch; that is, even without the
context switch, the bug would still trigger if
__mutex_unlock_slowpath() took too long after handing ownership to
someone else. So my instinct was: would cond_resched() be slow enough?
;-)
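
To make the scenario concrete, here is a minimal sketch of the
lifetime rule violation (struct obj, task_a() and task_b() are
hypothetical names for illustration, not taken from eventpoll):

#include <linux/mutex.h>
#include <linux/slab.h>

struct obj {
	struct mutex lock;
};

/* Task A: current owner, releasing the lock via the slowpath. */
void task_a(struct obj *o)
{
	mutex_unlock(&o->lock);
	/*
	 * __mutex_unlock_slowpath() may hand ownership to task B and
	 * then keep touching o->lock before returning. If task B runs
	 * in that window and frees the object, those remaining
	 * accesses are a use-after-free, whether or not a context
	 * switch is what made the window wide enough.
	 */
}

/* Task B: new owner and last user of the object. */
void task_b(struct obj *o)
{
	mutex_lock(&o->lock);	/* acquired via the handoff from A */
	mutex_unlock(&o->lock);
	kfree(o);		/* may free the mutex while A is still
				 * inside __mutex_unlock_slowpath() */
}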
But I agree it's a trivial thing to do, and I think another thing we
can do is to add a kasan_check_byte(lock) at the end of
__mutex_unlock_slowpath(), because conceptually the mutex should
remain valid throughout the whole __mutex_unlock_slowpath() function,
i.e.:
void __mutex_unlock_slowpath(...)
{
	...
	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);

	// <- conceptually "lock" should still be valid here, so if
	// anyone frees the memory of the mutex, it's going to be a
	// problem.
	kasan_check_byte(lock);
}
I think this may also give us a good chance of finding more bugs; one
reason is that raw_spin_unlock_irqrestore_wake() has a
preempt_enable() at the end, which may trigger a context switch.
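
For reference, raw_spin_unlock_irqrestore_wake() does roughly the
following (a paraphrased sketch, not the exact definition in the
tree):

static inline void raw_spin_unlock_irqrestore_wake(raw_spinlock_t *lock,
						   unsigned long flags,
						   struct wake_q_head *wake_q)
{
	preempt_disable();
	raw_spin_unlock_irqrestore(lock, flags);
	wake_up_q(wake_q);	/* wake the new mutex owner */
	preempt_enable();	/* <- may schedule here: the new owner can
				 * run, finish with the mutex, and free it
				 * while we are still inside
				 * __mutex_unlock_slowpath() */
}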
Regards,
Boqun
> That said, it does seem like an obvious trivial thing to stress, which
> is why that patch by Waiman has my suggested-by...
>
> Linus