Message-ID: <e88452ed-411b-419e-a41d-0ff98d54a59b@redhat.com>
Date: Fri, 11 Jul 2025 20:42:36 -0400
From: Waiman Long <llong@...hat.com>
To: Boqun Feng <boqun.feng@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Will Deacon <will@...nel.org>, linux-kernel@...r.kernel.org,
Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH] locking/mutex: Add debug code to help catching violation
of mutex lifetime rule
On 7/11/25 7:28 PM, Boqun Feng wrote:
> On Fri, Jul 11, 2025 at 03:30:05PM -0700, Linus Torvalds wrote:
>> On Fri, 11 Jul 2025 at 15:20, Boqun Feng <boqun.feng@...il.com> wrote:
>>> Meta question: are we able to construct a case that shows this can help
>>> detect the issue?
>> Well, the thing that triggered this was hopefully fixed by
>> 8c2e52ebbe88 ("eventpoll: don't decrement ep refcount while still
>> holding the ep mutex"), but I think Jann figured that one out by code
>> inspection.
>>
>> I doubt it can be triggered in real life without something like
>> Waiman's patch, but *with* Waiman's patch, and commit 8c2e52ebbe88
>> reverted (and obviously with CONFIG_KASAN and CONFIG_DEBUG_MUTEXES
>> enabled), doing lots of concurrent epoll closes would hopefully then
>> trigger the warning.
>>
>> Of course, to then find *other* potential bugs would be the whole
>> point, and some of these kinds of bugs are definitely of the kind
>> where the race condition doesn't actually trigger in any real load,
>> because it's unlikely that real loads end up doing that kind of
>> "release all these objects concurrently".
>>
>> But it might be interesting to try that "can you even recreate the bug
>> fixed by 8c2e52ebbe88" with this. Because if that one *known* bug
>> can't be found by this, then it's obviously unlikely to help find
>> others.
>>
> Yeah, I guess I asked the question because there is no clear link from
> the bug scenario to an extra context switch, that is, even if the
> context switch didn't happen, the bug would trigger if
> __mutex_unlock_slowpath() took too long after giving the ownership to
> someone else. So my instinct was: would cond_resched() be slow enough
> ;-)
>
> But I agree it's a trivial thing to do, and I think another thing we
> can do is to add a kasan_check_byte(lock) at the end of
> __mutex_unlock_slowpath(), because conceptually the mutex should be
> valid throughout the whole __mutex_unlock_slowpath() function, i.e.
>
> void __mutex_unlock_slowpath(...)
> {
>         ...
>         raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
>         // <- conceptually "lock" should still be valid here,
>         // so if anyone frees the memory of the mutex, it's going
>         // to be a problem.
>         kasan_check_byte(lock);
> }
>
> I think this may also give us a good chance of finding more bugs; one
> of the reasons is that raw_spin_unlock_irqrestore_wake() calls
> preempt_enable() at the end, which may trigger a context switch.
>
> Regards,
> Boqun
I think this is a good idea. We should extend that to add the same check
in rwsem as well. I will post a patch to do that.
Cheers,
Longman
>> That said, it does seem like an obvious trivial thing to stress, which
>> is why that patch by Waiman has my suggested-by...
>>
>> Linus