lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aHGeF7ko_4uXHUgl@tardis-2.local>
Date: Fri, 11 Jul 2025 16:28:23 -0700
From: Boqun Feng <boqun.feng@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Waiman Long <longman@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
	linux-kernel@...r.kernel.org, Jann Horn <jannh@...gle.com>
Subject: Re: [PATCH] locking/mutex: Add debug code to help catching violation
 of mutex lifetime rule

On Fri, Jul 11, 2025 at 03:30:05PM -0700, Linus Torvalds wrote:
> On Fri, 11 Jul 2025 at 15:20, Boqun Feng <boqun.feng@...il.com> wrote:
> >
> > Meta question: are we able to construct a case that shows this can help
> > detect the issue?
> 
> Well, the thing that triggered this was hopefully fixed by
> 8c2e52ebbe88 ("eventpoll: don't decrement ep refcount while still
> holding the ep mutex"), but I think Jann figured that one out by code
> inspection.
> 
> I doubt it can be triggered in real life without something like
> Waiman's patch, but *with* Waiman's patch, and commit 8c2e52ebbe88
> reverted (and obviously with CONFIG_KASAN and CONFIG_DEBUG_MUTEXES
> enabled), doing lots of concurrent epoll closes would hopefully then
> trigger the warning.
> 
> Of course, to then find *other* potential bugs would be the whole
> point, and some of these kinds of bugs are definitely of the kind
> where the race condition doesn't actually trigger in any real load,
> because it's unlikely that real loads end up doing that kind of
> "release all these objects concurrently".
> 
> But it might be interesting to try that "can you even recreate the bug
> fixed by 8c2e52ebbe88" with this. Because if that one *known* bug
> can't be found by this, then it's obviously unlikely to help find
> others.
> 

Yeah, I guess I asked the question because there is no clear link from
the bug scenario to an extra context switch, that is, even if the
context switch didn't happen, the bug would trigger if
__mutex_unlock_slowpath() took too long after giving the ownership to
someone else. So my instinct was: would cond_resched() be slow enough
;-)

But I agree it's a trivel thing to do, and I think another thing we can
do is adding a kasan_check_byte(lock) at the end of
__mutex_unlock_slowpath(), because conceptually the mutex should be
valid throughout the whole __mutex_unlock_slowpath() function, i.e.

	void __mutex_unlock_slowpath(...)
	{
		...
		raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
		// <- conceptually "lock" should still be valid here.
		// so if anyone free the memory of the mutex, it's going
		// to be a problem.
		kasan_check_byte(lock);
	}

I think this may also give us a good chance of finding more bugs, one of
the reasons is that raw_spin_unlock_irqrestore_wake() has a
preempt_enable() at last, which may trigger a context switch.

Regards,
Boqun

> That said, it does seem like an obvious trivial thing to stress, which
> is why that patch by Waiman has my suggested-by...
> 
>           Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ