Message-ID: <20251009114302.GI3245006@noisy.programming.kicks-ass.net>
Date: Thu, 9 Oct 2025 13:43:02 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: John Stultz <jstultz@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>,
Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>,
Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Suleiman Souhlal <suleiman@...gle.com>,
kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
kernel-team@...roid.com
Subject: Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide
necessary tri-state for proxy return-migration
On Wed, Oct 08, 2025 at 05:07:26PM -0700, John Stultz wrote:
> > I'm thinking that suggestion didn't work out somehow, but I'm still not
> > sure I understand why.
>
> So the main issue is about where we end up setting the task to
> BO_WAKING (via set_blocked_on_waking()). This is done in
> __mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound().
> And in those cases, we are already holding the mutex->wait_lock, and
> sometimes a task's blocked_lock, without the rq lock. So we can't
> just grab the rq lock out of order, and we probably shouldn't drop and
> try to reacquire the blocked_lock and wait_lock there.
Oh bugger. In my head the scheduler locks still nest inside wait_lock,
but we've flipped that such that schedule() / find_proxy_task() can take
it inside rq->lock.
Yes that does complicate things.
So suppose we have this ww_mutex cycle thing:
           ,--* Mutex-1 <--.
   Task-A -+               +- Task-B
           `-> Mutex-2 *---'
Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.
Then the blocked_on->owner chain will go in circles.
   Task-A -> Mutex-2
      ^         |
      |         v
   Mutex-1 <- Task-B
We need two things:
- find_proxy_task() to stop iterating the circle;
- the woken task to 'unblock' and run, such that it can back-off and
re-try the transaction.
Now, the current code does:
	__clear_task_blocked_on();
	wake_q_add();
And surely clearing ->blocked_on is sufficient to break the cycle.
Suppose it is Task-B that is made to back-off, then we have:
Task-A -> Mutex-2 -> Task-B (no further blocked_on)
and it would attempt to run Task-B. Or worse, it could directly pick
Task-B and run it, without ever getting into find_proxy_task().
Now, here is a problem because Task-B might not be runnable on the CPU
it is currently on; and because !task_is_blocked() we don't get into the
proxy paths, so nobody is going to fix this up.
Ideally we would have dequeued Task-B alongside clearing ->blocked_on,
but alas, lock inversion spoils things.
> Though, one approach that I just thought of would be to have a special
> wake_up_q call, which would handle dequeuing the blocked_on tasks on
> the wake_q before doing the wakeup? I can give that a try.
I think this is racier than you considered. CPU1 could be inside
schedule() trying to pick Task-B while CPU2 does that wound/die thing.
No spurious wakeup required.
Anyway, since the actual value of ->blocked_on doesn't matter in this
case (we really want it to be NULL, but can't because we need someone to
go back migrate the thing), why not simply use something like:
	#define PROXY_STOP	((struct mutex *)(-1L))

	__set_task_blocked_on(task, PROXY_STOP);
Then, have find_proxy_task() fix it up?
Random thoughts:
- we should probably have something like:
	next = pick_next_task();
	rq_set_donor(rq, next);
	if (unlikely(task_is_blocked(next))) {
		...
	}
+	WARN_ON_ONCE(next->__state);
at all times the task we end up picking should be in RUNNABLE state.
- similarly, we should have ttwu() check ->blocked_on is NULL ||
  PROXY_STOP; waking a task that still has a blocked_on relation can't
  be right -- ooh, dang race conditions :/ Perhaps do it under
  DEBUG_MUTEXES and serialize on wait_lock.
- I'm conflicted on having TTWU fix up PROXY_STOP; strictly not required
I think, but might improve performance -- if so, include numbers in
patch that adds it -- which should be a separate patch from the one
that adds PROXY_STOP.
- since find_proxy_task() can do a lock-break, it should probably
  re-try the pick if, at the end, the runqueue was modified at a higher
  sched class than that of the task we ended up with.
Also see this thread:
https://lkml.kernel.org/r/20251006105453.522934521@infradead.org
eg. something like:
	rq->queue_mask = 0;

	// code with rq-lock-break

	if (rq_modified_above(rq, next->sched_class))
		return NULL;
I'm still confused on BO_RUNNABLE -- you set that around
optimistic-spin, probably because you want to retain the ->blocked_on
relation, but also you have to run that thing to make progress. There
are a few other sites that use it, but those are more confusing still.
Please try and clarify this.
Anyway, if that is indeed it, you could do this by (ab)using the LSB of
the ->blocked_on pointer I suppose (you could make PROXY_STOP -2).