Message-ID: <CANDhNCp6TPcfzJLCU-o-xScfoeCqMzYjVhGyDa23YjAqiMnA0w@mail.gmail.com>
Date: Mon, 13 Oct 2025 19:43:12 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Suleiman Souhlal <suleiman@...gle.com>, kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
kernel-team@...roid.com
Subject: Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide
necessary tri-state for proxy return-migration
On Thu, Oct 9, 2025 at 4:43 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Wed, Oct 08, 2025 at 05:07:26PM -0700, John Stultz wrote:
>
> > > I'm thinking that suggestion didn't work out somehow, but I'm still not
> > > sure I understand why.
> >
> > So the main issue is about where we end up setting the task to
> > BO_WAKING (via set_blocked_on_waking()). This is done in
> > __mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound().
> > And in those cases, we are already holding the mutex->wait_lock, and
> > sometimes a task's blocked_lock, without the rq lock. So we can't
> > just grab the rq lock out of order, and we probably shouldn't drop and
> > try to reacquire the blocked_lock and wait_lock there.
>
> Oh bugger. In my head the scheduler locks still nest inside wait_lock,
> but we've flipped that such that schedule() / find_proxy_task() can take
> it inside rq->lock.
>
> Yes that does complicate things.
>
> So suppose we have this ww_mutex cycle thing:
>
> ,-+-* Mutex-1 <-.
> Task-A ---' | | ,-- Task-B
> `-> Mutex-2 *-+-'
>
> Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
> where Task-B holds Mutex-2 and tries to acquire Mutex-1.
>
> Then the blocked_on->owner chain will go in circles.
>
> Task-A -> Mutex-2
> ^ |
> | v
> Mutex-1 <- Task-B
>
> We need two things:
>
> - find_proxy_task() to stop iterating the circle;
>
> - the woken task to 'unblock' and run, such that it can back-off and
> re-try the transaction.
>
>
> Now, the current code does:
>
> __clear_task_blocked_on();
> wake_q_add();
>
> And surely clearing ->blocked_on is sufficient to break the cycle.
>
> Suppose it is Task-B that is made to back-off, then we have:
>
> Task-A -> Mutex-2 -> Task-B (no further blocked_on)
>
> and it would attempt to run Task-B. Or worse, it could directly pick
> Task-B and run it, without ever getting into find_proxy_task().
>
> Now, here is a problem because Task-B might not be runnable on the CPU
> it is currently on; and because !task_is_blocked() we don't get into the
> proxy paths, so nobody is going to fix this up.
>
> Ideally we would have dequeued Task-B alongside of clearing
> ->blocked_on, but alas, lock inversion spoils things.
Right. Thus my adding of the blocked_on_state to try to gate the task
from running until we evaluate it for return migration.
> > Though, one approach that I just thought of would be to have a special
> > wake_up_q call, which would handle dequeuing the blocked_on tasks on
> > the wake_q before doing the wakeup? I can give that a try.
>
> I think this is racier than you considered. CPU1 could be inside
> schedule() trying to pick Task-B while CPU2 does that wound/die thing.
> No spurious wakeup required.
Yeah. I took a stab at it, but couldn't manage to rework
things without preserving the BO_WAKING guard.
Then trying to do the dequeue in the wake_up_q() really isn't that far
away from just doing it in ttwu() a little deeper in the call stack,
as we still have to take task_rq_lock() to call block_task().
> Anyway, since the actual value of ->blocked_on doesn't matter in this
> case (we really want it to be NULL, but can't because we need someone to
> go back migrate the thing), why not simply use something like:
>
> #define PROXY_STOP ((struct mutex *)(-1L))
>
> __set_task_blocked_on(task, PROXY_STOP);
>
> Then, have find_proxy_task() fix it up?
Ok, so this sounds like it sort of matches the BO_WAKING state I
currently have (replacing it with PROXY_STOP). Not
much of a logic change, but it would indeed save a bit of space.
I'll take a stab at it.
> Random thoughts:
>
> - we should probably have something like:
>
> next = pick_next_task();
> rq_set_donor(next)
> if (unlikely(task_is_blocked())) {
> ...
> }
> + WARN_ON_ONCE(next->__state);
>
> at all times the task we end up picking should be in RUNNABLE state.
>
> - similarly, we should have ttwu() check ->blocked_on is NULL ||
> PROXY_STOP, waking a task that still has a blocked_on relation can't
> be right -- ooh, dang race conditions :/ perhaps DEBUG_MUTEX and
> serialize on wait_lock.
>
> - I'm conflicted on having TTWU fix up PROXY_STOP; strictly not required
> I think, but might improve performance -- if so, include numbers in
> patch that adds it -- which should be a separate patch from the one
> that adds PROXY_STOP.
Ok, I'll work to split that logic out. The nice thing in ttwu is we
already end up taking the rq lock in ttwu_runnable() when we do the
dequeue, so yeah, I expect it would help performance.
> - since find_proxy_task() can do a lock-break, it should probably
> re-try the pick if, at the end, a higher runqueue is modified than
> the task we ended up with.
So, I think find_proxy_task() will always pick again if it releases
the rq lock. So I'm not sure I'm quite following this bit. Could you
clarify?
> Also see this thread:
>
> https://lkml.kernel.org/r/20251006105453.522934521@infradead.org
>
> eg. something like:
>
> rq->queue_mask = 0;
> // code with rq-lock-break
> if (rq_modified_above(rq, next->sched_class))
> return NULL;
>
>
> I'm still confused on BO_RUNNABLE -- you set that around
> optimistic-spin, probably because you want to retain the ->blocked_on
> relation, but also you have to run that thing to make progress. There
> are a few other sites that use it, but those are more confusing still.
Mostly I liked managing the blocked_on_state separately from the
blocked_on pointer, as I found it simplified my thinking on the mutex
lock side for cases where we wake up and then loop again. But let me
take a pass at reworking it a bit like you suggest to see how it goes.
> Please try and clarify this.
Will try to add more comments to explain.
> Anyway, if that is indeed it, you could do this by (ab)using the LSB of
> the ->blocked_on pointer I suppose (you could make PROXY_STOP -2).
One complication with using the LSB of the pointer: Suleiman was
thinking about stealing those bits to extend the blocked_on pointer for
use with other lock types (rwsem in his case).
Currently he's got it in a structure with an enum:
https://github.com/johnstultz-work/linux-dev/commit/e61b487d240782302199f6dc1d99851c3449b547
We talked a little about potentially squishing that together, but
it sort of depends on how many locking primitives we end up using
proxy-exec with.
As always, thanks again for all the feedback, I really appreciate it!
-john