Message-ID: <CANDhNCp6TPcfzJLCU-o-xScfoeCqMzYjVhGyDa23YjAqiMnA0w@mail.gmail.com>
Date: Mon, 13 Oct 2025 19:43:12 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>,
Suleiman Souhlal <suleiman@...gle.com>, kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
kernel-team@...roid.com
Subject: Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide
necessary tri-state for proxy return-migration
On Thu, Oct 9, 2025 at 4:43 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Wed, Oct 08, 2025 at 05:07:26PM -0700, John Stultz wrote:
>
> > > I'm thinking that suggestion didn't work out somehow, but I'm still not
> > > sure I understand why.
> >
> > So the main issue is about where we end up setting the task to
> > BO_WAKING (via set_blocked_on_waking()). This is done in
> > __mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound().
> > And in those cases, we are already holding the mutex->wait_lock, and
> > sometimes a task's blocked_lock, without the rq lock. So we can't
> > just grab the rq lock out of order, and we probably shouldn't drop and
> > try to reacquire the blocked_lock and wait_lock there.
>
> Oh bugger. In my head the scheduler locks still nest inside wait_lock,
> but we've flipped that such that schedule() / find_proxy_task() can take
> it inside rq->lock.
>
> Yes that does complicate things.
>
> So suppose we have this ww_mutex cycle thing:
>
> ,-+-* Mutex-1 <-.
> Task-A ---' | | ,-- Task-B
> `-> Mutex-2 *-+-'
>
> Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
> where Task-B holds Mutex-2 and tries to acquire Mutex-1.
>
> Then the blocked_on->owner chain will go in circles.
>
> Task-A -> Mutex-2
> ^ |
> | v
> Mutex-1 <- Task-B
>
> We need two things:
>
> - find_proxy_task() to stop iterating the circle;
>
> - the woken task to 'unblock' and run, such that it can back-off and
> re-try the transaction.
>
>
> Now, the current code does:
>
> __clear_task_blocked_on();
> wake_q_add();
>
> And surely clearing ->blocked_on is sufficient to break the cycle.
>
> Suppose it is Task-B that is made to back-off, then we have:
>
> Task-A -> Mutex-2 -> Task-B (no further blocked_on)
>
> and it would attempt to run Task-B. Or worse, it could directly pick
> Task-B and run it, without ever getting into find_proxy_task().
>
> Now, here is a problem because Task-B might not be runnable on the CPU
> it is currently on; and because !task_is_blocked() we don't get into the
> proxy paths, so nobody is going to fix this up.
>
> Ideally we would have dequeued Task-B alongside of clearing
> ->blocked_on, but alas, lock inversion spoils things.
Right. Thus my adding of the blocked_on_state to try to gate the task
from running until we evaluate it for return migration.
> > Though, one approach that I just thought of would be to have a special
> > wake_up_q call, which would handle dequeuing the blocked_on tasks on
> > the wake_q before doing the wakeup? I can give that a try.
>
> I think this is racier than you considered. CPU1 could be inside
> schedule() trying to pick Task-B while CPU2 does that wound/die thing.
> No spurious wakeup required.
Yeah. I took a stab at it, but couldn't manage to rework
things without preserving the BO_WAKING guard.
Then trying to do the dequeue in the wake_up_q() really isn't that far
away from just doing it in ttwu() a little deeper in the call stack,
as we still have to take task_rq_lock() to call block_task().
> Anyway, since the actual value of ->blocked_on doesn't matter in this
> case (we really want it to be NULL, but can't because we need someone to
> go back migrate the thing), why not simply use something like:
>
> #define PROXY_STOP ((struct mutex *)(-1L))
>
> __set_task_blocked_on(task, PROXY_STOP);
>
> Then, have find_proxy_task() fix it up?
Ok, so this sounds like it sort of matches the BO_WAKING state I
currently have (replacing it with PROXY_STOP). Not
much of a logic change, but it would indeed save a bit of space.
I'll take a stab at it.
> Random thoughts:
>
> - we should probably have something like:
>
> next = pick_next_task();
> rq_set_donor(next)
> if (unlikely(task_is_blocked())) {
> ...
> }
> + WARN_ON_ONCE(next->__state);
>
> at all times the task we end up picking should be in RUNNABLE state.
>
> - similarly, we should have ttwu() check ->blocked_on is NULL ||
> PROXY_STOP, waking a task that still has a blocked_on relation can't
> be right -- ooh, dang race conditions :/ perhaps DEBUG_MUTEX and
> serialize on wait_lock.
>
> - I'm conflicted on having TTWU fix up PROXY_STOP; strictly not required
> I think, but might improve performance -- if so, include numbers in
> patch that adds it -- which should be a separate patch from the one
> that adds PROXY_STOP.
Ok, I'll work to split that logic out. The nice thing in ttwu is we
already end up taking the rq lock in ttwu_runnable() when we do the
dequeue, so yeah, I expect it would help performance.
> - since find_proxy_task() can do a lock-break, it should probably
> re-try the pick if, at the end, a higher runqueue is modified than
> the task we ended up with.
So, I think find_proxy_task() will always pick again if it releases
the rq lock. So I'm not sure I'm quite following this bit. Could you
clarify?
> Also see this thread:
>
> https://lkml.kernel.org/r/20251006105453.522934521@infradead.org
>
> eg. something like:
>
> rq->queue_mask = 0;
> // code with rq-lock-break
> if (rq_modified_above(rq, next->sched_class))
> return NULL;
>
>
> I'm still confused on BO_RUNNABLE -- you set that around
> optimistic-spin, probably because you want to retain the ->blocked_on
> relation, but also you have to run that thing to make progress. There
> are a few other sites that use it, but those are more confusing still.
Mostly I liked managing the blocked_on_state separately from the
blocked_on pointer, as I found it simplified my thinking on the mutex
lock side for cases where we wake up and then loop again. But let me
take a pass at reworking it a bit like you suggest to see how it goes.
> Please try and clarify this.
Will try to add more comments to explain.
> Anyway, if that is indeed it, you could do this by (ab)using the LSB of
> the ->blocked_on pointer I suppose (you could make PROXY_STOP -2).
One complication with using the LSB of the pointer: Suleiman was
thinking about stealing those bits to extend the blocked_on pointer for
use with other lock types (rwsem in his case).
Currently he's got it in a structure with an enum:
https://github.com/johnstultz-work/linux-dev/commit/e61b487d240782302199f6dc1d99851c3449b547
We talked a little about potentially squishing that together, but
it sort of depends on how many locking primitives we end up using
proxy-exec with.
As always, thanks again for all the feedback, I really appreciate it!
-john