Message-ID: <CANDhNCpSNyOkcuUZvpZK5FQhL8oxQEax2wUStdRAV_ns2z7n_A@mail.gmail.com>
Date: Wed, 8 Oct 2025 17:07:26 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, 
	Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>, 
	Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, 
	Suleiman Souhlal <suleiman@...gle.com>, kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, 
	kernel-team@...roid.com
Subject: Re: [PATCH v22 2/6] sched/locking: Add blocked_on_state to provide
 necessary tri-state for proxy return-migration

On Wed, Oct 8, 2025 at 4:26 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Sep 26, 2025 at 03:29:10AM +0000, John Stultz wrote:
> > As we add functionality to proxy execution, we may migrate a
> > donor task to a runqueue where it can't run due to cpu affinity.
> > Thus, we must be careful to ensure we return-migrate the task
> > back to a cpu in its cpumask when it becomes unblocked.
> >
> > Thus we need more than just a binary concept of the task being
> > blocked on a mutex or not.
> >
> > So add a blocked_on_state value to the task, that allows the
> > task to move through BO_RUNNING -> BO_BLOCKED -> BO_WAKING
> > and back to BO_RUNNING. This provides a guard state in
> > BO_WAKING so we can know the task is no longer blocked
> > but we don't want to run it until we have potentially
> > done return migration, back to a usable cpu.
> >
> > Signed-off-by: John Stultz <jstultz@...gle.com>
> > ---
> >  include/linux/sched.h     | 92 +++++++++++++++++++++++++++++++++------
> >  init/init_task.c          |  3 ++
> >  kernel/fork.c             |  3 ++
> >  kernel/locking/mutex.c    | 15 ++++---
> >  kernel/locking/ww_mutex.h | 20 ++++-----
> >  kernel/sched/core.c       | 45 +++++++++++++++++--
> >  kernel/sched/sched.h      |  6 ++-
> >  7 files changed, 146 insertions(+), 38 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index cb4e81d9d9b67..8245940783c77 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -813,6 +813,12 @@ struct kmap_ctrl {
> >  #endif
> >  };
> >
> > +enum blocked_on_state {
> > +     BO_RUNNABLE,
> > +     BO_BLOCKED,
> > +     BO_WAKING,
> > +};
>
> I am still struggling with all this.

My apologies. I really appreciate you taking the time to look it over!

>   RUNNABLE is !p->blocked_on
>   BLOCKED is !!p->blocked_on
>   WAKING is !!p->blocked_on but you need magical beans
>
> I'm not sure I follow the argument above, and there is a distinct lack
> of comments with this enum explaining the states (although there are
> some comments scattered across the patch itself).

That's fair. I'll try to improve the comments there.

So the blocked_on_state values don't map exactly onto blocked_on as
you listed above. The tri-state evolved out of the fact that the
blocked_on pointer alone didn't give us enough state, which led to
lots of subtle bugs, so adding the extra state helped stabilize this.
I do agree it has some duplicative aspects with task->__state, so I'd
love to flatten it down, but so far I've not found a good way.

So p->blocked_on can be considered separately, as it is managed
entirely by the __mutex_lock_common() path: it is set to the mutex
we're trying to take, and cleared when we acquire it.
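
In pseudocode, something like (a heavily simplified sketch, not the
actual patch code):

  /* __mutex_lock_common(), heavily simplified */
  current->blocked_on = lock;   /* waiting on this mutex */
  /* ... sleep until we are granted the lock ... */
  current->blocked_on = NULL;   /* we own it now, clear the pointer */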

Whereas p->blocked_on_state tells us:

BO_RUNNABLE: If the task was picked from the runqueue, it can be run
on that cpu.

BO_BLOCKED: The task can be picked, but cannot be executed; it can
only be a donor. It may be migrated to the runqueue of a cpu it is
not allowed to run on.

BO_WAKING: An intermediate "gate" state. The task was BO_BLOCKED and
we'd like it to be BO_RUNNABLE, but we have to address the fact that
it might be on a runqueue it can't run on. So this prevents the task
from being run until it has been evaluated for return migration.
Ideally ttwu will handle the return migration, but there are cases
where we do it manually in find_proxy_task() if we come across a task
in the chain in this state.
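
For the improved comments on the enum, I'm thinking of something
along these lines (just a sketch):

  enum blocked_on_state {
          BO_RUNNABLE,    /* Task is free to run. If it was picked
                           * from the runqueue, it may execute on
                           * that cpu. */
          BO_BLOCKED,     /* Task is blocked on a mutex. It may be
                           * picked only as a donor, and may sit on
                           * the runqueue of a cpu it is not allowed
                           * to run on. */
          BO_WAKING,      /* Gate state: the task is no longer
                           * blocked, but must not run until it has
                           * been evaluated for return migration back
                           * to a usable cpu. */
  };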

So, just to clarify your summary: a task can have
p->blocked_on_state == BO_RUNNABLE while p->blocked_on is still set,
since the task needs to run in order to complete __mutex_lock_common()
and clear its own blocked_on pointer. BO_BLOCKED does imply
!!p->blocked_on, and BO_WAKING implies !!p->blocked_on plus the need
to evaluate return migration before we run it.
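
Expressed as assertions (hypothetical, not something in the patch),
the invariants would be:

  /* BO_BLOCKED and BO_WAKING imply the blocked_on pointer is set */
  WARN_ON(p->blocked_on_state != BO_RUNNABLE && !p->blocked_on);

  /*
   * But BO_RUNNABLE does NOT imply !p->blocked_on: the task has to
   * run to finish __mutex_lock_common() and clear the pointer itself.
   */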

> Last time we talked about this:
>
>   https://lkml.kernel.org/r/20241216165419.GE35539@noisy.programming.kicks-ass.net
>
> I was equally confused; and suggested not having the WAKING state by
> simply dequeueing the offending task and letting ttwu() sort it out --
> since we know a wakeup will be coming our way.

So yeah, I really appreciated that suggestion. I used that dequeue
and wake approach in the "manual" return-migration path
(proxy_force_return()), and it did simplify things, but I haven't
been able to apply it everywhere.

> I'm thinking that suggesting didn't work out somehow, but I'm still not
> sure I understand why.

So the main issue is where we end up setting the task to BO_WAKING
(via set_blocked_on_waking()). That happens in
__mutex_unlock_slowpath(), __ww_mutex_die(), and __ww_mutex_wound(),
and in those paths we are already holding the mutex->wait_lock, and
sometimes a task's blocked_lock, without the rq lock. So we can't
just grab the rq lock out of order, and we probably shouldn't drop
and reacquire the blocked_lock and wait_lock there.

Though, one approach that just occurred to me would be a special
wake_up_q call that handles dequeuing the blocked_on tasks on the
wake_q before doing the wakeup. I can give that a try.
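
Roughly like this, modeled on the existing wake_up_q() (completely
untested sketch; proxy_dequeue_blocked() is a made-up name for the
dequeue step):

  void wake_up_q_return(struct wake_q_head *head)
  {
          struct wake_q_node *node = head->first;

          while (node != WAKE_Q_TAIL) {
                  struct task_struct *task;

                  task = container_of(node, struct task_struct, wake_q);
                  node = node->next;
                  task->wake_q.next = NULL;

                  /*
                   * Made-up helper: dequeue the task if it is still
                   * blocked_on, so the wakeup below goes through
                   * ttwu's normal cpu selection and return-migrates
                   * it to a usable cpu.
                   */
                  proxy_dequeue_blocked(task);

                  wake_up_process(task);
                  put_task_struct(task);
          }
  }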

Though I'm not sure that would let us drop the blocked_on_state
tri-state, since I worry we can get spurious wakeups on blocked_on
tasks outside the mutex_unlock_slowpath or ww_mutex die/wound paths,
and then we'd risk running a proxy-migrated task on a cpu outside its
affinity mask. Without proxy migration, spurious wakeups are ok since
the task just loops back into schedule(), but with proxy migration we
have to be sure we return-migrate first.

> There is this comment:
>
>
> +               /*
> +                * If a ww_mutex hits the die/wound case, it marks the task as
> +                * BO_WAKING and calls try_to_wake_up(), so that the mutex
> +                * cycle can be broken and we avoid a deadlock.
> +                *
> +                * However, if at that moment, we are here on the cpu which the
> +                * die/wounded task is enqueued, we might loop on the cycle as
> +                * BO_WAKING still causes task_is_blocked() to return true
> +                * (since we want return migration to occur before we run the
> +                * task).
> +                *
> +                * Unfortunately since we hold the rq lock, it will block
> +                * try_to_wake_up from completing and doing the return
> +                * migration.
> +                *
> +                * So when we hit a !BO_BLOCKED task briefly schedule idle
> +                * so we release the rq and let the wakeup complete.
> +                */
> +               if (p->blocked_on_state != BO_BLOCKED)
> +                       return proxy_resched_idle(rq);
>
>
> Which I presume tries to clarify things, but that only had me scratching
> my head again. Why would you need task_is_blocked() to affect return
> migration?

So task_is_blocked() returns true when p->blocked_on is set and
p->blocked_on_state != BO_RUNNABLE. That means BO_WAKING tasks are
still prevented from being selected to run until they have had a
chance to be return-migrated (because, as donors, they may be on the
runqueue of a cpu they can't actually run on).
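
i.e. roughly:

  static inline bool task_is_blocked(struct task_struct *p)
  {
          return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
  }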

The problem this comment tries to describe is that, due to
ww_mutexes, there may be a loop in the blocked_on chain, so the cpu
running find_proxy_task() might spin following that loop. The
ww_mutex logic will break the loop via ww_mutex die/wound, which sets
BO_WAKING and wakes the task up so it releases the lock. However,
that try_to_wake_up() can get stuck waiting on the rq lock that the
cpu looping in find_proxy_task() is holding. So when we find a task
that is not BO_BLOCKED, we briefly resched to idle to drop the rq
lock and let try_to_wake_up() complete.

Though I worry I've just repeated the comment here, so let me know if
this wasn't helpful in clarifying things.

thanks
-john
