Message-ID: <CANDhNCqH4rsnME4gVd2thSTnvSs3WvHXc=zWMPZcCEBF=vzxkg@mail.gmail.com>
Date: Wed, 15 Oct 2025 17:15:59 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>, 
	Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, 
	Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>, 
	Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, 
	Suleiman Souhlal <suleiman@...gle.com>, kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, 
	kernel-team@...roid.com
Subject: Re: [PATCH v22 4/6] sched: Handle blocked-waiter migration (and
 return migration)

On Wed, Oct 8, 2025 at 6:32 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Fri, Sep 26, 2025 at 03:29:12AM +0000, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7bba05c07a79d..d063d2c9bd5aa 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3157,6 +3157,14 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
> >
> >       __do_set_cpus_allowed(p, ctx);
> >
> > +     /*
> > +      * It might be that the p->wake_cpu is no longer
> > +      * allowed, so set it to the dest_cpu so return
> > +      * migration doesn't send it to an invalid cpu
> > +      */
>
> This comment isn't quite right; ->wake_cpu is a mere preference, it does
> not have correctness concerns. That is, it is okay for ->wake_cpu to not
> be inside cpus_allowed.

Oh! This is actually left over from earlier revisions, where return
migration would migrate specifically to the wake_cpu. I thought it was
still important, but looking again, since return migration now does a
block_task()/wake_up_process(), I think we can drop this whole chunk.
Thanks for catching that!

> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> > +{
> > +     unsigned int wake_cpu;
> > +
> > +     /*
> > +      * Since we are enqueuing a blocked task on a cpu it may
> > +      * not be able to run on, preserve wake_cpu when we
> > +      * __set_task_cpu so we can return the task to where it
> > +      * was previously runnable.
> > +      */
> > +     wake_cpu = p->wake_cpu;
> > +     __set_task_cpu(p, cpu);
> > +     p->wake_cpu = wake_cpu;
>
> Humm, perhaps add an argument to __set_task_cpu() to not set ->wake_cpu
> instead?

Hrm. I can rework it that way, but I always feel that bool arguments
make code less readable, since it's not clear at the call site what
true/false means. Making it an enum or defined-flags argument helps a
bit with that, but it would still have much more impact on the source
than this small helper, which at the end of the day is only called from
the proxy-related logic in two places.

> > @@ -4215,8 +4294,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >        */
> >       scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
> >               smp_mb__after_spinlock();
> > -             if (!ttwu_state_match(p, state, &success))
> > -                     break;
> > +             if (!ttwu_state_match(p, state, &success)) {
> > +                     /*
> > +                      * If we're already TASK_RUNNING, and BO_WAKING
> > +                      * continue on to ttwu_runnable check to force
> > +                      * proxy_needs_return evaluation
> > +                      */
> > +                     if (!proxy_task_runnable_but_waking(p))
> > +                             break;
> > +             }
> >
> >               trace_sched_waking(p);
>
> Oh gawd :-( why !?!?
>
> So AFAICT this makes ttwu() do dequeue when it finds WAKING, and then it
> falls through to do the normal wakeup. So why can't we do dequeue to
> begin with -- instead of setting WAKING in the first place?

So hopefully the earlier discussion cleared that one up, but let me
know if you still object to this or have questions.

thanks
-john
