Message-ID: <CANDhNCphG8LzYEM321fXpg4xCXL8kqfKbtttV-HN+e1tZCGahQ@mail.gmail.com>
Date: Fri, 19 Sep 2025 19:38:20 -0700
From: John Stultz <jstultz@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>,
Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, Thomas Gleixner <tglx@...utronix.de>,
Daniel Lezcano <daniel.lezcano@...aro.org>, Suleiman Souhlal <suleiman@...gle.com>,
kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, kernel-team@...roid.com
Subject: Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration
(and return migration)
On Mon, Sep 15, 2025 at 3:07 AM 'K Prateek Nayak' via kernel-team
<kernel-team@...roid.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > + /* XXX - Added to address problems with changed dl_server semantics - double check */
> > + __put_prev_set_next_dl_server(rq, rq->donor, rq->curr);
>
> Given we are tagging the rq->dl_server to the donor's context, should we
> do:
>
> __put_prev_set_next_dl_server(rq, rq->donor, rq->idle);
>
> ... since we are setting rq->idle as next task and the donor?
>
> On a side note, this can perhaps be simplified as:
>
> put_prev_set_next_task(rq, rq->donor, rq->idle);
> rq_set_donor(rq, rq->idle);
>
> put_prev_set_next_task() will take care of the dl_server handoff too.
Nice! That is simpler. I think I can also simplify those two lines to
proxy_resched_idle() as well.
And looking at it, the big comment above that section needs an update as
well, as it's out of date.
> > + put_prev_task(rq, rq->donor);
> > + rq_set_donor(rq, rq->idle);
> > + set_next_task(rq, rq->idle);
> > +
> > + WARN_ON(p == rq->curr);
> > +
> > + deactivate_task(rq, p, 0);
> > + proxy_set_task_cpu(p, target_cpu);
> > +
> > + zap_balance_callbacks(rq);
>
> Is this zap necessary? Given we return NULL from find_proxy_task() for
> migrate case, __schedule() would zap the callback soon. I don't see
> any WARN_ON() for the balance callbacks in the unpin + repin path so
> this might not be necessary or am I mistaken?
So unfortunately the issue is that when we're letting go of the
runqueue, other CPUs can jump in via sched_balance_domains() ->
sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock(), which will
trip the
WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
&balance_push_callback) check if we leave callbacks pending.
So we need to zap the callbacks before we drop the rq lock. I'll add
a comment to make that clear.
And yeah, that means we do call zap_balance_callbacks() again after we
re-take the lock and return NULL.
I guess we could remove the zap_balance_callbacks() call in
__schedule() and instead call it in every case where find_proxy_task()
returns NULL? Or we call it twice in the migration paths, as it should
have only one entry the second time.
> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p)
> > +{
> > + lockdep_assert_rq_held(rq);
> > +
> > + put_prev_task(rq, rq->donor);
> > + rq_set_donor(rq, rq->idle);
> > + set_next_task(rq, rq->idle);
>
> Similar set of comments as above.
Ack.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c2..cc531eb939831 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > se = &p->se;
> >
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > - if (prev->sched_class != &fair_sched_class)
> > + if (prev->sched_class != &fair_sched_class ||
> > + rq->curr != rq->donor)
>
> Why is this special handling required?
Hrmm.. that's a good question, as I haven't thought about that in
a while, and I'm unfortunately forgetful. My instinct is that it's a
concern that checking prev's sched class is fair assumes the prev
task was selected by the fair scheduler's pick_next_task last time
around. But since we may have picked a blocked RT task as donor, if
we are proxying we can't assume prev was previously chosen by
pick_next_task_fair(). So we skip the optimization just to be careful.
But this may be overly cautious, and looking it over I don't
immediately see it as necessary. So I'll drop it and do some thorough
testing without it. If I do find an issue, I'll re-add it with a much
better comment so I don't forget again. I really do appreciate you
checking me here!
thanks
-john