Message-ID: <CANDhNCphG8LzYEM321fXpg4xCXL8kqfKbtttV-HN+e1tZCGahQ@mail.gmail.com>
Date: Fri, 19 Sep 2025 19:38:20 -0700
From: John Stultz <jstultz@...gle.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes <joelagnelf@...dia.com>,
Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Valentin Schneider <vschneid@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>,
Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, Thomas Gleixner <tglx@...utronix.de>,
Daniel Lezcano <daniel.lezcano@...aro.org>, Suleiman Souhlal <suleiman@...gle.com>,
kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>, kernel-team@...roid.com
Subject: Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration
(and return migration)
On Mon, Sep 15, 2025 at 3:07 AM 'K Prateek Nayak' via kernel-team
<kernel-team@...roid.com> wrote:
> On 9/4/2025 5:51 AM, John Stultz wrote:
> > + /* XXX - Added to address problems with changed dl_server semantics - double check */
> > + __put_prev_set_next_dl_server(rq, rq->donor, rq->curr);
>
> Given we are tagging the rq->dl_server to the donor's context, should we
> do:
>
> __put_prev_set_next_dl_server(rq, rq->donor, rq->idle);
>
> ... since we are setting rq->idle as next task and the donor?
>
> On a side note, this can perhaps be simplified as:
>
> put_prev_set_next_task(rq, rq->donor, rq->idle);
> rq_set_donor(rq, rq->idle);
>
> put_prev_set_next_task() will take care of the dl_server handoff too.
Nice! That is simpler. I think I can also simplify those two lines to
proxy_resched_idle() as well.
And looking at it, the big comment above that section needs an update as
well, as it's out of date.
> > + put_prev_task(rq, rq->donor);
> > + rq_set_donor(rq, rq->idle);
> > + set_next_task(rq, rq->idle);
> > +
> > + WARN_ON(p == rq->curr);
> > +
> > + deactivate_task(rq, p, 0);
> > + proxy_set_task_cpu(p, target_cpu);
> > +
> > + zap_balance_callbacks(rq);
>
> Is this zap necessary? Given we return NULL from find_proxy_task() for
> migrate case, __schedule() would zap the callback soon. I don't see
> any WARN_ON() for the balance callbacks in the unpin + repin path so
> this might not be necessary or am I mistaken?
So unfortunately the issue is that when we're letting go of the
runqueue, other CPUs can jump in via sched_balance_domains() ->
sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock(), which will
trip the
WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
&balance_push_callback) check if we leave callbacks pending.
So we need to zap the callbacks before we drop the rq lock. I'll add
a comment to make that clear.
And yeah, that means we do call zap_balance_callbacks() again after we
re-take the lock and return NULL.
I guess we could remove the zap_balance_callbacks() call in
__schedule() and instead call it in every case where find_proxy_task()
returns NULL? Or we call it twice in the migration paths, as it should
have only one entry the second time.
> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p)
> > +{
> > + lockdep_assert_rq_held(rq);
> > +
> > + put_prev_task(rq, rq->donor);
> > + rq_set_donor(rq, rq->idle);
> > + set_next_task(rq, rq->idle);
>
> Similar set of comments as above.
Ack.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c2..cc531eb939831 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
> > se = &p->se;
> >
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > - if (prev->sched_class != &fair_sched_class)
> > + if (prev->sched_class != &fair_sched_class ||
> > + rq->curr != rq->donor)
>
> Why is this special handling required?
Hrmm.. that's a good question, as I haven't thought about that in
a while, and I'm unfortunately forgetful. My instinct is that it's a
concern that checking prev's sched class is fair assumes the prev
task was selected by the fair scheduler's pick_next_task last time
around. But since we may have picked a blocked RT task as donor, if
we are proxying we can't assume prev was previously chosen by
pick_next_task_fair(). So we skip the optimization just to be careful.
But this may be overly cautious, and looking it over I don't
immediately see it as necessary. So I'll drop it and do some thorough
testing without it. If I do find an issue, I'll re-add it with a much
better comment so I don't forget again. I really do appreciate you
checking me here!
thanks
-john