Message-ID: <044fae5a-20a2-4b65-83c4-2ba985543232@amd.com>
Date: Mon, 22 Sep 2025 09:44:39 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: John Stultz <jstultz@...gle.com>
CC: LKML <linux-kernel@...r.kernel.org>, Joel Fernandes
<joelagnelf@...dia.com>, Qais Yousef <qyousef@...alina.io>, Ingo Molnar
<mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri Lelli
<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider
<vschneid@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman
<mgorman@...e.de>, Will Deacon <will@...nel.org>, Waiman Long
<longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, "Paul E. McKenney"
<paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>, Xuewen Yan
<xuewen.yan94@...il.com>, Thomas Gleixner <tglx@...utronix.de>, "Daniel
Lezcano" <daniel.lezcano@...aro.org>, Suleiman Souhlal <suleiman@...gle.com>,
kuyo chang <kuyo.chang@...iatek.com>, hupu <hupu.gm@...il.com>,
<kernel-team@...roid.com>
Subject: Re: [RESEND][PATCH v21 4/6] sched: Handle blocked-waiter migration
(and return migration)
Hello John,
On 9/20/2025 8:08 AM, John Stultz wrote:
>>> + put_prev_task(rq, rq->donor);
>>> + rq_set_donor(rq, rq->idle);
>>> + set_next_task(rq, rq->idle);
>>> +
>>> + WARN_ON(p == rq->curr);
>>> +
>>> + deactivate_task(rq, p, 0);
>>> + proxy_set_task_cpu(p, target_cpu);
>>> +
>>> + zap_balance_callbacks(rq);
>>
>> Is this zap necessary? Given we return NULL from find_proxy_task() for
>> the migrate case, __schedule() would zap the callbacks soon. I don't
>> see any WARN_ON() for the balance callbacks in the unpin + repin path,
>> so this might not be necessary, or am I mistaken?
>
> So unfortunately the issue is that when we're letting go of the
> runqueue, other CPUs can jump in via sched_balance_domains() ->
> sched_balance_rq() -> rq_lock_irqsave() -> rq_pin_lock(), which will
> trip the WARN_ON_ONCE(rq->balance_callback && rq->balance_callback !=
> &balance_push_callback); check if we leave callbacks queued.
Ah! That makes sense. I overlooked the fact that *other CPUs* can take
the rq lock once it is dropped.
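Just to write down my understanding of the interleaving that would trip
the warning if the zap only happened after re-taking the lock (a rough
sketch pieced together from your description; the <...> steps are my
paraphrase of the migrate path, not the actual code):

    CPU0 (proxy migration)              CPU1 (periodic balancing)
    ----------------------              -------------------------
    __schedule()
      find_proxy_task()
        deactivate_task(rq, p, 0);
        proxy_set_task_cpu(p, target_cpu);
        zap_balance_callbacks(rq);
        <drop rq lock for the migration>
                                        sched_balance_domains()
                                          sched_balance_rq()
                                            rq_lock_irqsave()
                                              rq_pin_lock()
                                                WARN_ON_ONCE(rq->balance_callback &&
                                                             rq->balance_callback !=
                                                             &balance_push_callback);
        <re-take rq lock, return NULL>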
>
> So we need to zap the callbacks before we drop the rq lock. I'll add
> a comment to make that clear.
Thank you!
>
> And yeah, that means we do call zap_balance_callbacks() again after we
> re-take the lock and return NULL.
>
> I guess we could remove the zap_balance_callbacks() call in
> __schedule() and instead call it in every case where find_proxy_task()
> returns NULL? Or we call it twice in the migration paths, as it should
> just have one entry the second time.
I think a comment would be good enough with the current flow. I'll let
you decide which is the best option based on your understanding of the
complete flow (I'm still slowly wrapping my head around all this :)
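For reference, my mental model of where the two calls end up in the
current flow (a rough sketch with the call sites paraphrased rather
than quoted, so please correct me if I've misplaced anything):

    __schedule()
      next = find_proxy_task(rq, ...);
        /* migrate case: */
          zap_balance_callbacks(rq);    /* 1st: before dropping rq lock */
          <drop rq lock, migrate, re-take rq lock>
          return NULL;
      if (!next) {
          zap_balance_callbacks(rq);    /* 2nd: only sees whatever got
                                         * queued while the lock was
                                         * dropped */
          /* ... and retry the pick */
      }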
[..snip..]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index b173a059315c2..cc531eb939831 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8781,7 +8781,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>> se = &p->se;
>>>
>>> #ifdef CONFIG_FAIR_GROUP_SCHED
>>> - if (prev->sched_class != &fair_sched_class)
>>> + if (prev->sched_class != &fair_sched_class ||
>>> + rq->curr != rq->donor)
>>
>> Why is this special handling required?
>
> Hrmm.. that's a good question, as I haven't thought about that in a
> while, and I'm unfortunately forgetful. My instinct is that it's a
> concern that checking prev's sched class is fair assumes the prev task
> was selected by the fair scheduler's pick_next_task last time around.
> But since we may have picked a blocked RT task as donor if we are
> proxying, we can't assume prev was previously chosen by
> pick_next_task_fair(). So we skip the optimization just to be careful.
My interpretation was - since pick_next_task_fair() selected the
scheduling context, it shouldn't matter what the execution context was.
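FWIW, this is how I was picturing it in context (sketch only; the
comment wording is mine, and "goto simple" is the existing slow-path
fallback right after this check in pick_next_task_fair()):

#ifdef CONFIG_FAIR_GROUP_SCHED
	/*
	 * prev->sched_class == &fair_sched_class normally implies the
	 * last pick went through pick_next_task_fair(), but while
	 * proxying (rq->curr != rq->donor) the donor may have been a
	 * blocked task from another class, so the fast path's
	 * assumption about prev may not hold, hence the extra
	 * bail-out to the slow path.
	 */
	if (prev->sched_class != &fair_sched_class ||
	    rq->curr != rq->donor)
		goto simple;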
>
> But this may be overly cautious and looking it over I don't
> immediately see it as necessary. So I'll drop it and do some thorough
> testing without it. If I do find an issue I'll re-add it with a much
> better comment so I don't forget again.
Thank you! I too will do some testing and let you know if I see
something go awry.
--
Thanks and Regards,
Prateek