Message-ID: <20251201125851.272237-1-sieberf@amazon.com>
Date: Mon, 1 Dec 2025 14:58:49 +0200
From: Fernand Sieber <sieberf@...zon.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
CC: Peter Zijlstra <peterz@...radead.org>, <mingo@...hat.com>,
<linux-kernel@...r.kernel.org>, <juri.lelli@...hat.com>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <kprateek.nayak@....com>,
<dwmw@...zon.co.uk>, <jschoenh@...zon.de>, <liuyuxua@...zon.com>,
<abusse@...zon.com>, <gmazz@...zon.com>, <rkagan@...zon.com>
Subject: Re: [PATCH] sched/fair: Force idle aware load balancing
On Fri, 28 Nov 2025 at 14:50, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> On Fri, 28 Nov 2025 at 12:14, Peter Zijlstra <peterz@...radead.org> wrote:
> >
> > On Thu, Nov 27, 2025 at 10:27:17PM +0200, Fernand Sieber wrote:
> >
> > > @@ -11123,7 +11136,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > return;
> > > }
> > >
> > > - if (busiest->group_type == group_smt_balance) {
> > > + if (busiest->group_type == group_smt_balance ||
> > > + busiest->forceidle_weight) {
> >
> > Should we not instead make it so that we select group_smt_balance in
> > this case?
>
> Why do we need this test? We have already removed forced idle CPUs
> from the statistics?
>
> I suppose Fernand wants to cover cases where there is 1 task per CPU,
> so we are balanced, but one CPU is forced idle and we want to force
> migrating a task to then try to move back another one? In this case
> it should be detected early and become group_imbalanced type.
> Also, what happens if we could migrate more than one task?
I've removed this override in v2; it doesn't seem to make much of a
difference after doing more benchmarking.
When I traced LB inefficiencies, I noticed that in some situations a
large imbalance (overloaded vs spare capacity) was detected, but
remediation was delayed. The intention of the override was to "nudge"
the LB into taking remediation action immediately, regardless of the
load to move, with the idea that it's better to migrate anything now
than to waste capacity in forced idle for longer.
This override was probably not the right tool for that. If I get a
chance I'll try to dive deeper and provide more details.
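For context, the effect of the override was roughly to make a
forced-idle busiest group take the same immediate action as
group_smt_balance in calculate_imbalance(), i.e. migrate one task right
away instead of computing a load-based imbalance (sketch based on the
mainline group_smt_balance branch plus the hunk quoted above):

	if (busiest->group_type == group_smt_balance ||
	    busiest->forceidle_weight) {
		/* Move one task now; don't wait on load calculations */
		env->migration_type = migrate_task;
		env->imbalance = 1;
		return;
	}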
One different thing I noticed is that the task_hot() check has a cookie
check which is more or less bound to fail on a busy large system running
lots of different cookied tasks (e.g. a hypervisor on large servers with
cookied, time-shared vCPUs), because there's almost zero chance that the
target CPU is randomly running the same cookie as the migrating task.
This delays migrations unnecessarily if the run queues are short and
there are no valid spare candidates.
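For reference, the check in question in mainline task_hot() looks
roughly like this:

	/*
	 * Don't migrate task if the task's cookie does not match
	 * with the destination CPU's core cookie.
	 */
	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
		return 1;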
I need to think more about that one, but if you have any ideas, let me
know. Maybe instead of having this check, the list of migration
candidates should be sorted to prioritize tasks with a matching cookie
first, if any, similar to what is proposed in the cache aware scheduling
RFC?
https://lwn.net/ml/all/26e7bfa88163e13ba1ebefbb54ecf5f42d84f884.1760206683.git.tim.c.chen@linux.intel.com/
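Just to illustrate the shape of the idea (hypothetical sketch only;
pick_migration_candidate() is a made-up helper, and locking plus the
detach loop's load/util budgeting are ignored), candidate selection
could do a cookie-aware pass before falling back:

/*
 * Hypothetical: prefer candidates whose core scheduling cookie matches
 * the destination CPU, so the task_hot() cookie check doesn't veto
 * every candidate when most tasks carry distinct cookies.
 */
static struct task_struct *pick_migration_candidate(struct lb_env *env)
{
	struct task_struct *p, *fallback = NULL;

	list_for_each_entry(p, &env->src_rq->cfs_tasks, se.group_node) {
		/* Tasks matching the destination core cookie go first */
		if (sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
			return p;
		if (!fallback)
			fallback = p;
	}

	/* No cookie match anywhere; fall back to the first candidate */
	return fallback;
}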
Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07