[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240128235005.txztdbdq2obyi4n6@airbuntu>
Date: Sun, 28 Jan 2024 23:50:05 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
linux-kernel@...r.kernel.org,
Pierre Gondois <Pierre.Gondois@....com>
Subject: Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when
updating misfit
On 01/26/24 15:08, Vincent Guittot wrote:
> > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > balance_interval and this reproducer which is a not the same as the original
> > one and it might be exposing another problem and I didn't think twice about it.
>
> I checked the behavior more deeply and I confirm that I don't see
> improvement for the use case described above. I would say that it's
> even worse as I can see some runs where the task stays on little
> whereas a big core has been added in the affinity. Having in mind that
> my system is pretty idle which means that there is almost no other
> reason to trigger an ilb than the misfit task, the change in
> check_misfit_status() is probably the reason for never kicking an ilb
> for such case
It seems I reproduced another problem while trying to reproduce the original
issue, eh.
I did dig more and from what I see the issue is that the rd->overload is not
being set correctly. Which I believe what causes the delays (see attached
picture how rd.overloaded is 0 with some spikes). Only when CPU7
newidle_balance() coincided with rd->overload being 1 that the migration
happens. With the below hack I can see that rd->overload is 1 all the time
(even after the move as we still trigger a misfit on the big CPU). With my
patch only rd->overload is set to 1 (because of this task) only for a short
period after we change affinity.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df348aa55d3c..86069fe527f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
}
- if (local_group)
- continue;
+ /* if (local_group) */
+ /* continue; */
if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
/* Check for a misfit task on the cpu */
I am not sure what the right fix is, but it seems this condition is required
for the 2nd leg of this if condition when we compare with load? I don't think
we should skip the misfit check.
Thanks
--
Qais Yousef
Download attachment "misfit_overloaded.png" of type "image/png" (149757 bytes)
Powered by blists - more mailing lists