Date: Mon, 29 Jan 2024 22:53:44 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	linux-kernel@...r.kernel.org,
	Pierre Gondois <Pierre.Gondois@....com>
Subject: Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when
 updating misfit

On 01/28/24 23:50, Qais Yousef wrote:
> On 01/26/24 15:08, Vincent Guittot wrote:
> 
> > > TBH I had a bit of confirmation bias that this is a problem based on the fix
> > > (0ae78eec8aa6) that we had in the past. So on verification I looked at
> > > balance_interval and this reproducer, which is not the same as the original
> > > one and might be exposing another problem, and I didn't think twice about it.
> > 
> > I checked the behavior more deeply and I confirm that I don't see
> > improvement for the use case described above. I would say that it's
> > even worse as I can see some runs where the task stays on little
> > whereas a big core has been added in the affinity. Having in mind that
> > my system is pretty idle which means that there is almost no other
> > reason to trigger an ilb than the misfit task, the change in
> > check_misfit_status() is probably the reason for never kicking an ilb
> > for such case
> 
> It seems I reproduced another problem while trying to reproduce the original
> issue, eh.
> 
> I did dig more and from what I see the issue is that rd->overload is not
> being set correctly, which I believe is what causes the delays (see the
> attached picture showing how rd.overloaded stays 0 with some spikes). The
> migration happens only when CPU7's newidle_balance() coincides with
> rd->overload being 1. With the below hack I can see that rd->overload is 1
> all the time (even after the move, as we still trigger a misfit on the big
> CPU). With my patch, rd->overload is set to 1 (because of this task) only
> for a short period after we change affinity.
> 
> 	diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> 	index df348aa55d3c..86069fe527f9 100644
> 	--- a/kernel/sched/fair.c
> 	+++ b/kernel/sched/fair.c
> 	@@ -9707,8 +9707,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> 				continue;
> 			}
> 
> 	-               if (local_group)
> 	-                       continue;
> 	+               /* if (local_group) */
> 	+                       /* continue; */
> 
> 			if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> 				/* Check for a misfit task on the cpu */
> 
> I am not sure what the right fix is, but it seems this condition is required
> only for the 2nd leg of the if condition, where we compare with load? I don't
> think we should skip the misfit check.

I'm still not sure I get the original intent behind skipping the local_group.
We need to set sg_status, which operates at the root domain level, to enable a
cpu to pull a misfit task.

AFAICS newidle_balance() will return without doing anything if rd->overload is
not set. So making sure we always update this flag, for both legs, is
necessary IIUC:

	diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
	index df348aa55d3c..bd2f402eac41 100644
	--- a/kernel/sched/fair.c
	+++ b/kernel/sched/fair.c
	@@ -9707,9 +9707,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
				continue;
			}

	-               if (local_group)
	-                       continue;
	-
			if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
				/* Check for a misfit task on the cpu */
				if (sgs->group_misfit_task_load < rq->misfit_task_load) {
	@@ -9719,8 +9716,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
			} else if ((env->idle != CPU_NOT_IDLE) &&
				   sched_reduced_capacity(rq, env->sd)) {
				/* Check for a task running on a CPU with reduced capacity */
	-                       if (sgs->group_misfit_task_load < load)
	+                       if (sgs->group_misfit_task_load < load) {
					sgs->group_misfit_task_load = load;
	+                               *sg_status |= SG_OVERLOAD;
	+                       }
			}
		}

I was wondering why we never pull at TICK/rebalance_domains(), where no such
check is made. But when newidle_balance() returns early, it calls
update_next_balance() to push next_balance out by balance_interval, which is
already long. So we end up delaying things further, thinking we've 'attempted'
a load balance and it wasn't necessary - when in reality we failed to see the
misfit task, and by continuing to push next_balance forward we don't allow
rebalance_domains() to see it either.
