Message-ID: <20250110115720.GA17405@noisy.programming.kicks-ass.net>
Date: Fri, 10 Jan 2025 12:57:20 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Doug Smythies <dsmythies@...us.net>
Cc: linux-kernel@...r.kernel.org, vincent.guittot@...aro.org,
	'Ingo Molnar' <mingo@...nel.org>, wuyun.abel@...edance.com
Subject: Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF

On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
> Hi Peter,
> 
> Thanks for all your hard work on this.
> 
> On 2025.01.09 03:00 Peter Zijlstra wrote:
> 
> ...
> 
> > This made me have a very hard look at reweight_entity(), and
> > specifically the ->on_rq case, which is more prominent with
> > DELAY_DEQUEUE.
> >
> > And indeed, it is all sorts of broken. While the computation of the new
> > lag is correct, the computation for the new vruntime, using the new lag
> > is broken for it does not consider the logic set out in place_entity().
> >
> > With the below patch, I now see things like:
> >    
> >    migration/12-55      [012] d..3.   309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> >                               { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475
> } ->
> >                               { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203
> }
> >    migration/14-62      [014] d..3.   309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> >                               { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
> 6316614641111 } ->
> >                               { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650
> }
> > 
> > Which isn't perfect yet, but much closer.
> 
> Agreed.
> I tested the patch. Attached is a repeat of a graph I had sent before, with a different y-axis scale and the old data deleted.
> It still compares against the "b12" kernel (the last good one in the kernel bisection).
> It was a 2 hour and 31 minute test, and the maximum CPU migration time was 24 milliseconds,
> versus 6 seconds without the patch.

Progress!

> I left things running for many hours and will let it continue overnight.
> There seems to have been an issue at one spot in time:

Right, so by happy accident I also left mine running overnight, and I
think I caught one of those weird spikes. It took a bit of staring to
figure out what went wrong this time, but what I now think sets off the
chain of fail is a combination of DELAY_DEQUEUE and the cgroup reweight
code.

Specifically, when a cgroup's CPU queue becomes empty,
calc_group_shares() will drop its weight to the floor.

Now, normally, when a queue goes empty, it gets dequeued from its
parent and its weight is immaterial. However, with DELAY_DEQUEUE the
thing will stick around for a while -- at a very low weight.
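A toy model of that floor (heavily simplified from calc_group_shares(); the shares value of 1024 and the per-CPU load numbers are illustrative, not from the trace):

```python
MIN_SHARES = 2  # kernel floor for a group entity's weight, post scale_load_down()

def calc_group_shares(tg_shares, cfs_rq_load, tg_load):
    """Simplified sketch: a group's shares are distributed across CPUs
    in proportion to each CPU queue's share of the group's total load."""
    tg_weight = max(tg_load, cfs_rq_load)  # guard against a stale group sum
    shares = tg_shares * cfs_rq_load
    if tg_weight:
        shares //= tg_weight
    return max(min(shares, tg_shares), MIN_SHARES)

# Group with shares 1024, load spread evenly over 16 busy CPUs:
print(calc_group_shares(1024, 64, 1024))  # busy queue: 64

# Same group, but this CPU's queue just went empty. Normally it would be
# dequeued; with DELAY_DEQUEUE it lingers -- at the floor weight:
print(calc_group_shares(1024, 0, 1024))   # empty queue: 2
```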

What makes this especially troublesome is that even for an active cgroup
(like the 'yes' group) the per-CPU weight will be relatively small
(roughly 1/nr_cpus of the group's shares). Worse, the avg_vruntime code
uses scale_load_down() weights because u64 just isn't all that big :/

(if only all 64bit machines could do 128bit divisions as cheaply as x86_64)

This means that on a 16 CPU machine, the weight of a 'normal' all-busy
queue will be 64, while the weight of an empty queue will be 2. At that
ratio, the effect of the ginormous lag on the avg_vruntime is
significant, pushing it wildly off balance and affecting the placement
of new tasks.
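To see the size of the skew, here is a back-of-the-envelope model of the load-weighted average (the lag of ~265 ms is taken from the reweight_entity trace above; the 500 ns lags for the busy queues are illustrative):

```python
def avg_vruntime(entries):
    """Weighted average of vruntime keys (key = vruntime - min_vruntime),
    mimicking the shape of the kernel's avg_vruntime() computation."""
    total_weight = sum(w for w, _ in entries)
    return sum(w * v for w, v in entries) / total_weight

# 15 busy per-CPU group queues: weight 64, lags of a few hundred ns.
busy = [(64, 500)] * 15

# Plus one delay-dequeued empty queue: weight collapsed to 2, but its
# lag has blown up to ~265 ms (cf. the -264924722 in the trace).
skewed = busy + [(2, 264_924_722)]

print(avg_vruntime(busy))            # 500 ns: well balanced
print(round(avg_vruntime(skewed)))   # ~551,278 ns: one tiny-weight entity
                                     # shifts the average by ~0.55 ms
```

Even at weight 2, the delayed entity's contribution dwarfs everything else, which is why new-task placement goes sideways.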

Also, this violates the spirit of DELAY_DEQUEUE, which wants the entity
to continue competing as it was.

As such, we should not adjust the weight of an empty queue.

I've started a new run; the first 15 minutes of runtime show nothing
interesting, so I'll have to let it run for a while again.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..93644b3983d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long shares;
 
-	if (!gcfs_rq)
+	/*
+	 * When a group becomes empty, preserve its weight. This matters for
+	 * DELAY_DEQUEUE.
+	 */
+	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
 	if (throttled_hierarchy(gcfs_rq))
