Message-ID: <00c201db6547$b43b9a50$1cb2cef0$@telus.net>
Date: Sun, 12 Jan 2025 15:14:17 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Peter Zijlstra'" <peterz@...radead.org>
Cc: <linux-kernel@...r.kernel.org>,
<vincent.guittot@...aro.org>,
"'Ingo Molnar'" <mingo@...nel.org>,
<wuyun.abel@...edance.com>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
On 2025.01.10 03:57 Peter Zijlstra wrote:
> On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
>> On 2025.01.09 03:00 Peter Zijlstra wrote:
>>
>> ...
>>
>>> This made me have a very hard look at reweight_entity(), and
>>> specifically the ->on_rq case, which is more prominent with
>>> DELAY_DEQUEUE.
>>>
>>> And indeed, it is all sorts of broken. While the computation of the new
>>> lag is correct, the computation for the new vruntime, using the new lag
>>> is broken for it does not consider the logic set out in place_entity().
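(As I understand it, place_entity() deliberately inflates the lag it is asked to
preserve by (W + w_i)/W before placing the entity at avg_vruntime minus that
inflated lag, so that once the entity's own weight is added back into the average
the entity still ends up with the intended lag; a vruntime computed without that
correction lands somewhere else.)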
>>>
>>> With the below patch, I now see things like:
>>>
>>> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
>>>   { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
>>>   { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
>>> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
>>>   { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
>>>   { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
>>>
>>> Which isn't perfect yet, but much closer.
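(For what it is worth, the numbers in the first trace line above appear consistent
with the lag being carried weight-scaled across the reweight: a -542 vruntime
offset at weight 977582 scales to -542 * 977582 / 2 = -264924722 at weight 2,
which is exactly the new offset shown. In other words, it is the collapse of the
weight that turns a negligible lag into an enormous vruntime offset.)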
>>
>> Agreed.
>> I tested the patch. Attached is a repeat of a graph I had sent before, with a different y-axis scale and the old data deleted.
>> It still compares to the "b12" kernel (the last good one in the kernel bisection).
>> It was a 2 hour and 31 minute test, and the maximum CPU migration time was 24 milliseconds,
>> versus 6 seconds without the patch.
>
> Progress!
>
>> I left things running for many hours and will let it continue overnight.
>> There seems to have been an issue at one spot in time:
>
> Right, so by happy accident I also left mine running overnight and I
> think I caught one of those weird spikes. It took a bit of staring to
> figure out what went wrong this time, but what I now think is the thing
> that sets off the chain of fail is a combination of DELAY_DEQUEUE and
> the cgroup reweight stuff.
>
> Specifically, when a cgroup's CPU queue becomes empty,
> calc_group_shares() will drop its weight to the floor.
>
> Now, normally, when a queue goes empty, it gets dequeued from its
> parent and its weight is immaterial. However, with DELAY_DEQUEUE the
> thing will stick around for a while -- at a very low weight.
>
> What makes this esp. troublesome is that even for an active cgroup (like
> the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus
> like). Worse, the avg_vruntime thing uses load_scale_down() because u64
> just isn't all that big :/
>
> (if only all 64bit machines could do 128bit divisions as cheaply as x86_64)
>
> This means that on a 16 CPU machine, the weight of a 'normal' all busy
> queue will be 64, and the weight of an empty queue will be 2, which means
> the effect of the ginormous lag on the avg_vruntime is fairly
> significant, pushing it wildly off balance and affecting placement of
> new tasks.
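(A very rough illustration, assuming avg_vruntime is the scaled-down-load-weighted
average of the entity vruntimes on a runqueue: take the second trace line quoted
above, where the delayed entity sits roughly 65e9 ns of vruntime away from the
average at weight 2. If the only other entity on that runqueue is a busy group at
weight 64, that one delayed entity pulls the average by roughly 2 * 65e9 / 66,
i.e. about 2 seconds worth of vruntime, which is easily enough to misplace newly
woken tasks.)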
>
> Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue
> competition as the entity was.
>
> As such, we should not adjust the weight of an empty queue.
>
> I've started a new run, and some 15 minutes of runtime show nothing
> interesting, I'll have to let it run for a while again.
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e9ca38512de..93644b3983d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> long shares;
>
> - if (!gcfs_rq)
> + /*
> + * When a group becomes empty, preserve its weight. This matters for
> + * DELAY_DEQUEUE.
> + */
> + if (!gcfs_rq || !gcfs_rq->load.weight)
> return;
>
> if (throttled_hierarchy(gcfs_rq))
I tested the above patch on top of the previous patch.
Over many hours of multiple tests, using multiple methods,
I never got a single hit for a detected CPU migration time of
10 milliseconds or more, which is good news.
The test I have been running to create some of the graphs I have been
attaching is a little different, using turbostat with different options:
turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
And with this test I do get intervals that exceed 1 second by more than 10 milliseconds.
(I referred to this observation in the previous email.)
I attach a new version of a previous graph, with the data from this "patch 2" added.
Also, below are the same results presented as histograms
(some repeated from my previous email):
First: the b12 kernel (the last good one in the kernel bisection):
Time Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1
Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)
Second: Kernel 6.13-rc6 + the first patch:
Time Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1
Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent)
Third: Kernel 6.13-rc6 + the first patch + the above patch:
Time Occurrences
1.000000, 2034
1.001000, 2108
1.002000, 2030
1.003000, 2492
1.004000, 216
1.005000, 109
1.006000, 23
1.007000, 8
1.008000, 3
1.009000, 9
1.010000, 1
1.011000, 2
1.012000, 2
1.014000, 3
1.015000, 10
1.016000, 19
1.017000, 1
1.018000, 1
Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)
For example, the line:
1.016000, 19
means that there were 19 occurrences of turbostat interval times
between 1.016 and 1.016999 seconds.
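For reference, a minimal sketch of how such a histogram can be produced
(illustrative only, and not necessarily the exact post-processing used for the
numbers above; it assumes the Time_Of_Day_Seconds column has first been
extracted to one timestamp per line on stdin):

/* histo.c - illustrative sketch: bin successive timestamp deltas at 1 ms
 * resolution and print them in the same format as the histograms above.
 * Input: one Time_Of_Day_Seconds value per line on stdin.
 */
#include <stdio.h>

#define NBINS 4000			/* covers interval times up to 4 seconds */

int main(void)
{
	static long bins[NBINS];
	double prev = -1.0, t;
	long total = 0, over10ms = 0;
	char line[256];

	while (fgets(line, sizeof(line), stdin)) {
		if (sscanf(line, "%lf", &t) != 1)
			continue;
		if (prev >= 0.0) {
			double delta = t - prev;
			int bin = (int)(delta * 1000.0);	/* 1 ms bins */

			if (bin >= 0 && bin < NBINS) {
				bins[bin]++;
				total++;
				if (delta >= 1.010)	/* 10 ms or more over the nominal 1 second */
					over10ms++;
			}
		}
		prev = t;
	}

	for (int i = 0; i < NBINS; i++)
		if (bins[i])
			printf("%.6f, %ld\n", i / 1000.0, bins[i]);

	printf("Total: %ld : Total >= 10 mSec: %ld ( %.2f percent)\n",
	       total, over10ms, total ? 100.0 * over10ms / total : 0.0);
	return 0;
}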
As mentioned earlier, I haven't been able to obtain any detailed
information as to where any extra delay is occurring, in particular
whether or not it is related to CPU migration.
I am assuming that the slight difference in the test perturbs the
system under test just enough that the longer times do not occur.
As a side note, and just for informational purposes:
The changes do have an effect on performance.
Using 40 pairs of ping-pong token-passing rings running 30 million
loops (a rough sketch of one such pair follows the results below) I get:
Where:
nodelay = NO_DELAY_DEQUEUE
delay = DELAY_DEQUEUE
noauto = "cgroup_disable=cpu noautogroup" on grub command line
auto = nothing cgroup related on the grub command line.
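(For reference, "on the grub command line" here means the options end up on the
kernel command line via GRUB; on many distributions that is done with something
like the line below in /etc/default/grub, followed by regenerating the GRUB
config and rebooting. The exact file and update command vary by distribution.)

GRUB_CMDLINE_LINUX_DEFAULT="... cgroup_disable=cpu noautogroup"   (appended to any existing options)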
All tests were done with the CPU frequency scaling governor set to performance.
(lower numbers are better)
noauto nodelay test 1 16.24 uSec/loop reference test
noauto nodelay test 2 16.22 uSec/loop -0.15%
noauto delay test 1 15.41 uSec/loop -5.16%
auto nodelay test 1 24.31 uSec/loop +49.88%
auto nodelay test 2 24.31 uSec/loop +49.24%
auto delay test 1 21.42 uSec/loop +21.42%
auto delay test 2 21.71 uSec/loop +21.71%
A graph is also attached showing the same results, but for each
of the 40 pairs. It reveals differences in relative execution times
from the first to the last ping-pong pair.
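For anyone curious what one such pair looks like, here is a rough sketch of a
single ping-pong token-passing pair over a pair of pipes (illustrative only;
it is not the actual test program used for the numbers above):

/* pingpong.c - rough sketch of a single ping-pong token-passing pair
 * (illustrative only; not the actual test program used above).
 * Parent and child bounce one byte back and forth over a pair of pipes
 * and report the average time per round trip.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define LOOPS 30000000L		/* 30 million loops, as in the test above */

int main(void)
{
	int ab[2], ba[2];	/* parent->child and child->parent pipes */
	char token = 'x';
	struct timeval t0, t1;
	long i;

	if (pipe(ab) || pipe(ba)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: wait for the token, send it straight back */
		for (i = 0; i < LOOPS; i++) {
			if (read(ab[0], &token, 1) != 1)
				break;
			write(ba[1], &token, 1);
		}
		_exit(0);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < LOOPS; i++) {
		write(ab[1], &token, 1);
		if (read(ba[0], &token, 1) != 1)
			break;
	}
	gettimeofday(&t1, NULL);
	wait(NULL);

	double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);

	printf("%.2f uSec/loop\n", usec / (double)LOOPS);
	return 0;
}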
Attachment: "turbostat-sampling-issue-fixed--2-seconds.png" (image/png, 78967 bytes)
Attachment: "40-ping-pairs-compare.png" (image/png, 54905 bytes)