Message-ID: <002f01db631d$d265a600$7730f200$@telus.net>
Date: Thu, 9 Jan 2025 21:09:26 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Peter Zijlstra'" <peterz@...radead.org>
Cc: <linux-kernel@...r.kernel.org>,
<vincent.guittot@...aro.org>,
"'Ingo Molnar'" <mingo@...nel.org>,
<wuyun.abel@...edance.com>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
Hi Peter,
Thanks for all your hard work on this.
On 2025.01.09 03:00 Peter Zijlstra wrote:
...
> This made me have a very hard look at reweight_entity(), and
> specifically the ->on_rq case, which is more prominent with
> DELAY_DEQUEUE.
>
> And indeed, it is all sorts of broken. While the computation of the new
> lag is correct, the computation for the new vruntime, using the new lag
> is broken for it does not consider the logic set out in place_entity().
>
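[ Aside, mostly for my own understanding: if I'm reading the big comment
  above place_entity() correctly, the point is that lag has to be rescaled
  when an entity is (re)inserted into the weighted average, roughly:

	vlag_i  = V - v_i                        (lag in vruntime units)
	V'      = V - w_i * vlag_i / (W + w_i)   (V after inserting i)
	vlag'_i = V' - v_i = vlag_i * W / (W + w_i)

  so to end up with the desired lag after insertion, place_entity() scales
  the stored lag by (W + w_i) / W first, and the on_rq reweight path
  presumably needs the equivalent treatment for the new weight, which I take
  to be what the patch addresses. Treat this as my sketch of the intent, not
  the exact code. ]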
> With the below patch, I now see things like:
>
> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
>
> Which isn't perfect yet, but much closer.
Agreed.
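As a sanity check on how I read those trace lines (assuming the value in
parentheses is avg_vruntime - vruntime, i.e. the lag in vruntime units),
the numbers are self-consistent:

	4860513347366 - 4860513347908 = -542
	4860528915984 - 4860793840706 = -264924722
	4874472992283 - 4939833828823 = -65360836540
	4874217684324 - 4874217688559 = -4235

so, in the second pair, the reweight from weight 2 back up to 967149 brings
the huge negative lag back down to a few thousand.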
I tested the patch. Attached is a repeat of a graph I had sent before, with a different y-axis scale and the old data deleted.
It still compares to the "b12" kernel (the last good one in the kernel bisection).
The test ran for 2 hours and 31 minutes, and the maximum CPU migration time was 24 milliseconds,
versus 6 seconds without the patch.
I left things running for many hours and will let it continue overnight.
There seems to have been an issue at one spot in time:
  usec  Time_Of_Day_Seconds  CPU  Busy%    IRQ
488994    1736476550.732222    -  99.76  12889
488520    1736476550.732222   11  99.76   1012
960999    1736476552.694222    -  99.76  17922
960587    1736476552.694222   11  99.76   1493
914999    1736476554.610222    -  99.76  23579
914597    1736476554.610222   11  99.76   1962
809999    1736476556.421222    -  99.76  23134
809598    1736476556.421222   11  99.76   1917
770998    1736476558.193221    -  99.76  21757
770603    1736476558.193221   11  99.76   1811
726999    1736476559.921222    -  99.76  21294
726600    1736476559.921222   11  99.76   1772
686998    1736476561.609221    -  99.76  20801
686600    1736476561.609221   11  99.76   1731
650998    1736476563.261221    -  99.76  20280
650601    1736476563.261221   11  99.76   1688
610998    1736476564.873221    -  99.76  19857
610606    1736476564.873221   11  99.76   1653
I had one of these the other day also, but they were all 6 seconds.
It's like a burst of problematic data. I have the data somewhere,
and can try to find it tomorrow.
>
> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
...
[Attachment: "turbostat-sampling-issue-fixed-seconds.png" (image/png, 62449 bytes)]