Message-ID: <002f01db631d$d265a600$7730f200$@telus.net>
Date: Thu, 9 Jan 2025 21:09:26 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Peter Zijlstra'" <peterz@...radead.org>
Cc: <linux-kernel@...r.kernel.org>,
<vincent.guittot@...aro.org>,
"'Ingo Molnar'" <mingo@...nel.org>,
<wuyun.abel@...edance.com>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
Hi Peter,
Thanks for all your hard work on this.
On 2025.01.09 03:00 Peter Zijlstra wrote:
...
> This made me have a very hard look at reweight_entity(), and
> specifically the ->on_rq case, which is more prominent with
> DELAY_DEQUEUE.
>
> And indeed, it is all sorts of broken. While the computation of the new
> lag is correct, the computation for the new vruntime, using the new lag
> is broken for it does not consider the logic set out in place_entity().
>
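[ Aside, mostly for my own understanding: if I'm reading the big comment
  above place_entity() correctly, the point is that lag has to be rescaled
  when an entity is (re)inserted into the weighted average, roughly:

	vlag_i  = V - v_i                        (lag in vruntime units)
	V'      = V - w_i * vlag_i / (W + w_i)   (V after inserting i)
	vlag'_i = V' - v_i = vlag_i * W / (W + w_i)

  so to end up with the desired lag after insertion, place_entity() scales
  the stored lag by (W + w_i) / W first, and the on_rq reweight path
  presumably needs the equivalent treatment for the new weight, which I take
  to be what the patch addresses. Treat this as my sketch of the intent, not
  the exact code. ]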
> With the below patch, I now see things like:
>
> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
>
> Which isn't perfect yet, but much closer.
Agreed.
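As a sanity check on how I read those trace lines (assuming the value in
parentheses is avg_vruntime - vruntime, i.e. the lag in vruntime units),
the numbers are self-consistent:

	4860513347366 - 4860513347908 = -542
	4860528915984 - 4860793840706 = -264924722
	4874472992283 - 4939833828823 = -65360836540
	4874217684324 - 4874217688559 = -4235

so, in the second pair, the reweight from weight 2 back up to 967149 brings
the huge negative lag back down to a few thousand.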
I tested the patch. Attached is a repeat of a graph I had sent before, with a different y-axis scale and the old data deleted.
It still compares to the "b12" kernel (the last good one in the kernel bisection).
The test ran for 2 hours and 31 minutes, and the maximum CPU migration time was 24 milliseconds,
versus 6 seconds without the patch.
I left things running for many hours and will let it continue overnight.
There seems to have been an issue at one spot in time:
  usec  Time_Of_Day_Seconds  CPU  Busy%    IRQ
488994    1736476550.732222    -  99.76  12889
488520    1736476550.732222   11  99.76   1012
960999    1736476552.694222    -  99.76  17922
960587    1736476552.694222   11  99.76   1493
914999    1736476554.610222    -  99.76  23579
914597    1736476554.610222   11  99.76   1962
809999    1736476556.421222    -  99.76  23134
809598    1736476556.421222   11  99.76   1917
770998    1736476558.193221    -  99.76  21757
770603    1736476558.193221   11  99.76   1811
726999    1736476559.921222    -  99.76  21294
726600    1736476559.921222   11  99.76   1772
686998    1736476561.609221    -  99.76  20801
686600    1736476561.609221   11  99.76   1731
650998    1736476563.261221    -  99.76  20280
650601    1736476563.261221   11  99.76   1688
610998    1736476564.873221    -  99.76  19857
610606    1736476564.873221   11  99.76   1653
I had one of these the other day also, but they were all 6 seconds.
It's like a burst of problematic data. I have the data somewhere,
and can try to find it tomorrow.
>
> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
...
[Attachment: "turbostat-sampling-issue-fixed-seconds.png" (image/png, 62449 bytes)]