Message-ID: <23576497-af63-4074-8724-d75d0dca1817@meta.com>
Date: Tue, 20 May 2025 10:53:32 -0400
From: Chris Mason <clm@...a.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>,
        Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...nel.org>,
        vschneid@...hat.com, Juri Lelli <juri.lelli@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: scheduler performance regression since v6.11

On 5/20/25 10:38 AM, Dietmar Eggemann wrote:
> On 16/05/2025 12:18, Peter Zijlstra wrote:
>> On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:
>>
>> Right, so I can reproduce on Thomas' SKL and maybe see some of it on my
>> SPR.
>>
>> I've managed to discover a whole bunch of ways that ttwu() can explode
>> again :-) But as you surmised, your workload *LOVES* TTWU_QUEUE, and
>> DELAYED_DEQUEUE takes some of that away, because those delayed things
>> remain on-rq and ttwu() can't deal with that other than by doing the
>> wakeup in-line, and that's exactly the thing this workload hates most.
>>
>> (I'll keep poking at ttwu() to see if I can get a combination of
>> TTWU_QUEUE and DELAYED_DEQUEUE that does not explode in 'fun' ways)
>>
>> However, I've found that flipping the default in ttwu_queue_cond() seems
>> to make up for quite a bit -- for your workload.
>>
>> (basically, all the work we can get away from those pinned message CPUs
>> is a win)
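
For anyone following along: TTWU_QUEUE means the waking CPU queues the
wakeup on the wakee's CPU (via an IPI) and lets that CPU do the enqueue,
instead of grabbing the remote runqueue itself; delayed-dequeue tasks are
still on-rq, so they can't take that path.  Here's a toy model of the
choice -- made-up fields and policies, only a loose approximation of
ttwu_queue_cond(), not kernel code:

/*
 * Toy model only.  "queue" means hand the wakeup to the wakee's CPU
 * (the TTWU_QUEUE path) so the waker does almost no work; "inline"
 * means the waker grabs the remote rq and does the enqueue itself.
 */
#include <stdbool.h>
#include <stdio.h>

struct wakee {
	bool sched_delayed;  /* still on-rq due to DELAYED_DEQUEUE */
	bool shares_cache;   /* wakee CPU shares LLC with the waker */
	int  nr_running;     /* tasks already on the wakee CPU */
};

/* Rough stand-in for the current default: queue only for remote-cache
 * or idle target CPUs; delayed tasks always force the inline path. */
static bool queue_wakeup_default(const struct wakee *w)
{
	if (w->sched_delayed)
		return false;
	if (!w->shares_cache)
		return true;
	return w->nr_running == 0;
}

/* "Flipped default": queue whenever it's legal at all. */
static bool queue_wakeup_flipped(const struct wakee *w)
{
	return !w->sched_delayed;
}

int main(void)
{
	const struct wakee cases[] = {
		{ false, true,  2 },	/* busy sibling CPU      */
		{ false, false, 2 },	/* busy remote-cache CPU */
		{ true,  true,  0 },	/* delayed-dequeue wakee */
	};

	for (unsigned i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
		printf("case %u: default=%-6s flipped=%s\n", i,
		       queue_wakeup_default(&cases[i]) ? "queue" : "inline",
		       queue_wakeup_flipped(&cases[i]) ? "queue" : "inline");
	return 0;
}

The flipped variant is the "push everything off the pinned message CPUs"
direction Peter describes above.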
>>
>> Also, meanwhile you discovered that the other part of your performance
>> woes were due to dl_server, specifically, disabling that gave you back a
>> healthy chunk of your performance.
>>
>> The problem is indeed that we toggle the dl_server on every nr_running
>> transition to and from 0, and your workload has a shit-ton of those, so
>> every time we pay the overhead of starting and stopping this thing.
>>
>> In hindsight, that's a fairly stupid setup, and the below patch changes
>> this to keep the dl_server around until it hasn't seen fair activity for
>> a whole period. This appears to fully recover the dip.
>>
>> Trouble seems to be that dl_server_update() always gets tickled by
>> random garbage, so in the end the dl_server never stops... oh well.
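
To make the start/stop overhead concrete, here's a toy model (made-up
period and event stream, nothing to do with the actual patch) comparing
"stop the dl_server on every 0-transition" with "keep it armed until a
full period passes with no fair activity":

/* Toy model only: counts dl_server start/stop transitions under an
 * eager policy vs. a deferred one for a workload whose fair
 * nr_running bounces between 0 and 1 every 10us. */
#include <stdbool.h>
#include <stdio.h>

#define PERIOD_NS 1000000LL		/* pretend dl_server period: 1ms */

struct server {
	bool      running;
	long long last_fair_ns;		/* last time fair tasks were runnable */
	long      starts, stops;
};

/* Stop immediately whenever fair nr_running hits 0. */
static void eager(struct server *s, int nr_running, long long now)
{
	(void)now;
	if (nr_running && !s->running) { s->running = true;  s->starts++; }
	if (!nr_running && s->running) { s->running = false; s->stops++;  }
}

/* Stop only after a whole period with no fair activity. */
static void deferred(struct server *s, int nr_running, long long now)
{
	if (nr_running) {
		s->last_fair_ns = now;
		if (!s->running) { s->running = true; s->starts++; }
	} else if (s->running && now - s->last_fair_ns > PERIOD_NS) {
		s->running = false;
		s->stops++;
	}
}

int main(void)
{
	struct server e = { 0 }, d = { 0 };

	for (long long now = 0; now < 100 * PERIOD_NS; now += 10000) {
		int nr_running = (now / 10000) & 1;	/* 1,0,1,0,... */

		eager(&e, nr_running, now);
		deferred(&d, nr_running, now);
	}
	printf("eager:    %ld starts, %ld stops\n", e.starts, e.stops);
	printf("deferred: %ld starts, %ld stops\n", d.starts, d.stops);
	return 0;
}

With an event stream like that, the eager policy pays thousands of
start/stop pairs, while the deferred one starts once and (as noted for
the real thing) basically never stops, because the fair side keeps
refreshing it well inside the period.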
>>
>> Juri, could you have a look at this? Perhaps I messed up something
>> trivial -- it's been like that this week :/
> 
> On the same VM I use as a SUT for the 'hammerdb-mysqld' tests:
> 
> https://lkml.kernel.org/r/d6692902-837a-4f30-913b-763f01a5a7ea@arm.com 
> 
> I can't spot any v6.11-related changes (dl_server or TTWU_QUEUE), but a
> PSI-related one for v6.12 results in a ~8% schbench regression.
> 
> VM (m7gd.16xlarge, 16 logical CPUs) on Graviton3:
> 
> schbench -L -m 4 -M auto -t 128 -n 0 -r 60
> 
> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race

I also saw a regression on this one, but it wasn't stable enough for me
to be sure.  I'll retest, but I'm guessing this is made worse by the VM
/ Graviton setup?

I've been testing Peter's changes, and they do help on my Skylake box
but not as much on the big Turin machines.  I'm trying to sort that out,
but we have some other variables wrt PGO/LTO that I need to rule out.

-chris
