Message-ID: <1113e83b-1dee-4c40-8e9c-674b33b87ba8@meta.com>
Date: Sun, 15 Jun 2025 20:35:17 -0400
From: Chris Mason <clm@...a.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish
 regressions

On 6/14/25 6:04 AM, Peter Zijlstra wrote:
> On Wed, May 28, 2025 at 09:41:33PM -0400, Chris Mason wrote:
> 
>> I'll get all of these run on the big turin machine, should have some
>> numbers tomorrow.
> 
> Right... so Turin. I had a quick look through our IRC logs but I
> couldn't find exactly which model you had, and unfortunately AMD uses
> the Turin name for both Zen 5c and Zen 5 Epyc :-(

Looks like the one I've been testing on is the Epyc variant.  But,
stepping back for a bit, I bisected a few regressions between 6.9 and 6.13:

- DL server
- DELAY_{DEQUEUE,ZERO}
- PSI fun (not really me, but relevant)

I think these are all important and relevant, but it was strange that
none of these patches seemed to move the needle much on the Turin
machines (+/- futexes), so I went back to the drawing board.

Our internal 6.9 kernel was the "fast" one, and comparing it with
vanilla 6.9, it turns out we'd carried some patches that significantly
improved our web workload on top of vanilla 6.9.

In other words, I've been trying to find a regression that Vernet
actually fixed in 6.9 already.  Bisecting pointed to:

Author: David Vernet <void@...ifault.com>
Date:   Tue May 7 08:15:32 2024 -0700

    Revert "sched/fair: Remove sysctl_sched_migration_cost condition"

    This reverts commit c5b0a7eefc70150caf23e37bc9d639c68c87a097.

Comparing schedstat.py output from the fast and slow kernels... this is
6.9 vs 6.13, but I'll get a comparison tomorrow where the schedstat
version actually matches.

# grep balance slow.stat.out
lb_balance_not_idle: 687
lb_imbalance_not_idle: 0
lb_balance_idle: 659054
lb_imbalance_idle: 2635
lb_balance_newly_idle: 2051682
lb_imbalance_newly_idle: 500328
sbe_balanced: 0
sbf_balanced: 0
ttwu_move_balance: 0

# grep balance fast.stat.out
lb_balance_idle: 606600
lb_imbalance_idle: 1911
lb_balance_not_idle: 680
lb_imbalance_not_idle: 0
lb_balance_newly_idle: 11697
lb_imbalance_newly_idle: 22868
sbe_balanced: 0
sbf_balanced: 0
ttwu_move_balance: 0
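
As a quick sanity check on those numbers (a throwaway parse of the
key: value lines above -- this is not schedstat.py), the newly-idle
counters come out roughly 175x and 22x higher on the slow kernel:

def parse(text):
    stats = {}
    for line in text.strip().splitlines():
        key, _, val = line.partition(":")
        stats[key.strip()] = int(val)
    return stats

slow = parse("""lb_balance_newly_idle: 2051682
lb_imbalance_newly_idle: 500328""")
fast = parse("""lb_balance_newly_idle: 11697
lb_imbalance_newly_idle: 22868""")

for key in slow:
    ratio = slow[key] / fast[key]
    print(f"{key}: slow={slow[key]} fast={fast[key]} (~{ratio:.0f}x)")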

Reverting that commit above on vanilla 6.9 makes us fast.  Disabling
newidle balance completely is fast on our 6.13 kernel, but reverting
that one commit doesn't change much.  I'll switch back to upstream and
compare newidle balance behavior.
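
For reference, a rough way to eyeball that straight from /proc/schedstat
(just a sketch, not schedstat.py -- it only diffs the raw per-domain
counters, since the field layout depends on the schedstat version
reported on the file's first line):

import time

def snapshot():
    counters = {}
    cpu = None
    with open("/proc/schedstat") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if fields[0].startswith("cpu"):
                cpu = key = fields[0]
            elif fields[0].startswith("domain") and cpu:
                key = f"{cpu}/{fields[0]}"
            else:
                continue
            # keep only the trailing run of integer counters, skipping
            # the cpumask (and domain name on newer schedstat versions)
            nums = []
            for tok in reversed(fields[1:]):
                if tok.isdigit():
                    nums.append(int(tok))
                else:
                    break
            counters[key] = list(reversed(nums))
    return counters

before = snapshot()
time.sleep(10)          # sample window while the workload runs
after = snapshot()

for key, new in sorted(after.items()):
    old = before.get(key, [0] * len(new))
    deltas = [n - o for n, o in zip(new, old)]
    if any(deltas):
        print(key, deltas)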

> 
> Anyway, the big and obvious difference between the whole Intel and AMD
> machines is the L3. So far we've been looking at SKL/SPR single L3
> performance, but Turin (whichever that might be) will be having many L3.
> With Zen5 having 8 cores per L3 and Zen5c having 16.
> 
> Additionally, your schbench -M auto thing is doing exactly the wrong
> thing for them. What you want is for those message threads to be spread
> out across the L3s, not all stuck to the first (which is what -M auto
> would end up doing). And then the associated worker threads would
> ideally stick to their respective L3s and not scatter all over the
> machine.
> 
> Anyway, most of the data we shared was about single socket SKL, might be
> we missed some obvious things for the multi-L3 case.
> 
> I'll go poke at some of the things I've so far neglected because of the
> single L3 focus.

You're 100% right about all of this, and I really do want to add some
better smarts to the pinning for both NUMA and chiplets.
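
Roughly the direction I mean -- a sketch only, not schbench's -M auto,
and it assumes the usual Linux sysfs cache topology layout: group CPUs
by the L3 they share, then hand each message thread / worker pool one
L3's CPU mask instead of letting everything pile onto the first L3:

import glob
import os

def parse_cpulist(s):
    """Expand a sysfs cpu list like '0-7,128-135' into a set of ids."""
    cpus = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def l3_groups():
    groups = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list"):
        with open(path) as f:
            cpulist = f.read().strip()
        groups.setdefault(cpulist, parse_cpulist(cpulist))
    return sorted(groups.values(), key=min)

groups = l3_groups()
for i, cpus in enumerate(groups):
    print(f"L3 {i}: {sorted(cpus)}")

if groups:
    # Example: pin this process to the first L3; workers would instead
    # get groups[i % len(groups)] round-robin.
    os.sched_setaffinity(0, groups[0])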

-chris

