Message-ID: <f94a10cb-e65d-4697-875e-43f624f79099@oracle.com>
Date: Wed, 30 Apr 2025 19:46:06 -0700
From: Libo Chen <libo.chen@...cle.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
        Jean-Baptiste Roquefere <jb.roquefere@...me.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "mingo@...nel.org"
 <mingo@...nel.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc: Borislav Petkov <bp@...en8.de>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>,
        "Gautham R. Shenoy" <gautham.shenoy@....com>,
        Swapnil Sapkal <swapnil.sapkal@....com>,
        Valentin Schneider <vschneid@...hat.com>,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>,
        Konrad Wilk <konrad.wilk@...cle.com>
Subject: Re: IPC drop down on AMD epyc 7702P

Hi Prateek,

On 4/30/25 04:29, K Prateek Nayak wrote:
> Hello Libo,
> 
> On 4/30/2025 4:11 PM, Libo Chen wrote:
>>
>>
>> On 4/30/25 02:13, K Prateek Nayak wrote:
>>> (+ more scheduler folks)
>>>
>>> tl;dr
>>>
>>> JB has a workload that hates aggressive migration on the 2nd Generation
>>> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
>>> C2C latency.
>>>
>>> Based on JB's observations so far, reverting commit 16b0a7a1a0af
>>> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
>>> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
>>> condition") helps the workload. Both those commits allow aggressive
>>> migrations for work conservation, but they also increase cache
>>> misses, which slows the workload quite a bit.
>>>
>>> "relax_domain_level" helps but cannot be set at runtime and I couldn't
>>> think of any stable / debug interfaces that JB hasn't tried out
>>> already that can help this workload.
>>>
>>> There is a patch towards the end to set "relax_domain_level" at
>>> runtime, but given that cpusets did away with this when transitioning to
>>> cgroup-v2, I don't know what the sentiments are around its usage.
>>> Any input / feedback is greatly appreciated.
>>>
>>
>>
>> Hi Prateek,
>>
>> Oh no, not "relax_domain_level" again, this can lead to load imbalance
>> in variety of ways. We were so glad this one went away with cgroupv2,
> 
> I agree it is not pretty. JB also tried strategic pinning and they
> did report that things are better overall, but unfortunately it is
> very hard to deploy across multiple architectures and would also
> require some redesign + testing from their application side.
> 

I was stressing more broadly how badly setting "relax_domain_level"
can go wrong if a user doesn't realize it essentially disables newidle
balancing at the higher domain levels, so the ability to balance load
across CCXes or NUMA nodes becomes a lot weaker. A subset of CCXes may
consistently receive much more load for a whole bunch of reasons.
Sometimes this is hard to spot in testing, but it does show up in
real-world scenarios, especially when users have other weird hacks in
place.
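
For anyone not familiar with the knob, the effect is roughly the sketch
below. This is paraphrased from memory of set_domain_attribute() in
kernel/sched/topology.c, so treat the exact comparison and the set of
flags cleared as approximate rather than a literal quote of upstream:

/*
 * Paraphrased sketch: when the sched domains are (re)built, any domain
 * whose level is above the requested relax_domain_level loses the
 * newidle-balance flag (wake balancing is affected as well), so idle
 * CPUs stop pulling tasks across that boundary.
 */
static void set_domain_attribute(struct sched_domain *sd,
				 struct sched_domain_attr *attr)
{
	int request;

	if (!attr || attr->relax_domain_level < 0) {
		if (default_relax_domain_level < 0)
			return;
		request = default_relax_domain_level;
	} else {
		request = attr->relax_domain_level;
	}

	if (sd->level > request) {
		/* Turn off idle balance on this domain: */
		sd->flags &= ~SD_BALANCE_NEWIDLE;
	}
}

So on an EPYC-like topology a low request keeps newidle balancing
inside the LLC but gives it up at the DIE/NUMA levels, which is exactly
the double-edged sword described above.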

>> it tends to be abused by users as an "easy" fix for some urgent perf
>> issues instead of addressing their root causes.
> 
> Was there ever a report of a similar issue where migrations for the
> right reasons have led to performance degradation as a result of the
> platform architecture? I doubt there is a straightforward way to solve
> this using the current interfaces - at least I haven't found one yet.
> 

For us it wasn't due to the platform architecture but rather an
"exotic" NUMA topology (like a cube: a node is one hop away from 3
neighbors and two hops away from the other 4) in combination with
certain user-level settings that caused more wakeups in a subset of
domains. If relax_domain_level is left untouched, you get no load
imbalance but perf is bad. Once you set relax_domain_level to restrict
newidle balancing to the lower domain levels, you actually see better
performance numbers in testing even though CPU loads are not well
balanced. Then one day you find out the imbalance is so bad that it
slows down everything. Luckily it wasn't too hard to fix from the
application side.
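
For reference, on cgroup-v1 the knob is just a per-cpuset file, so
"setting it" is nothing more than writing an integer into
cpuset.sched_relax_domain_level. A minimal sketch is below; the mount
point and cpuset name are made up for illustration, and the file does
not exist on cgroup-v2, which is the whole problem here:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Example path only: adjust to wherever the v1 cpuset
	 * hierarchy is mounted and whichever cpuset the workload
	 * actually runs in. */
	const char *knob =
		"/sys/fs/cgroup/cpuset/mygroup/cpuset.sched_relax_domain_level";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return EXIT_FAILURE;
	}

	/* Level chosen for illustration; see the cgroup-v1 cpusets
	 * documentation for what each value means. */
	fprintf(f, "%d\n", 2);
	fclose(f);
	return EXIT_SUCCESS;
}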

I get that it may not be easy to fix from their application side in
this case, but I still think this is too hacky and one may end up
regretting it.

I certainly want to hear what others think about relax_domain_level!
  
> Perhaps cache-aware scheduling is the way forward to solve these
> set of issues as Peter highlighted.
> 

Hope so! We will start testing that series and provide feedback.


Thanks,
Libo
