Message-ID: <3daac950-3656-4ec4-bbee-7a3bbad6d631@amd.com>
Date: Wed, 30 Apr 2025 14:43:00 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Jean-Baptiste Roquefere <jb.roquefere@...me.com>, Peter Zijlstra
<peterz@...radead.org>, "mingo@...nel.org" <mingo@...nel.org>, Juri Lelli
<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: Borislav Petkov <bp@...en8.de>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Gautham R. Shenoy"
<gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>, "Valentin
Schneider" <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
"regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: IPC drop down on AMD epyc 7702P
(+ more scheduler folks)
tl;dr
JB has a workload that hates aggressive migration on the 2nd Generation
EPYC platform, which has a small LLC domain (4C/8T) and very noticeable
C2C latency.
Based on JB's observations so far, reverting commit 16b0a7a1a0af
("sched/fair: Ensure tasks spreading in LLC during LB") and commit
c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") helps the workload. Both commits allow more aggressive
migrations for the sake of work conservation, but they also increase
cache misses, which slows this workload down quite a bit.
"relax_domain_level" helps but cannot be set at runtime and I couldn't
think of any stable / debug interfaces that JB hasn't tried out
already that can help this workload.
There is a patch towards the end to set "relax_domain_level" at
runtime, but given that cpusets did away with this knob when
transitioning to cgroup-v2, I don't know what the sentiments are
around its usage.
Any input / feedback is greatly appreciated.
On 4/28/2025 1:13 PM, Jean-Baptiste Roquefere wrote:
> Hello Prateek,
>
> thanks for your response.
>
>
>> Looking at the commit logs, it looks like these commits do solve other
>> problems around load balancing and might not be trivial to revert
>> without evaluating the damages.
>
> it's definitely not a productizable workaround!
>
>> The processor you are running on, the AMD EPYC 7702P based on the Zen2
>> architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is
>> perhaps why reducing the thread count to below this limit is helping
>> your workload.
>>
>> What we suspect is that when running the workload, the threads that
>> regularly sleep trigger a newidle balancing which causes them to move
>> to another CCX leading to higher number of L3 misses.
>>
>> To confirm this, would it be possible to run the workload with the
>> not-yet-upstream perf sched stats [1] tool and share the result from
>> perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch
>> to rule out any other second order effect.
>>
>> [1]
>> https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
>
> I had to patch tools/perf/util/session.c : static int
> open_file_read(struct perf_data *data) due to "failed to open perf.data:
> File exists" (looked more like a compiler issue than a tool/perf issue)
>
> $ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched >
> perf.diff (see perf.diff attached)
Thank you for all the information, Jean. I'll highlight the interesting
bits (at least the bits that stood out to me).
(left is mainline, right is mainline with the two commits mentioned by
JB reverted)
total runtime by tasks on this processor (in jiffies) : 123927676874,108531911002 | -12.42% |
total waittime by tasks on this processor (in jiffies) : 34729211241, 27076295778 | -22.04% | ( 28.02%, 24.95% )
total timeslices run on this cpu : 501606, 489799 | -2.35% |
Since "total runtime" is lower on the right, it means that the CPUs
were not as well utilized with the commits reverted however the
reduction in the "total waittime" suggests things are running faster
and on overage there are 0.28 waiting tasks on mainline compared to
0.24 with the commits reverted.
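(For reference, the 0.28 and 0.24 are simply the "total waittime" to
"total runtime" ratios above: 34729211241 / 123927676874 ~= 0.2802 on
mainline and 27076295778 / 108531911002 ~= 0.2495 with the commits
reverted, which also appears to be what the ( 28.02%, 24.95% ) column
reports.)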
---------------------------------------- <Category newidle - SMT> ----------------------------------------
load_balance() count on cpu newly idle : 331664, 31153 | -90.61% | $ 0.15, 1.55 $
load_balance() failed to find busier group on cpu newly idle : 300234, 28470 | -90.52% | $ 0.16, 1.70 $
*load_balance() success count on cpu newly idle : 28386, 1544 | -94.56% |
*avg task pulled per successful lb attempt (cpu newly idle) : 1.00, 1.01 | 0.46% |
---------------------------------------- <Category newidle - MC > ----------------------------------------
load_balance() count on cpu newly idle : 258017, 29345 | -88.63% | $ 0.19, 1.65 $
load_balance() failed to find busier group on cpu newly idle : 131096, 16081 | -87.73% | $ 0.37, 3.01 $
*load_balance() success count on cpu newly idle : 23286, 2181 | -90.63% |
*avg task pulled per successful lb attempt (cpu newly idle) : 1.03, 1.01 | -1.23% |
---------------------------------------- <Category newidle - PKG> ----------------------------------------
load_balance() count on cpu newly idle : 124013, 27086 | -78.16% | $ 0.39, 1.78 $
load_balance() failed to find busier group on cpu newly idle : 11812, 3063 | -74.07% | $ 4.09, 15.78 $
*load_balance() success count on cpu newly idle : 13892, 4739 | -65.89% |
*avg task pulled per successful lb attempt (cpu newly idle) : 1.07, 1.10 | 3.32% |
----------------------------------------------------------------------------------------------------------
Most migrations are from newidle balancing, which seems to move tasks
across cores (> 50% of the time) and across the LLC too (~8% of the time).
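(Rough math behind the "> 50%": of the 65564 successful newidle pulls
on mainline above (28386 SMT + 23286 MC + 13892 PKG), the 37178 at the
MC and PKG levels land the task on a different core, i.e. ~57% of
successful pulls.)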
>
>> Assuming you control these deployments, would it possible to run
>> the workload on a kernel running with "relax_domain_level=2" kernel
>> cmdline that restricts newidle balance to only within the CCX. As a
>> side effect, it also limits task wakeups to the same LLC domain but
>> I would still like to know if this makes a difference to the
>> workload you are running.
> On vanilla 6.12.17 it gives the IPC we expected:
Thank you, JB, for trying out this experiment. I'm not sure what the
views are on "relax_domain_level" and I'm hoping the other
scheduler folks will chime in here - is it a debug knob? Can it
be used in production?
I know it had additional uses with cpuset in cgroup-v1 but it was not
adopted in v2 - are there any nasty historic reasons for this?
>
> +--------------------+--------------------------+-----------------------+
> | | relax_domain_level unset | relax_domain_level=2 |
> +--------------------+--------------------------+-----------------------+
> | Threads | 210 | 210 |
> | Utilization (%) | 65,86 | 52,01 |
> | CPU effective freq | 1 622,93 | 1 294,12 |
> | IPC | 1,14 | 1,42 |
> | L2 access (pti) | 34,36 | 38,18 |
> | L2 miss (pti) | 7,34 | 7,78 |
> | L3 miss (abs) | 39 711 971 741 | 33 929 609 924 |
> | Mem (GB/s) | 70,68 | 49,10 |
> | Context switches | 109 281 524 | 107 896 729 |
> +--------------------+--------------------------+-----------------------+
>
> Kind regards,
>
> JB
JB asked if there is any way to toggle "relax_domain_level" at runtime
on mainline, and I couldn't find any easy way other than using cpusets
with cgroup-v1, which is probably harder to deploy at scale than the
pinning strategy that JB mentioned originally.
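For completeness, the cgroup-v1 route would look roughly like the
below (untested sketch; the mount point is arbitrary and it assumes
the cpuset controller isn't already bound to the v2 hierarchy), since
"cpuset.sched_relax_domain_level" only exists in v1:
# mkdir -p /mnt/cpuset-v1
# mount -t cgroup -o cpuset none /mnt/cpuset-v1
# echo 2 > /mnt/cpuset-v1/cpuset.sched_relax_domain_level
Writing to the root cpuset's file should trigger a sched domain
rebuild with the new relax level, but it still means dragging in the
whole v1 cpuset hierarchy just for this one knob.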
I cannot think of any stable interface that exists today to allow
sticky behavior and mitigate aggressive migration for work
conservation - JB has already tried almost everything available, as
summarized in his original report.
Could something like the below be a stop-gap band-aid for workloads
that don't mind a temporary imbalance in favor of cache hotness?
---
From: K Prateek Nayak <kprateek.nayak@....com>
Subject: [RFC PATCH] sched/debug: Allow overriding "relax_domain_level" at runtime
Jean-Baptiste noted that Ateme's workload experiences poor IPC on a 2nd
Generation EPYC system and narrowed down the major culprits to commit
16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and
commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") both of which enable more aggressive migrations in favor of
work conservation.
The larger C2C latency on the platform, coupled with the small L3
shared by only 4C/8T, makes the downside of aggressive balancing very
prominent. Looking at the perf sched stats report from JB [1], when the
two commits are reverted, the "total runtime" dips by ~12%, showing
better load distribution on mainline, but the "total waittime" dips by
22%, showing that despite the imbalance the workload runs faster. This
improvement correlates with the higher IPC and the reduced L3 misses in
the data shared by JB. Most of the migrations during load balancing can
be attributed to newidle balance.
JB confirmed that using "relax_domain_level=2" on the kernel cmdline
helps this particular workload by restricting the scope of wakeups and
migrations during newidle balancing. However, "relax_domain_level"
operates on topology levels before degeneration, and picking the right
level before inspecting the topology might not be trivial at boot time.
Furthermore, a runtime knob that can quickly tie changes in workload
behavior to aggressive migrations during load balancing can be helpful
when debugging.
Introduce "relax_domain_level" in sched debugfs and allow overriding the
knob at runtime.
# cat /sys/kernel/debug/sched/relax_domain_level
-1
# echo Y > /sys/kernel/debug/sched/verbose
# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
To restrict newidle balancing to within the LLC, "relax_domain_level"
can be set to level 3 (SMT, CLUSTER, *MC*, PKG, NUMA):
# echo 3 > /sys/kernel/debug/sched/relax_domain_level
# cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
"relax_domain_level" forgives short term imbalances. Longer term
imbalances will be eventually caught by the periodic load balancer and
the system will reach a state of balance, only slightly later.
Link: https://lore.kernel.org/all/996ca8cb-3ac8-4f1b-93f1-415f43922d7a@ateme.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
include/linux/sched/topology.h | 6 ++--
kernel/sched/debug.c | 52 ++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 2 +-
3 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..5f59bdc1d5b1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,10 @@ struct sched_domain_attr {
int relax_domain_level;
};
-#define SD_ATTR_INIT (struct sched_domain_attr) { \
- .relax_domain_level = -1, \
+extern int default_relax_domain_level;
+
+#define SD_ATTR_INIT (struct sched_domain_attr) { \
+ .relax_domain_level = default_relax_domain_level, \
}
extern int sched_domain_level_max;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..cc6944b35535 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -214,6 +214,57 @@ static const struct file_operations sched_scaling_fops = {
.release = single_release,
};
+static DEFINE_MUTEX(relax_domain_mutex);
+
+static ssize_t sched_relax_domain_write(struct file *filp,
+ const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ int relax_domain_level;
+ char buf[16];
+
+ if (cnt > 15)
+ cnt = 15;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+ buf[cnt] = '\0';
+
+ if (kstrtoint(buf, 10, &relax_domain_level))
+ return -EINVAL;
+
+ if (relax_domain_level < -1 || relax_domain_level > sched_domain_level_max + 1)
+ return -EINVAL;
+
+ guard(mutex)(&relax_domain_mutex);
+
+ if (relax_domain_level != default_relax_domain_level) {
+ default_relax_domain_level = relax_domain_level;
+ rebuild_sched_domains();
+ }
+
+ *ppos += cnt;
+ return cnt;
+}
+static int sched_relax_domain_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "%d\n", default_relax_domain_level);
+ return 0;
+}
+
+static int sched_relax_domain_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_relax_domain_show, NULL);
+}
+
+static const struct file_operations sched_relax_domain_fops = {
+ .open = sched_relax_domain_open,
+ .write = sched_relax_domain_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
#endif /* SMP */
#ifdef CONFIG_PREEMPT_DYNAMIC
@@ -516,6 +567,7 @@ static __init int sched_init_debug(void)
debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
+ debugfs_create_file("relax_domain_level", 0644, debugfs_sched, NULL, &sched_relax_domain_fops);
sched_domains_mutex_lock();
update_sched_domain_debugfs();
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..eb5c8a9cd904 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1513,7 +1513,7 @@ static void asym_cpu_capacity_scan(void)
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
*/
-static int default_relax_domain_level = -1;
+int default_relax_domain_level = -1;
int sched_domain_level_max;
static int __init setup_relax_domain_level(char *str)
--
Thanks and Regards,
Prateek