Message-ID: <20250903071410.GA42@bytedance>
Date: Wed, 3 Sep 2025 15:14:10 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> On Fri, Aug 15, 2025 at 05:30:08PM +0800, Aaron Lu wrote:
> > On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
> ... ...
> > > I would also suggest running similar benchmarks but with deeper
> > > hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> > > when tg_unthrottle_up() goes up a bigger tree.
> > 
> > No problem.
> > 
> > I suppose I can reuse the previous shared test script:
> > https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
> > 
> > There I used:
> > nr_level1=2
> > nr_level2=100
> > nr_level3=10
> > 
> > But I can tweak these numbers for this performance evaluation. I can make
> > the hierarchy 5 levels deep, place tasks in the leaf level cgroups
> > and configure quota on the 1st level cgroups.
> 
> Tested on Intel EMR (2 sockets, 120 cores, 240 CPUs) and AMD Genoa (2
> sockets, 192 cores, 384 CPUs), with turbo/boost disabled, the cpufreq
> governor set to performance and all cpuidle states disabled.
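> 
> (Roughly, that machine setup amounts to something like the below; this is
> only a sketch, the exact sysfs knobs depend on the cpufreq/cpuidle drivers
> in use on each box:)
> 
>   # performance governor on all CPUs
>   for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>           echo performance > $g
>   done
>   # disable turbo/boost (intel_pstate knob on EMR, generic boost knob on Genoa)
>   [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ] && \
>           echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
>   [ -w /sys/devices/system/cpu/cpufreq/boost ] && \
>           echo 0 > /sys/devices/system/cpu/cpufreq/boost
>   # disable all cpuidle states (or boot with cpuidle.off=1)
>   for f in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
>           echo 1 > $f
>   done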
> 
> cgroup hierarchy:
> nr_level1=2
> nr_level2=2
> nr_level3=2
> nr_level4=5
> nr_level5=5
> i.e. two cgroups at the root level, each level1 cgroup having 2 child
> cgroups, each level2 cgroup having 2 child cgroups, etc. This creates a
> 5-level-deep hierarchy with 200 leaf cgroups. Tasks are placed in the leaf
> cgroups. Quota is set on the two level1 cgroups.
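> 
> A minimal sketch of how such a hierarchy could be created (assuming cgroup
> v2 mounted at /sys/fs/cgroup and the default 100000us cfs period, so a 50
> cpu quota is "5000000 100000"; the actual runs used a tweaked version of
> the script linked above, the names below are just for illustration):
> 
>   #!/bin/bash
>   root=/sys/fs/cgroup
>   echo +cpu > $root/cgroup.subtree_control
>   for a in 1 2; do
>       l1=$root/l1_$a; mkdir $l1
>       echo "5000000 100000" > $l1/cpu.max    # quota on level1 only
>       echo +cpu > $l1/cgroup.subtree_control
>       for b in 1 2; do
>           l2=$l1/l2_$b; mkdir $l2; echo +cpu > $l2/cgroup.subtree_control
>           for c in 1 2; do
>               l3=$l2/l3_$c; mkdir $l3; echo +cpu > $l3/cgroup.subtree_control
>               for d in 1 2 3 4 5; do
>                   l4=$l3/l4_$d; mkdir $l4; echo +cpu > $l4/cgroup.subtree_control
>                   for e in 1 2 3 4 5; do
>                       mkdir $l4/l5_$e        # tasks go into these 200 leaves
>                   done
>               done
>           done
>       done
>   done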
> 
> The TL;DR is: when there is a very large number of tasks (like 8000 tasks),
> task based throttle saw a 10-20% performance drop on AMD Genoa; otherwise,
> no obvious performance change was observed. Detailed test results below.
> 
> Netperf: measured in throughput, more is better
> - quota set to 50 cpus for each level1 cgroup;
> - each leaf cgroup runs a pair of netserver and netperf with the following
>   cmdlines (see the placement sketch below):
>   netserver -p $port_for_this_cgroup
>   netperf -p $port_for_this_cgroup -H 127.0.0.1 -t UDP_RR -c -C -l 30
>   i.e. each cgroup has 2 tasks, total task number is 2 * 200 = 400
>   tasks.
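> 
> (As a sketch of the placement part: each pair can be moved into its leaf
> cgroup before it starts, e.g. as below; the hackbench instances further
> down are placed the same way. The run_pair helper and the cgroup/port
> values are just for illustration:)
> 
>   run_pair() {
>       local cg=$1 port=$2
>       echo $BASHPID > $cg/cgroup.procs    # move this (sub)shell into the leaf cgroup
>       netserver -p $port                  # forked daemon inherits the cgroup
>       netperf -p $port -H 127.0.0.1 -t UDP_RR -c -C -l 30
>   }
>   ( run_pair /sys/fs/cgroup/l1_1/l2_1/l3_1/l4_1/l5_1 20001 ) &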
> 
> On Intel EMR:
>               base            head         diff
> throughput    33305±8.40%     33995±7.84%  noise
> 
> On AMD Genoa:
>               base            head         diff
> throughput    5013±1.16%      4967±1.82%   noise
> 
> 
> Hackbench, measured in seconds, less is better:
> - quota set to 50 cpus for each level1 cgroup;
> - each cgroup runs with the following cmdline:
>   hackbench -p -g 1 -l $see_below
> i.e. each leaf cgroup has 20 sender tasks and 20 receiver tasks, total
> task number is 40 * 200 = 8000 tasks.
> 
> On Intel EMR(loops set to 100000):
> 
>          base             head              diff
> Time     85.45±3.98%      86.41±3.98%       noise
> 
> On AMD Genoa(loops set to 20000):
> 
>          base             head              diff
> Time     104±4.33%        116±7.71%        -11.54%
> 
> So for this test case, task based throttle suffered a ~10% performance
> drop. I also tested on another AMD Genoa (same CPU spec) to make sure
> it's not a machine-specific problem, and performance dropped there too:
> 
> On 2nd AMD Genoa(loops set to 50000)
> 
>         base             head               diff
> Time    81±3.13%         101±7.05%         -24.69%
> 
> According to perf, __schedule() in head takes 7.29% of cycles while in base
> it takes 4.61%. I suppose with task based throttle, __schedule() is invoked
> more frequently, since tasks in a throttled cfs_rq have to be dequeued
> one by one while with the current behaviour, the cfs_rq can be dequeued off
> the rq in one go. This is most obvious when there are multiple tasks in a
> single cfs_rq; if there is only 1 task per cfs_rq, things should be roughly
> the same for the two throttling models.
> 
> With this said, I reduced the task number and retested on this 2nd AMD
> Genoa:
> - quota set to 50 cpus for each level1 cgroup;
> - using only 1 fd pair, i.e. 2 tasks for each cgroup:
>   hackbench -p -g 1 -f 1 -l 50000000
>   i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
>   task number is 2 * 200 = 400 tasks.
> 
>         base             head               diff
> Time    127.77±2.60%     127.49±2.63%       noise
> 
> In this setup, performance is about the same.
> 
> Now I'm wondering why on Intel EMR, running that extreme setup (8000
> tasks), task based throttle didn't see a noticeable performance drop...

Looks like hackbench doesn't like task migration on this AMD system
(domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
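
For reference, the domain names and the raw per-domain schedstat lines can
be double checked with something like the below (assuming debugfs is
mounted and a kernel recent enough to expose the sched domain info there):

  # sched domain names as seen from cpu0
  cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
  # per-domain load balance counters; lb_gained is one of the per-idle-state
  # counter groups in these lines (exact field offsets depend on the
  # schedstat version reported at the top of the file)
  head -1 /proc/schedstat
  grep '^domain' /proc/schedstat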

If I revert patch5 and run this 40 * 200 = 8000 task hackbench workload
again, performance is roughly the same now (head~1 is slightly worse, but
given the 4+% stddev in base, it can be considered within the noise range):

         base              head~1(patch1-4)    diff     head(patch1-5)
Time     82.55±4.82%       84.45±2.70%         -2.3%    99.69±6.71%

According to /proc/schedstat, the lb_gained for domain2 is:

         NOT_IDLE     IDLE    NEWLY_IDLE
base         0        8052         81791
head~1       0        7197        175096
head         1       14818        793065

Other domains show similar numbers: base has the smallest migration count
while head has the most, and head~1 reduces the number a lot. I suppose
this is expected, because patch5 removed the throttled_lb_pair() restriction
and that can cause runnable tasks in a throttled hierarchy to be balanced
to other cpus, while in base this cannot happen.

I think patch5 still makes sense and is correct; it's just that this
specific workload doesn't like task migrations. Intel EMR doesn't suffer
from this, I suppose because EMR has a much larger LLC while AMD Genoa has
a relatively small LLC, and task migrations across the LLC boundary hurt
hackbench's performance.

I also tried applying the below hack to test this "task migration across
the LLC boundary hurts hackbench" theory on both base and head:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..34c5f6b75e53d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9297,6 +9297,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
+	if (!(env->sd->flags & SD_SHARE_LLC))
+		return 0;
+
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
 
With this diff applied, the result is:


         base'              head'              diff
Time     74.78±8.2%         78.87±15.4%        -5.47%

base': base + above diff
head': head + above diff

So both perform better now, but with much larger variance; I guess that's
because there is no load balancing on domain2 and above now. head' is still
worse than base', but not by as much as before.

To conclude: hackbench doesn't like task migration, especially when tasks
are migrated across the LLC boundary. patch5 removed the restriction
against balancing throttled tasks; this causes more balancing to happen,
and hackbench doesn't like that. But balancing has its own merit and could
still benefit other workloads, so I think patch5 should stay, especially
considering that when throttled tasks are eventually dequeued, they will
not stay on the rq's cfs_tasks list, so no special care is needed for them
when doing load balance.

On a side note: should we increase the cost of balancing tasks across the
LLC boundary? I tried enlarging sysctl_sched_migration_cost 100x for
domains without SD_SHARE_LLC in task_hot(), but that didn't help.
