Message-ID: <20250822110701.GB289@bytedance>
Date: Fri, 22 Aug 2025 19:07:01 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
On Fri, Aug 15, 2025 at 05:30:08PM +0800, Aaron Lu wrote:
> On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
... ...
> > I would also suggest running similar benchmarks but with deeper
> > hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> > when tg_unthrottle_up() goes up a bigger tree.
>
> No problem.
>
> I suppose I can reuse the previous shared test script:
> https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
>
> There I used:
> nr_level1=2
> nr_level2=100
> nr_level3=10
>
> But I can tweak these numbers for this performance evaluation. I can
> make the hierarchy 5 levels deep, place tasks in the leaf level
> cgroups and configure quota on the 1st level cgroups.
Tested on Intel EMR (2 sockets, 120 cores, 240 CPUs) and AMD Genoa (2
sockets, 192 cores, 384 CPUs), with turbo/boost disabled, the cpufreq
governor set to performance and all cpuidle states disabled.
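For reference, this kind of configuration is usually done along these
lines (the exact knobs are my assumption, e.g. intel_pstate on the EMR
box and the cpupower utility being available):

    # disable turbo (intel_pstate; the AMD boost knob sits at
    # /sys/devices/system/cpu/cpufreq/boost)
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    # performance governor on all cpus
    cpupower frequency-set -g performance
    # disable all cpuidle states
    cpupower idle-set -D 0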
cgroup hierarchy:
nr_level1=2
nr_level2=2
nr_level3=2
nr_level4=5
nr_level5=5
i.e. two cgroups at the root level, with each level1 cgroup having 2
child cgroups, each level2 cgroup having 2 child cgroups, etc. This
creates a 5-level-deep hierarchy with 200 leaf cgroups. Tasks are
placed in the leaf cgroups. Quotas are set on the two level1 cgroups.
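For reference, a minimal sketch of such a setup with cgroup v2 (the
naming and paths are mine; the actual runs used the script linked
above):

    cd /sys/fs/cgroup
    echo +cpu > cgroup.subtree_control
    for a in 1 2; do
        for b in 1 2; do for c in 1 2; do
            for d in 1 2 3 4 5; do for e in 1 2 3 4 5; do
                mkdir -p lv1-$a/lv2-$b/lv3-$c/lv4-$d/lv5-$e
            done; done
        done; done
        # 50 cpus worth of quota on each level1 cgroup
        echo "5000000 100000" > lv1-$a/cpu.max
    done
    # enable the cpu controller on every internal node so that each
    # level contributes its own cfs_rq to the hierarchy
    find lv1-1 lv1-2 -type d -not -name 'lv5-*' -exec sh -c \
        'echo +cpu > "$1"/cgroup.subtree_control' _ {} \;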
The TLDR is: when there is a very large number of tasks (like 8000
tasks), task based throttle saw a 10-25% performance drop on AMD Genoa;
otherwise, no obvious performance change was observed. Detailed test
results below.
Netperf: measured in throughput, more is better.
- quota set to 50 cpus for each level1 cgroup;
- each leaf cgroup runs a pair of netserver and netperf with the
  following cmdlines:
      netserver -p $port_for_this_cgroup
      netperf -p $port_for_this_cgroup -H 127.0.0.1 -t UDP_RR -c -C -l 30
  i.e. each leaf cgroup has 2 tasks; the total task count is
  2 * 200 = 400 tasks.
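A rough sketch of the per-cgroup launch, reusing the naming from the
setup sketch above (port numbering and loop are mine):

    port=12000
    for leaf in /sys/fs/cgroup/lv1-*/lv2-*/lv3-*/lv4-*/lv5-*; do
        port=$((port + 1))
        # move a shell into the leaf cgroup, then start the pair there
        sh -c 'echo $$ > "$1"/cgroup.procs && netserver -p "$2"' \
            _ "$leaf" "$port"
        sh -c 'echo $$ > "$1"/cgroup.procs &&
               netperf -p "$2" -H 127.0.0.1 -t UDP_RR -c -C -l 30' \
            _ "$leaf" "$port" &
    done
    wait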
On Intel EMR:
                  base           head           diff
    throughput    33305±8.40%    33995±7.84%    noise

On AMD Genoa:
                  base           head           diff
    throughput    5013±1.16%     4967±1.82%     noise
Hackbench: measured in seconds, less is better.
- quota set to 50 cpus for each level1 cgroup;
- each leaf cgroup runs hackbench with the following cmdline:
      hackbench -p -g 1 -l $see_below
  i.e. each leaf cgroup has 20 sender tasks and 20 receiver tasks; the
  total task count is 40 * 200 = 8000 tasks.
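Again as a rough sketch, with my naming from above:

    for leaf in /sys/fs/cgroup/lv1-*/lv2-*/lv3-*/lv4-*/lv5-*; do
        sh -c 'echo $$ > "$1"/cgroup.procs && hackbench -p -g 1 -l "$2"' \
            _ "$leaf" "$loops" &
    done
    wait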
On Intel EMR (loops set to 100000):
                  base           head           diff
    Time          85.45±3.98%    86.41±3.98%    noise

On AMD Genoa (loops set to 20000):
                  base           head           diff
    Time          104±4.33%      116±7.71%      -11.54%
So for this test case, task based throttle suffered a ~10% performance
drop. I also tested on another AMD Genoa (same cpu spec) to make sure
it's not a machine problem, and performance dropped there too:

On the 2nd AMD Genoa (loops set to 50000):
                  base           head           diff
    Time          81±3.13%       101±7.05%      -24.69%
According to perf, __schedule() takes 7.29% of cycles in head while it
takes 4.61% in base. I suppose that with task based throttle,
__schedule() runs more frequently, since tasks in a throttled cfs_rq
have to be dequeued one by one, while with the current behaviour the
whole cfs_rq can be dequeued off the rq in one go. This is most visible
when there are multiple tasks in a single cfs_rq; if there is only 1
task per cfs_rq, things should be roughly the same for the two
throttling models.
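For reference, numbers like the above typically come from something
along these lines (my invocation, not necessarily the exact one used):

    # sample system-wide while the benchmark runs, then check
    # __schedule()'s share of cycles
    perf record -a -g -- sleep 10
    perf report --stdio --no-children | grep __schedule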
With that said, I reduced the task number and retested on this 2nd AMD
Genoa:
- quota set to 50 cpus for each level1 cgroup;
- using only 1 fd pair, i.e. 2 tasks for each cgroup:
      hackbench -p -g 1 -f 1 -l 50000000
  i.e. each leaf cgroup has 1 sender task and 1 receiver task; the
  total task count is 2 * 200 = 400 tasks.
                  base            head            diff
    Time          127.77±2.60%    127.49±2.63%    noise
In this setup, performance is about the same.
Now I'm wondering why on Intel EMR, running that extreme setup (8000
tasks), performance of task based throttle didn't see a noticeable
drop...