Message-ID: <20250822110701.GB289@bytedance>
Date: Fri, 22 Aug 2025 19:07:01 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Ben Segall <bsegall@...gle.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

On Fri, Aug 15, 2025 at 05:30:08PM +0800, Aaron Lu wrote:
> On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
[...]
> > I would also suggest running similar benchmarks but with deeper
> > hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> > when tg_unthrottle_up() goes up a bigger tree.
> 
> No problem.
> 
> I suppose I can reuse the previous shared test script:
> https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
> 
> There I used:
> nr_level1=2
> nr_level2=100
> nr_level3=10
> 
> But I can tweak these numbers for this performance evaluation. I can make
> the hierarchy 5 levels deep, place tasks in the leaf level cgroups and
> configure quota on the 1st level cgroups.

Tested on Intel EMR (2 sockets, 120 cores, 240 CPUs) and AMD Genoa (2
sockets, 192 cores, 384 CPUs), with turbo/boost disabled, the cpufreq
governor set to performance and all cpuidle states disabled.
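
For anyone who wants to reproduce: this kind of setup is typically done
with something like the below sketch. The exact sysfs knobs depend on
the cpufreq/cpuidle drivers in use, so treat the paths as assumptions:

  # disable turbo (intel_pstate knob; on AMD the knob is usually
  # /sys/devices/system/cpu/cpufreq/boost)
  echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
  # switch every cpu to the performance governor
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $g
  done
  # disable all cpuidle states with non-zero exit latency
  cpupower idle-set -D 0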

cgroup hierarchy:
nr_level1=2
nr_level2=2
nr_level3=2
nr_level4=5
nr_level5=5
i.e. two cgroups at the root level, each level1 cgroup having 2 child
cgroups, each level2 cgroup having 2 child cgroups, and so on. This
creates a 5-level-deep hierarchy with 2 * 2 * 2 * 5 * 5 = 200 leaf
cgroups. Tasks are placed in the leaf cgroups and quota is set on the
two level1 cgroups.
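
The hierarchy can be created with something like the below sketch,
assuming cgroup v2 mounted at /sys/fs/cgroup and the default 100ms cfs
period (so 50 cpus of quota is 5000000us runtime per 100000us period):

  cd /sys/fs/cgroup
  echo +cpu > cgroup.subtree_control
  for a in 1 2; do
      mkdir -p l$a
      # 50 cpus worth of quota, assuming the default 100ms period
      echo "5000000 100000" > l$a/cpu.max
      echo +cpu > l$a/cgroup.subtree_control
      for b in 1 2; do
          mkdir -p l$a/l$b
          echo +cpu > l$a/l$b/cgroup.subtree_control
          for c in 1 2; do
              mkdir -p l$a/l$b/l$c
              echo +cpu > l$a/l$b/l$c/cgroup.subtree_control
              for d in 1 2 3 4 5; do
                  mkdir -p l$a/l$b/l$c/l$d
                  echo +cpu > l$a/l$b/l$c/l$d/cgroup.subtree_control
                  for e in 1 2 3 4 5; do
                      mkdir -p l$a/l$b/l$c/l$d/l$e
                  done
              done
          done
      done
  done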

The TL;DR: with a very large number of tasks (like 8000), task based
throttle saw a 10-20% performance drop on AMD Genoa; otherwise, no
obvious performance change was observed. Detailed test results below.

Netperf, measured in throughput, more is better:
- quota set to 50 cpus for each level1 cgroup;
- each leaf cgroup runs a pair of netserver and netperf with the
  following cmdlines (see the sketch after this list):
  netserver -p $port_for_this_cgroup
  netperf -p $port_for_this_cgroup -H 127.0.0.1 -t UDP_RR -c -C -l 30
  i.e. each cgroup has 2 tasks, so the total task number is 2 * 200 =
  400 tasks.
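
Roughly, each pair is started like below (a bash sketch; the port
numbering and the find expression for locating leaf cgroups are
assumptions):

  port=12000
  for leaf in $(find /sys/fs/cgroup/l[12] -mindepth 4 -maxdepth 4 -type d); do
      port=$((port + 1))
      (
          # move this subshell into the leaf cgroup so the
          # netserver/netperf children are spawned inside it
          echo $BASHPID > $leaf/cgroup.procs
          netserver -p $port
          netperf -p $port -H 127.0.0.1 -t UDP_RR -c -C -l 30
      ) &
  done
  wait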

On Intel EMR:
              base            head         diff
throughput    33305±8.40%     33995±7.84%  noise

On AMD Genoa:
              base            head         diff
throughput    5013±1.16%      4967±1.82%   noise


Hackbench, measured in seconds, less is better:
- quota set to 50 cpus for each level1 cgroup;
- each leaf cgroup runs with the following cmdline:
  hackbench -p -g 1 -l $see_below
  i.e. each leaf cgroup has 20 sender tasks and 20 receiver tasks, so
  the total task number is 40 * 200 = 8000 tasks.

On Intel EMR (loops set to 100000):

         base             head              diff
Time     85.45±3.98%      86.41±3.98%       noise

On AMD Genoa (loops set to 20000):

         base             head              diff
Time     104±4.33%        116±7.71%        -11.54%

So for this test case, task based throttle suffered a ~10% performance
drop. I also tested on another AMD Genoa (same CPU spec) to make sure
it's not a machine-specific problem, and performance dropped there too:

On the 2nd AMD Genoa (loops set to 50000):

        base             head               diff
Time    81±3.13%         101±7.05%         -24.69%

According to perf, __schedule() in head takes 7.29% of cycles while in
base it takes 4.61%. I suppose with task based throttle, __schedule()
is more frequent since tasks in a throttled cfs_rq have to be dequeued
one by one, while with the current behaviour the whole cfs_rq can be
dequeued off the rq in one go. This is most visible when there are
multiple tasks in a single cfs_rq; if there is only 1 task per cfs_rq,
things should be roughly the same for the two throttling models.
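
For reference, the __schedule() cycle share can be collected with
something like this (a sketch, not necessarily the exact invocation
used here):

  # system-wide profile while the benchmark is running, then check
  # the overhead attributed to __schedule in the report
  perf record -a -g -- sleep 30
  perf report --stdio --sort symbol | grep __schedule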

With that said, I reduced the task number and retested on this 2nd AMD
Genoa:
- quota set to 50 cpus for each level1 cgroup;
- using only 1 fd pair, i.e. 2 tasks for each cgroup:
  hackbench -p -g 1 -f 1 -l 50000000
  i.e. each leaf cgroup has 1 sender task and 1 receiver task, so the
  total task number is 2 * 200 = 400 tasks.

        base             head               diff
Time    127.77±2.60%     127.49±2.63%       noise

In this setup, performance is about the same.

Now I'm wondering why on Intel EMR, running that extreme setup (8000
tasks), the performance of task based throttle didn't see a noticeable
drop...
