Message-ID: <20250324085822.GA732629@bytedance>
Date: Mon, 24 Mar 2025 16:58:22 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Xi Wang <xii@...gle.com>
Cc: Chengming Zhou <chengming.zhou@...ux.dev>,
K Prateek Nayak <kprateek.nayak@....com>,
Josh Don <joshdon@...gle.com>,
Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
throttle
On Thu, Mar 20, 2025 at 11:40:11AM -0700, Xi Wang wrote:
...
> I am a bit unsure about the overhead experiment results. Maybe we can add some
> counters to check how many cgroups per cpu are actually touched and how many
> threads are actually dequeued / enqueued for throttling / unthrottling?
Sure thing.
> Looks like busy loop workloads were used for the experiment. With throttling
> deferred to exit_to_user_mode, it would only be triggered by ticks. A large
> runtime debt can accumulate before the on cpu threads are actually dequeued.
> (Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@redhat.com/)
>
> distribute_cfs_runtime would finish early if the quotas are used up by the first
> few cpus, which would also result in throttling/unthrottling for only a few
> runqueues per period. An intermittent workload like hackbench may give us more
> information.
I've added some trace prints and noticed it already involved almost all
cpu rqs on that 2-socket/384-CPU test system, so I suppose it's OK to
continue using that setup as described before:
https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
Below is one print sample:
<idle>-0 [214] d.h1. 1879.281972: distribute_cfs_runtime: cpu214: begins
<idle>-0 [214] dNh2. 1879.283564: distribute_cfs_runtime: cpu214: finishes. unthrottled_rqs=383, unthrottled_cfs_rq=1101, unthrottled_task=69
With async unthrottle, it's not possible to account exactly how many
cfs_rqs are unthrottled and how many tasks are enqueued back; we can
only see how many rqs are involved and how many cfs_rqs/tasks the local
cpu has unthrottled. So this sample means that in
distribute_cfs_runtime(), 383 rqs were involved, and the local cpu
unthrottled 1101 cfs_rqs with a total of 69 tasks enqueued back.
The corresponding bpftrace histogram (duration of
distribute_cfs_runtime(), in nanoseconds) is:
@durations:
[4K, 8K) 9 | |
[8K, 16K) 227 |@@@@@@@@@@@@@@ |
[16K, 32K) 120 |@@@@@@@ |
[32K, 64K) 70 |@@@@ |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 158 |@@@@@@@@@ |
[1M, 2M) 832 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M) 177 |@@@@@@@@@@@ |
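As a rough sanity check on the histogram above (my own back-of-the-envelope
estimate, not part of the original measurements), taking the midpoint of each
power-of-two bucket gives an approximate mean duration:

```python
# Approximate the mean of the bpftrace @durations histogram above by
# weighting each power-of-two bucket's midpoint with its sample count.
# Values are in nanoseconds; K/M follow bpftrace's 1024-based units.
K = 1024
M = K * K
buckets = [  # (low, high, count) taken from the @durations output
    (4 * K, 8 * K, 9),
    (8 * K, 16 * K, 227),
    (16 * K, 32 * K, 120),
    (32 * K, 64 * K, 70),
    (512 * K, 1 * M, 158),
    (1 * M, 2 * M, 832),
    (2 * M, 4 * M, 177),
]
total = sum(c for _, _, c in buckets)
mean_ns = sum((lo + hi) / 2 * c for lo, hi, c in buckets) / total
print(f"{total} samples, mean ~ {mean_ns / 1e6:.2f} ms")
```

which lands the typical distribute_cfs_runtime() call around 1.25 ms,
dominated by the [1M, 2M) bucket.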
Thanks,
Aaron
> See slide 10 of my presentation for more info:
> https://lpc.events/event/18/contributions/1883/attachments/1420/3040/Priority%20Inheritance%20for%20CFS%20Bandwidth%20Control.pdf
>
> Indeed we are seeing more cfsb scalability problems with larger servers. Both
> the irq off time from the cgroup walk and the overheads from per task actions
> can be problematic.
>
> -Xi
View attachment "upstream.diff" of type "text/x-diff" (2513 bytes)