Message-ID: <20250325100225.GA1539283@bytedance>
Date: Tue, 25 Mar 2025 18:02:25 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Xi Wang <xii@...gle.com>
Cc: Chengming Zhou <chengming.zhou@...ux.dev>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Josh Don <joshdon@...gle.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
 throttle

On Mon, Mar 24, 2025 at 04:58:22PM +0800, Aaron Lu wrote:
> On Thu, Mar 20, 2025 at 11:40:11AM -0700, Xi Wang wrote:
> ...
> > I am a bit unsure about the overhead experiment results. Maybe we can add some
> > counters to check how many cgroups per cpu are actually touched and how many
> > threads are actually dequeued / enqueued for throttling / unthrottling?
> 
> Sure thing.
> 
> > Looks like busy loop workloads were used for the experiment. With throttling
> > deferred to exit_to_user_mode, it would only be triggered by ticks. A large
> > runtime debt can accumulate before the on cpu threads are actually dequeued.
> > (Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@redhat.com/)
> > 
> > distribute_cfs_runtime would finish early if the quotas are used up by the first
> > few cpus, which would also result in throttling/unthrottling for only a few
> > runqueues per period. An intermittent workload like hackbench may give us more
> > information.
> 
> I've added some trace prints and noticed it already involved almost all
> cpu rqs on that 2-socket/384-cpu test system, so I suppose it's OK to
> continue using that setup as described before:
> https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/

One more data point that might be interesting: I've also tested this on a
v5.15-based kernel, where async unthrottle is not available yet, so things
should be worse there.

As Xi mentioned, the test program is a cpu hog, so I tweaked the quota
setting to make throttling happen more often.
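
For reference, with the cgroup v1 cpu controller this kind of quota tweak
can be done along the lines of the below; the cgroup path and values are
only illustrative, not the exact ones used in this test:

  # shrink the quota so that demand exceeds it and throttling
  # kicks in every period
  echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
  echo 50000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us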

The bpftrace histogram of distribute_cfs_runtime() durations (in ns) is:

@durations:
[4K, 8K)               1 |                                                    |
[8K, 16K)              8 |                                                    |
[16K, 32K)             1 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               0 |                                                    |
[2M, 4M)               0 |                                                    |
[4M, 8M)               0 |                                                    |
[8M, 16M)              0 |                                                    |
[16M, 32M)             0 |                                                    |
[32M, 64M)           376 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[64M, 128M)          824 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
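
For reference, a duration histogram like the above can be collected with a
bpftrace one-liner along the lines of the below (a minimal sketch, assuming
distribute_cfs_runtime() is visible to kprobes on this kernel):

  bpftrace -e '
  kprobe:distribute_cfs_runtime { @start[tid] = nsecs; }
  kretprobe:distribute_cfs_runtime /@start[tid]/ {
          @durations = hist(nsecs - @start[tid]);
          delete(@start[tid]);
  }'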

One random trace point from the trace prints is:

          <idle>-0       [117] d.h1. 83206.734588: distribute_cfs_runtime: cpu117: begins
          <idle>-0       [117] dnh1. 83206.801902: distribute_cfs_runtime: cpu117: finishes: unthrottled_rqs=384, unthrottled_cfs_rq=422784, unthrottled_task=10000

So for the above trace point, distribute_cfs_runtime() unthrottled 384
rqs with a total of 422784 cfs_rqs (roughly 1100 cfs_rqs per rq) and
enqueued back 10000 tasks; this took about 70ms.

Note that other things like rq lock contention could make this even worse;
I did not notice any lock contention in this setup, though.

I've attached the corresponding debug diff in case it's not clear what
this trace print means.

View attachment "5.15.diff" of type "text/x-diff" (2899 bytes)
