linux-kernel - Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

Message-ID: <CAOBoifh+VryF+R4tY=GHtzgO+nRAqER2sXjPCVvhtCgOyyG0Zg@mail.gmail.com>
Date: Thu, 27 Mar 2025 17:11:42 -0700
From: Xi Wang <xii@...gle.com>
To: Aaron Lu <ziqianlu@...edance.com>
Cc: Chengming Zhou <chengming.zhou@...ux.dev>, K Prateek Nayak <kprateek.nayak@....com>, 
	Josh Don <joshdon@...gle.com>, Valentin Schneider <vschneid@...hat.com>, 
	Ben Segall <bsegall@...gle.com>, Peter Zijlstra <peterz@...radead.org>, 
	Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Mel Gorman <mgorman@...e.de>, Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

On Tue, Mar 25, 2025 at 3:02 AM Aaron Lu <ziqianlu@...edance.com> wrote:
>
> On Mon, Mar 24, 2025 at 04:58:22PM +0800, Aaron Lu wrote:
> > On Thu, Mar 20, 2025 at 11:40:11AM -0700, Xi Wang wrote:
> > ...
> > > I am a bit unsure about the overhead experiment results. Maybe we can add some
> > > counters to check how many cgroups per cpu are actually touched and how many
> > > threads are actually dequeued / enqueued for throttling / unthrottling?
> >
> > Sure thing.
> >
> > > Looks like busy loop workloads were used for the experiment. With throttling
> > > deferred to exit_to_user_mode, it would only be triggered by ticks. A large
> > > runtime debt can accumulate before the on cpu threads are actually dequeued.
> > > (Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@redhat.com/)
> > >
> > > distribute_cfs_runtime would finish early if the quotas are used up by the first
> > > few cpus, which would also result in throttling/unthrottling for only a few
> > > runqueues per period. An intermittent workload like hackbench may give us more
> > > information.
> >
> > I've added some trace prints and noticed it already involved almost all
> > cpu rqs on that 2sockets/384cpus test system, so I suppose it's OK to
> > continue to use that setup as described before:
> > https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
>
> One more data point that might be interesting. I've tested this on a
> v5.15 based kernel where async unthrottle is not available yet so things
> should be worse.
>
> As Xi mentioned, since the test program is a cpu hog, I tweaked the quota
> setting to make throttling more likely to happen.
>
> The bpftrace duration of distribute_cfs_runtime() is:
>
> @durations:
> [4K, 8K)               1 |                                                    |
> [8K, 16K)              8 |                                                    |
> [16K, 32K)             1 |                                                    |
> [32K, 64K)             0 |                                                    |
> [64K, 128K)            0 |                                                    |
> [128K, 256K)           0 |                                                    |
> [256K, 512K)           0 |                                                    |
> [512K, 1M)             0 |                                                    |
> [1M, 2M)               0 |                                                    |
> [2M, 4M)               0 |                                                    |
> [4M, 8M)               0 |                                                    |
> [8M, 16M)              0 |                                                    |
> [16M, 32M)             0 |                                                    |
> [32M, 64M)           376 |@@@@@@@@@@@@@@@@@@@@@@@                             |
> [64M, 128M)          824 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> One random trace point from the trace prints is:
>
>           <idle>-0       [117] d.h1. 83206.734588: distribute_cfs_runtime: cpu117: begins
>           <idle>-0       [117] dnh1. 83206.801902: distribute_cfs_runtime: cpu117: finishes: unthrottled_rqs=384, unthrottled_cfs_rq=422784, unthrottled_task=10000
>
> So for the above trace point, distribute_cfs_runtime() unthrottled 384
> rqs with a total of 422784 cfs_rqs and enqueued back 10000 tasks; this
> took about 70ms.
>
> Note that other things like rq lock contention might make things worse -
> I did not notice any lock contention in this setup.
>
> I've attached the corresponding debug diff in case it's not clear what
> this trace print means.
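
A quick sanity check of the quoted numbers, as a minimal sketch: it takes
the two timestamps from the trace lines above, computes the elapsed time,
and finds which power-of-two histogram bucket that duration lands in. It
assumes the @durations histogram is in nanoseconds with K = 1024 and
M = 1024 * 1024; the helper name log2_bucket is made up for illustration.

```python
from decimal import Decimal

# Timestamps (in seconds) copied from the two quoted trace lines.
begin = Decimal("83206.734588")
finish = Decimal("83206.801902")

# Elapsed time in nanoseconds; Decimal avoids float rounding noise.
duration_ns = int((finish - begin) * 1_000_000_000)


def log2_bucket(ns):
    """Return the [low, high) power-of-two bucket that contains ns."""
    low = 1 << (ns.bit_length() - 1)
    return low, low * 2


low, high = log2_bucket(duration_ns)
MI = 1 << 20  # assumption: "M" in the histogram labels means 2**20

print(f"duration: {duration_ns / 1e6:.3f} ms")
print(f"bucket:   [{low // MI}M, {high // MI}M)")

# Rough average cost per unthrottled cfs_rq, using the count from the trace.
# This is only an average; it says nothing about where the time is spent.
print(f"avg per cfs_rq: {duration_ns / 422784:.1f} ns")
```

Under those assumptions the delta is 67.314 ms, which matches the "about
70ms" estimate and falls in the [64M, 128M) bucket where most samples are.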

Thanks for getting the test results!

My understanding is that you now have 2 test configurations and 2
throttling mechanisms (new vs old). I think the two groups of results
were test1 + new method and test2 + old method. Is that the case?

For test1 + new method, we have "..in distribute_cfs_runtime(), 383
rqs are involved and the local cpu has unthrottled 1101 cfs_rqs and a
total of 69 tasks are enqueued back". I think if the workload is in a
steady, persistently over-limit state, we'd have 1000+ tasks
periodically being throttled and unthrottled, at least with the old
method. So "1101 cfs_rqs and a total of 69 tasks are enqueued back"
might be a special case?

I'll also try to create a stress test that mimics our production
problems, but it will take some time.