linux-kernel - Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user

Message-ID: <ded06050-83c1-49f3-8eb2-bb086930e226@amd.com>
Date: Mon, 14 Apr 2025 22:04:02 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>, Valentin Schneider
	<vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>, Peter Zijlstra
	<peterz@...radead.org>, Josh Don <joshdon@...gle.com>, Ingo Molnar
	<mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang
	<xii@...gle.com>
CC: <linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chengming Zhou
	<chengming.zhou@...ux.dev>, Chuyi Zhou <zhouchuyi@...edance.com>, Jan Kiszka
	<jan.kiszka@...mens.com>
Subject: Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user

Hello Aaron,

On 4/9/2025 5:37 PM, Aaron Lu wrote:
> This is a continuation of Valentin Schneider's work posted here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> 
> Valentin has described the problem very well in the above link. We also
> see task hangs from time to time in our environment due to cfs quota.
> They are mostly visible with rwsem: when a reader is throttled and a
> writer comes in, the writer has to wait and also makes all subsequent
> readers wait, causing priority inversion or even a whole-system hang.
> 
> To improve this situation, change the throttle model to be task based,
> i.e. when a cfs_rq is throttled, mark its throttled status but do not
> remove it from the cpu's rq. Instead, for tasks that belong to this
> cfs_rq, add a task work to them when they get picked so that they can be
> dequeued when they return to user. In this way, throttled tasks will not
> hold any kernel resources. When the cfs_rq gets unthrottled, enqueue
> those throttled tasks back.
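
The task-based model quoted above can be pictured with a toy user-space
sketch. This is only an illustration of the described flow, not the
series' implementation: the class names, fields and methods below
(Task, CfsRq, pick, return_to_user, ...) are all hypothetical and do not
correspond to kernel symbols.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare tasks by identity, like distinct task structs
class Task:
    name: str
    throttle_work_pending: bool = False  # hypothetical stand-in for a queued task work


@dataclass(eq=False)
class CfsRq:
    throttled: bool = False
    queued: list = field(default_factory=list)          # tasks still on the rq
    throttled_list: list = field(default_factory=list)  # tasks dequeued at return-to-user

    def throttle(self):
        # Only mark the status; nothing is removed from the rq here.
        self.throttled = True

    def pick(self, task):
        # Work is armed at pick time, not at throttle time.
        if self.throttled:
            task.throttle_work_pending = True
        return task

    def return_to_user(self, task):
        # The deferred work runs on the way back to user space, so the task
        # holds no kernel resources when it is finally dequeued.
        if task.throttle_work_pending and self.throttled:
            self.queued.remove(task)
            self.throttled_list.append(task)
        task.throttle_work_pending = False

    def unthrottle(self):
        # Put every throttled task back on the rq.
        self.throttled = False
        self.queued.extend(self.throttled_list)
        self.throttled_list.clear()


rq = CfsRq()
a, b = Task("a"), Task("b")
rq.queued = [a, b]
rq.throttle()               # both tasks stay queued
rq.return_to_user(rq.pick(a))
rq.unthrottle()             # "a" is enqueued back
```

In this sketch a task that is never picked, or that blocks in the kernel
instead of returning to user, is never moved to throttled_list, which is
the partially throttled state discussed further down.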

I tried to reproduce the scenario that Valentin describes in the
parallel thread [1] and I haven't run into a stall yet with this
series applied on top of v6.15-rc1 [2].

So for Patches 1-6, feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@....com>

I'm slowly crawling through the series and haven't gotten to Patch 7
yet, but so far I haven't seen anything unexpected in my initial testing,
and it seems to solve a possible circular dependency on PREEMPT_RT with
bandwidth replenishment (Sebastian has some doubts about whether my
reproducer is correct, but that discussion is for the other thread).

Thank you for working on this and a big thanks to Valentin for the solid
groundwork.

[1] https://lore.kernel.org/linux-rt-users/f2e2c74c-b15d-4185-a6ea-4a19eee02417@amd.com/
[2] https://lore.kernel.org/linux-rt-users/534df953-3cfb-4b3d-8953-5ed9ef24eabc@amd.com/

> 
> This new throttle model has consequences, e.g. for a cfs_rq that has
> 3 tasks attached, when 2 tasks are throttled on their return2user path
> and one task is still running in kernel mode, this cfs_rq is in a
> partially throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
> 
> For the pelt clock, I chose to keep the current behavior of freezing it
> at the cfs_rq's throttle time. The assumption is that tasks running in
> kernel mode should not last too long, and freezing the cfs_rq's pelt
> clock keeps its load and its corresponding sched_entity's weight.
> Hopefully, this results in a stable situation for the remaining running
> tasks to quickly finish their jobs in kernel mode.
> 
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: start accounting when the cfs_rq gets
>    throttled (if cfs_rq->nr_queued > 0) and stop accounting when the
>    cfs_rq gets unthrottled. This has one drawback, e.g. if this cfs_rq
>    has one task when it gets throttled and that task eventually blocks
>    instead of returning to user, then this cfs_rq has no tasks on its
>    throttled list but time is still accounted as throttled. Patch2 and
>    patch3 implement this accounting (simple, fewer code changes).
> - Start accounting when the throttled cfs_rq has at least one task on
>    its throttled list; stop accounting when it's unthrottled. This
>    somewhat over-accounts throttled time because the partial throttle
>    state is accounted.
> - Start accounting when the throttled cfs_rq has no tasks left and its
>    throttled list is not empty; stop accounting when this cfs_rq is
>    unthrottled. This somewhat under-accounts throttled time because the
>    partial throttle state is not accounted. Patch7 implements this
>    accounting.
> I do not have a strong feeling about which accounting is best; it's
> open for discussion.
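
To make the difference between the three accounting options concrete,
here is a toy model that walks a hand-written timeline and sums the
throttled time each option would report. It is a simplification and not
taken from the patches: the policy names, the interval tuple layout
(duration, throttled, tasks still queued, tasks on the throttled list)
and the assumption that state is constant within an interval are all
made up for illustration.

```python
def throttled_time(intervals, policy):
    """Sum accounted throttle time over (dt, throttled, queued, limbo) intervals.

    queued: tasks still on the cfs_rq; limbo: tasks on its throttled list.
    Accounting latches on once a policy's start condition is met and only
    stops when the cfs_rq is unthrottled.
    """
    total = 0
    accounting = False
    was_throttled = False
    for dt, throttled, queued, limbo in intervals:
        if not throttled:
            accounting = False
        elif not accounting:
            if policy == "at_throttle":        # option 1 (Patch2/3 in the cover letter)
                accounting = (not was_throttled) and queued > 0
            elif policy == "first_limbo":      # option 2
                accounting = limbo > 0
            elif policy == "fully_throttled":  # option 3 (Patch7 in the cover letter)
                accounting = queued == 0 and limbo > 0
            else:
                raise ValueError(f"unknown policy: {policy}")
        if accounting:
            total += dt
        was_throttled = throttled
    return total


# 3 tasks: throttled for 9 time units, partially throttled for the first 5.
timeline = [
    (5, False, 3, 0),  # running normally
    (2, True, 3, 0),   # throttled, nobody has returned to user yet
    (3, True, 1, 2),   # partial: 2 tasks dequeued, 1 still in the kernel
    (4, True, 0, 3),   # fully throttled
    (1, False, 3, 0),  # unthrottled
]
for policy in ("at_throttle", "first_limbo", "fully_throttled"):
    print(policy, throttled_time(timeline, policy))
```

With this timeline the three options report 9, 7 and 4 time units
respectively, which matches the over/under-accounting described in the
list above.
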
> 
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2-socket/384-CPU AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For the throttle path, with Chengming's suggestion to move "task work
> setup" from throttle time to pick time, it's not an issue anymore.
> 

[..snip..]

-- 
Thanks and Regards,
Prateek