Message-ID: <4d0e1fa3-1faa-4dd2-95a1-00e7ca48aa42@linux.dev>
Date: Mon, 14 Apr 2025 11:05:30 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Aaron Lu <ziqianlu@...edance.com>,
Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>, Josh Don <joshdon@...gle.com>,
Ingo Molnar <mingo@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Xi Wang <xii@...gle.com>
Cc: linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>
Subject: Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user

On 2025/4/9 20:07, Aaron Lu wrote:
> This is a continuation of the work in Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>
> Valentin has described the problem very well in the above link. We also
> see task hung problems from time to time in our environment due to CFS quota.
> It is mostly visible with rwsem: when a reader is throttled and a writer
> comes in and has to wait, the writer also makes all subsequent readers wait,
> causing priority inversion or even a whole-system hang.
>
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, mark its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> those throttled tasks.
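
A minimal userspace sketch of the model described above, not kernel code.
The toy_* types and functions are hypothetical names for illustration, and
the check for a cfs_rq that was unthrottled before the task reached user
space is an assumption, not something stated in the cover letter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct toy_task {
	bool queued;        /* still on the cfs_rq, may be picked to run */
	bool throttle_work; /* task work armed, runs on return to user */
	bool in_limbo;      /* dequeued by the throttle work */
};

struct toy_cfs_rq {
	bool throttled;
	struct toy_task *tasks[TOY_MAX_TASKS];
	size_t nr_tasks;
};

static void toy_attach(struct toy_cfs_rq *cfs_rq, struct toy_task *p)
{
	assert(cfs_rq->nr_tasks < TOY_MAX_TASKS);
	p->queued = true;
	p->throttle_work = false;
	p->in_limbo = false;
	cfs_rq->tasks[cfs_rq->nr_tasks++] = p;
}

/* Throttle only marks the cfs_rq; its tasks stay queued. */
static void toy_throttle(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->throttled = true;
}

/* At pick time, a task in a throttled cfs_rq gets the task work armed. */
static void toy_pick(struct toy_cfs_rq *cfs_rq, struct toy_task *p)
{
	if (cfs_rq->throttled && p->queued)
		p->throttle_work = true;
}

/* On return to user, the armed work dequeues the task into limbo. */
static void toy_return_to_user(struct toy_cfs_rq *cfs_rq, struct toy_task *p)
{
	if (!p->throttle_work)
		return;
	p->throttle_work = false;
	/* Assumption: nothing to do if the cfs_rq was unthrottled meanwhile. */
	if (!cfs_rq->throttled)
		return;
	p->queued = false;
	p->in_limbo = true;
}

/* Unthrottle enqueues every task in limbo back. */
static void toy_unthrottle(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->throttled = false;
	for (size_t i = 0; i < cfs_rq->nr_tasks; i++) {
		struct toy_task *p = cfs_rq->tasks[i];

		if (p->in_limbo) {
			p->in_limbo = false;
			p->queued = true;
		}
	}
}

/* Number of tasks that are still queued, i.e. not in limbo. */
static size_t toy_nr_queued(const struct toy_cfs_rq *cfs_rq)
{
	size_t n = 0;

	for (size_t i = 0; i < cfs_rq->nr_tasks; i++)
		n += cfs_rq->tasks[i]->queued;
	return n;
}
```

With two tasks attached, toy_throttle() leaves both queued. Only a task
that is picked and then returns to user moves to limbo, and
toy_unthrottle() brings it back.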
>
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
>
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.

Seems reasonable to me, although I wonder whether it is possible or
desirable to implement a per-task PELT freeze?

>
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: starts accounting when cfs_rq gets
> throttled (if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
> gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
> task when it gets throttled and eventually, that task doesn't return
> to user but blocks, then this cfs_rq has no tasks on throttled list
> but time is accounted as throttled; Patch2 and patch3 implement this
> accounting (simple, fewer code changes).
> - Starts accounting when the throttled cfs_rq has at least one task on
> its throttled list; stops accounting when it's unthrottled. This
> somewhat over-accounts throttled time because the partial throttle state
> is accounted.
> - Starts accounting when the throttled cfs_rq has no tasks left and its
> throttled list is not empty; stops accounting when this cfs_rq is
> unthrottled. This somewhat under-accounts throttled time because the partial
> throttle state is not accounted. Patch7 implements this accounting.
> I do not have a strong opinion on which accounting is best; it's open
> for discussion.
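
The three accounting options above can be compared with a small userspace
sketch, not kernel code. The toy_acct structure, the TOY_OPT_* names and
the plain integer timestamps are hypothetical, chosen only to show when
each option starts its clock:

```c
#include <stdbool.h>
#include <stdint.h>

enum {
	TOY_OPT_ON_THROTTLE, /* option 1: from throttle, if tasks are queued */
	TOY_OPT_FIRST_LIMBO, /* option 2: from the first task in limbo */
	TOY_OPT_ALL_LIMBO,   /* option 3: from the point no task is left queued */
	TOY_NR_OPTS
};

/* Hypothetical state for illustration; not the kernel's cfs_rq. */
struct toy_acct {
	bool throttled;
	unsigned int nr_queued; /* tasks still on the cfs_rq */
	unsigned int nr_limbo;  /* tasks on the throttled (limbo) list */
	bool running[TOY_NR_OPTS];
	uint64_t start[TOY_NR_OPTS];
	uint64_t total[TOY_NR_OPTS];
};

static void toy_start(struct toy_acct *a, int opt, uint64_t now)
{
	if (!a->running[opt]) {
		a->running[opt] = true;
		a->start[opt] = now;
	}
}

/* Re-check the start conditions of options 2 and 3 after a state change. */
static void toy_update(struct toy_acct *a, uint64_t now)
{
	if (!a->throttled)
		return;
	if (a->nr_limbo > 0)
		toy_start(a, TOY_OPT_FIRST_LIMBO, now);
	if (a->nr_limbo > 0 && a->nr_queued == 0)
		toy_start(a, TOY_OPT_ALL_LIMBO, now);
}

static void toy_acct_throttle(struct toy_acct *a, uint64_t now)
{
	a->throttled = true;
	if (a->nr_queued > 0)
		toy_start(a, TOY_OPT_ON_THROTTLE, now);
	toy_update(a, now);
}

/* A queued task returned to user and was moved to the limbo list. */
static void toy_acct_to_limbo(struct toy_acct *a, uint64_t now)
{
	if (a->nr_queued == 0)
		return;
	a->nr_queued--;
	a->nr_limbo++;
	toy_update(a, now);
}

/* A queued task blocked in the kernel instead of returning to user. */
static void toy_acct_block(struct toy_acct *a, uint64_t now)
{
	if (a->nr_queued == 0)
		return;
	a->nr_queued--;
	toy_update(a, now);
}

/* All three options stop accounting at unthrottle. */
static void toy_acct_unthrottle(struct toy_acct *a, uint64_t now)
{
	for (int i = 0; i < TOY_NR_OPTS; i++) {
		if (a->running[i]) {
			a->total[i] += now - a->start[i];
			a->running[i] = false;
		}
	}
	a->nr_queued += a->nr_limbo;
	a->nr_limbo = 0;
	a->throttled = false;
}
```

In this sketch, a cfs_rq with a single task that is throttled and then
blocks without returning to user accumulates throttled time only under
option 1, which is the drawback described for that option.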

I personally prefer option 2, which gives a more practical throttled time:
it tells us how long at least one task was actually throttled.

Thanks!

>
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000-cgroup/20K-runnable-task
> setup on a 2-socket/384-CPU AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
>
> Patches:
> Patch1 is preparation work;
>
> Patch2-3 provide the main functionality.
> Patch2 deals with throttle path: when a cfs_rq is to be throttled, mark
> throttled status for this cfs_rq and when tasks in throttled hierarchy
> get picked, add a task work to them so that when those tasks return to
> user space, the task work can throttle each task by dequeuing it and
> record this by adding the task to its cfs_rq's limbo list;
> Patch3 deals with unthrottle path: when a cfs_rq is to be unthrottled,
> enqueue back those tasks in limbo list;
>
> Patch4 deals with the dequeue path when a task changes group, sched class
> etc. A throttled task is dequeued in fair, but task->on_rq is
> still set, so when it changes task group or sched class or has an affinity
> setting change, core will first dequeue it. Since this task is
> already dequeued in fair class, this patch handles this situation.
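
The cover letter does not say how Patch4 handles this case, so the
following is only one plausible shape, as a userspace sketch rather than
kernel code. The toy_* names and counters are hypothetical: a dequeue
helper that notices the task is already off the runnable set and only
removes it from the limbo list.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct toy_task {
	bool on_rq;     /* core still considers the task queued */
	bool throttled; /* fair already dequeued it into limbo */
};

struct toy_cfs_rq {
	unsigned int nr_queued; /* tasks on the runnable set */
	unsigned int nr_limbo;  /* tasks on the limbo list */
};

/*
 * Called by core on a group, class or affinity change.
 * Returns true only if the runnable accounting was changed.
 */
static bool toy_dequeue(struct toy_cfs_rq *cfs_rq, struct toy_task *p)
{
	bool changed = false;

	if (!p->on_rq)
		return false;

	if (p->throttled) {
		/* Already dequeued in fair: just drop it from limbo. */
		cfs_rq->nr_limbo--;
		p->throttled = false;
	} else {
		cfs_rq->nr_queued--;
		changed = true;
	}
	p->on_rq = false;
	return changed;
}
```

The point of the sketch is that a throttled task must not be subtracted
from the runnable counters a second time.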
>
> Patch5-6 are cleanups. Some code is obsolete after switching to the task
> based throttle mechanism.
>
> Patch7 implements an alternative accounting mechanism for task based
> throttle.
>
> Changes since v1:
> - Move "add task work" from throttle time to pick time, suggested by
> Chengming Zhou;
> - Use scoped_guard() and cond_resched_tasks_rcu_qs() in
> throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
> suggested by K Prateek Nayak;
> - Fix h_nr_runnable accounting for delayed dequeue case when task based
> throttle is in use;
> - Implemented an alternative way of throttle time accounting for
> discussion purpose;
> - Make !CONFIG_CFS_BANDWIDTH build.
> I hope I didn't omit any feedback I've received, but feel free to let me
> know if I did.
>
> As in v1, all change logs are written by me and if they read badly, it's
> my fault.
>
> Comments are welcome.
>
> Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> use of more than one housekeeping cpu").
>
> Aaron Lu (4):
> sched/fair: Take care of group/affinity/sched_class change for
> throttled task
> sched/fair: get rid of throttled_lb_pair()
> sched/fair: fix h_nr_runnable accounting with per-task throttle
> sched/fair: alternative way of accounting throttle time
>
> Valentin Schneider (3):
> sched/fair: Add related data structure for task based throttle
> sched/fair: Handle throttle path for task based throttle
> sched/fair: Handle unthrottle path for task based throttle
>
> include/linux/sched.h | 4 +
> kernel/sched/core.c | 3 +
> kernel/sched/fair.c | 449 ++++++++++++++++++++++--------------------
> kernel/sched/sched.h | 7 +
> 4 files changed, 248 insertions(+), 215 deletions(-)
>