Message-Id: <20250520104110.3673059-1-ziqianlu@bytedance.com>
Date: Tue, 20 May 2025 18:41:03 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>,
Josh Don <joshdon@...gle.com>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>
Cc: linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>
Subject: [PATCH 0/7] Defer throttle when task exits to user
This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
Valentin has described the problem very well in the above link and I
quote:
"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
The linux-rt tree has had
1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
"
Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock
happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/
To fix this issue for PREEMPT_RT and improve the latency situation
for !PREEMPT_RT, change the throttle model to a task based one: when a
cfs_rq is throttled, mark its throttled status but do not remove it
from the CPU's rq. Instead, when tasks belonging to this cfs_rq get
picked, add a task work to them so that they get dequeued on return
to user space. This way, throttled tasks will not hold any kernel
resources. When the cfs_rq gets unthrottled, enqueue those throttled
tasks back.
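The snippet below illustrates the idea only and is not the actual
patch code; the names used (sched_throttle_work, throttle_node,
throttled_limbo_list, task_throttle_setup_work()) are placeholders,
not necessarily what the patches add:

/*
 * Sketch: runs via the resume-to-user task_work mechanism, i.e.
 * after the task has released all kernel resources.
 */
static void throttle_cfs_rq_work(struct callback_head *work)
{
        struct task_struct *p = container_of(work, struct task_struct,
                                             sched_throttle_work);
        struct cfs_rq *cfs_rq;
        struct rq_flags rf;
        struct rq *rq;

        rq = task_rq_lock(p, &rf);
        cfs_rq = cfs_rq_of(&p->se);

        /* Dequeue the task and park it on its cfs_rq's limbo list
         * until unthrottle. */
        if (throttled_hierarchy(cfs_rq)) {
                dequeue_task(rq, p, DEQUEUE_SLEEP);
                list_add(&p->throttle_node,
                         &cfs_rq->throttled_limbo_list);
        }
        task_rq_unlock(rq, p, &rf);
}

/* Sketch: called when a task in a throttled hierarchy is picked. */
static void task_throttle_setup_work(struct task_struct *p)
{
        init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
        task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}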
This new throttle model has consequences: e.g. for a cfs_rq that has
3 tasks attached, when 2 tasks are throttled on their return-to-user
path while one task is still running in kernel mode, this cfs_rq is in
a partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?
For the pelt clock, I chose to keep the current behavior and freeze
it at the cfs_rq's throttle time. The assumption is that tasks running
in kernel mode should not last too long; freezing the cfs_rq's pelt
clock keeps its load and its corresponding sched_entity's weight.
Hopefully, this results in a stable situation for the remaining
running tasks to quickly finish their jobs in kernel mode.
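For reference, mainline already keeps a frozen pelt clock for
throttled cfs_rqs; the snippet below is roughly what
kernel/sched/pelt.h does today (quoted from memory, so check the
tree for the authoritative version):

static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
        /* While throttled, the pelt clock stays pinned at the
         * moment of throttle. */
        if (unlikely(cfs_rq->throttle_count))
                return cfs_rq->throttled_clock_pelt -
                       cfs_rq->throttled_clock_pelt_time;

        /* Otherwise, discount the accumulated throttled time. */
        return rq_clock_pelt(rq_of(cfs_rq)) -
               cfs_rq->throttled_clock_pelt_time;
}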
For throttle time accounting, following RFC v2's feedback, rework it
for a cfs_rq as follows (a rough sketch comes after the list):
- start accounting when the first task in its hierarchy gets
  throttled;
- stop accounting on unthrottle.
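A minimal sketch of that accounting, assuming a per-cfs_rq
throttled_clock start stamp and an accumulated throttled_time
counter (names illustrative):

/* Called when the first task in this cfs_rq's hierarchy gets
 * throttled on its return-to-user path. */
static void record_throttle_clock(struct cfs_rq *cfs_rq)
{
        if (!cfs_rq->throttled_clock)
                cfs_rq->throttled_clock = rq_clock(rq_of(cfs_rq));
}

/* Called on unthrottle: fold the throttled period into the
 * accumulated total and reset the start stamp. */
static void account_throttle_time(struct cfs_rq *cfs_rq)
{
        if (!cfs_rq->throttled_clock)
                return;

        cfs_rq->throttled_time += rq_clock(rq_of(cfs_rq)) -
                                  cfs_rq->throttled_clock;
        cfs_rq->throttled_clock = 0;
}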
There is also the concern, raised in RFC v1, about increased duration
of (un)throttle operations. I've done some tests and, with a 2000
cgroups/20K runnable tasks setup on a 2-socket/384-CPU AMD server, the
longest duration of distribute_cfs_runtime() is in the 2ms-4ms range.
For details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the "task
work setup" from throttle time to pick time, it's not an issue
anymore.
Changes since RFC v2:
- drop RFC tag;
- Change throttle time accounting as suggested by Chengming. Prateek
  helped simplify the implementation;
- Re-organize patchset to avoid breaking bisect as suggested by Prateek.
Patches:
To avoid breaking bisect for CFS bandwidth functionality related
issues, the patchset is reorganized: the previous patch2-3 is split
into patch2-3 and patch5. Patches 2 and 3 mostly do what they did
before, except they retain the existing throttle functionality until
patch5, which switches the throttle model to task based. The end
result is: before patch5, the existing throttle model still works
correctly, and from patch5 on, the task based throttle model takes
over.
Details about each patch:
Patch1 is preparation work;
Patch2 prepares the throttle path for task based throttle: when a
cfs_rq is to be throttled, mark its throttled status, and when a task
in a throttled hierarchy gets picked, add a task work to it so that
when the task returns to user space, the task work can throttle it by
dequeuing it and record this by adding the task to its cfs_rq's limbo
list. To avoid breaking bisect, patch2 retains the existing throttle
behaviour in that the throttled hierarchy is still removed from the
runqueue; the actual behaviour change is done in patch5;
Patch3 prepares the unthrottle path for task based throttle: when a
cfs_rq is unthrottled, enqueue back the tasks on its limbo list, as
sketched below. This doesn't take effect yet, since patch2 does not
place any tasks on the limbo list;
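A rough sketch of that unthrottle step, again with illustrative
names (throttled_limbo_list, throttle_node):

/* On unthrottle, put the tasks that were dequeued on their
 * return-to-user path back on the runqueue. */
static void enqueue_throttled_tasks(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);
        struct task_struct *p, *tmp;

        list_for_each_entry_safe(p, tmp,
                                 &cfs_rq->throttled_limbo_list,
                                 throttle_node) {
                list_del_init(&p->throttle_node);
                enqueue_task(rq, p, ENQUEUE_WAKEUP);
        }
}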
Patch4 deals with the dequeue path when a task changes group, sched
class etc. A throttled task is dequeued in the fair class, but its
task->on_rq is still set, so when it changes task group or sched
class, or has its affinity setting changed, the core will first
dequeue it. Since the task is already dequeued in the fair class,
this patch handles that situation; a sketch follows.
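A minimal sketch of the idea, assuming a task_is_throttled() test
and a throttle_node list linkage (both illustrative, and the
dequeue_task_fair() signature may differ in the actual tree):

static bool dequeue_task_fair(struct rq *rq, struct task_struct *p,
                              int flags)
{
        if (unlikely(task_is_throttled(p))) {
                /*
                 * Already dequeued by the throttle task work; just
                 * take it off the limbo list so a later unthrottle
                 * won't enqueue it back behind the core's back.
                 */
                list_del_init(&p->throttle_node);
                return true;
        }

        return dequeue_entities(rq, &p->se, flags);
}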
Patch5 implements the actual throttle model change.
Patch6 implements throttle time accounting for this new task based
throttle model.
Patch7 is a cleanup after throttle model changed to task based.
All change logs were written by me; if they read badly, it's my fault.
Comments are welcome.
Base: tip/sched/core commit 676e8cf70cb0 ("sched,livepatch: Untangle
cond_resched() and live-patching")
Aaron Lu (3):
sched/fair: Take care of group/affinity/sched_class change for
throttled task
sched/fair: task based throttle time accounting
sched/fair: get rid of throttled_lb_pair()
Valentin Schneider (4):
sched/fair: Add related data structure for task based throttle
sched/fair: prepare throttle path for task based throttle
sched/fair: prepare unthrottle path for task based throttle
sched/fair: switch to task based throttle model
include/linux/sched.h | 4 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 403 ++++++++++++++++++++----------------------
kernel/sched/sched.h | 4 +
4 files changed, 199 insertions(+), 215 deletions(-)
--
2.39.5