Message-ID: <4efdc1a8-b624-4857-93cb-c40da6252983@intel.com>
Date: Sun, 17 Aug 2025 16:50:50 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: <linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, "Florian
Bezdeka" <florian.bezdeka@...mens.com>, Songtang Liu
<liusongtang@...edance.com>, Valentin Schneider <vschneid@...hat.com>, "Ben
Segall" <bsegall@...gle.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Peter Zijlstra" <peterz@...radead.org>, Chengming Zhou
<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
<mingo@...hat.com>, "Vincent Guittot" <vincent.guittot@...aro.org>, Xi Wang
<xii@...gle.com>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
On 7/15/2025 3:16 PM, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@...hat.com>
>
> In the current throttle model, when a cfs_rq is throttled, its entity is
> dequeued from the cpu's rq, making tasks attached to it unable to run,
> thus achieving the throttle target.
>
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken, it cannot run until its task group's
> next period comes, which can be a relatively long time. The waiting writer
> will have to wait longer because of this, and it also makes further readers
> build up, eventually triggering a task hung.
>
> To improve this situation, change the throttle model to a task based one,
> i.e. when a cfs_rq is throttled, record its throttled status but do not
> remove it from the cpu's rq. Instead, for tasks that belong to this cfs_rq,
> add a task work to them when they get picked, so that when they return
> to user space, they can be dequeued there. This way, throttled tasks will
> not hold any kernel resources. And on unthrottle, those tasks are enqueued
> back so they can continue to run.
>
> A throttled cfs_rq's PELT clock is handled differently now: previously the
> cfs_rq's PELT clock was stopped once it entered the throttled state, but
> since tasks (in kernel mode) can now continue to run, the behaviour is
> changed to stop the PELT clock only when the throttled cfs_rq has no tasks
> left.
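
(For reference, my mental model of the pick-time setup above, as a minimal
sketch assuming it lives in kernel/sched/fair.c where the needed headers are
in scope; the field name sched_throttle_work and the guard described in the
comment are my assumptions and may differ from the actual patch:)

static void task_throttle_setup_work(struct task_struct *p)
{
	/*
	 * Assumed: a callback_head embedded in task_struct for this
	 * purpose; the real code would also make sure the work is not
	 * queued twice for the same task.
	 */
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	/* run the callback when the task returns to user space */
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}
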
>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
> Suggested-by: Chengming Zhou <chengming.zhou@...ux.dev> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@...hat.com>
> Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
> ---
[snip]
> @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> {
> struct sched_entity *se;
> struct cfs_rq *cfs_rq;
> + struct task_struct *p;
> + bool throttled;
>
> again:
> cfs_rq = &rq->cfs;
> if (!cfs_rq->nr_queued)
> return NULL;
>
> + throttled = false;
> +
> do {
> /* Might not have done put_prev_entity() */
> if (cfs_rq->curr && cfs_rq->curr->on_rq)
> update_curr(cfs_rq);
>
> - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> - goto again;
> + throttled |= check_cfs_rq_runtime(cfs_rq);
>
> se = pick_next_entity(rq, cfs_rq);
> if (!se)
> @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> cfs_rq = group_cfs_rq(se);
> } while (cfs_rq);
>
> - return task_of(se);
> + p = task_of(se);
> + if (unlikely(throttled))
> + task_throttle_setup_work(p);
> + return p;
> }
>

Previously, I was wondering whether the above change might hurt wakeup
latency in some corner cases: if there are many tasks enqueued on a
throttled cfs_rq, the pick path above may repeatedly return a task that
cannot actually make progress (the picked p gets dequeued and a reschedule
is triggered in throttle_cfs_rq_work() to pick the next p, then the new p
is again found on the same throttled cfs_rq, and so on; see the sketch
below). Before this change, the whole cfs_rq's corresponding sched_entity
was dequeued at once in throttle_cfs_rq() via se = cfs_rq->tg->se[cpu].
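
My rough mental model of that per-task work, as a sketch only (the exact
names, locking and limbo-list handling are my assumptions and may differ
from the actual patch):

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);
	struct rq *rq = this_rq();

	/* the real code holds the rq lock before touching the rq */
	/* remove only this task; its siblings stay queued until picked */
	dequeue_task(rq, p, DEQUEUE_SLEEP);
	/* the series parks p somewhere so unthrottle can enqueue it back */
	resched_curr(rq);
}

So with N runnable tasks on one throttled cfs_rq we go through the
pick/dequeue/resched cycle N times, whereas the old model dequeued the
group se once.
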
So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
384 CPUs: I created 10 levels of cgroups and ran schbench in the leaf
cgroup. The results show no significant impact on wakeup latency
(considering the standard deviation). Based on the data and my
understanding, for this series,

Tested-by: Chen Yu <yu.c.chen@...el.com>

The test script's parameters are borrowed from the one attached
previously:
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "please provide cgroup level"
    exit
fi

N=$1
current_path="/sys/fs/cgroup"

# create an N-level-deep cgroup hierarchy; enable controllers on every
# non-leaf level
for ((i=1; i<=N; i++)); do
    new_dir="${current_path}/${i}"
    mkdir -p "$new_dir"
    if [ "$i" -ne "$N" ]; then
        echo '+cpu +memory +pids' > ${new_dir}/cgroup.subtree_control
    fi
    current_path="$new_dir"
done

echo "current_path:${current_path}"

# 1600000us quota per 100000us period, i.e. roughly 16 CPUs' worth of
# bandwidth on this 384-CPU machine, so the leaf cgroup gets throttled
echo "1600000 100000" > ${current_path}/cpu.max
echo "34G" > ${current_path}/memory.max
echo $$ > ${current_path}/cgroup.procs

#./run-mmtests.sh --no-monitor --config config-schbench baseline
./run-mmtests.sh --no-monitor --config config-schbench sch_throt

# move the remaining tasks back to the root cgroup, then tear down the
# hierarchy
pids=$(cat "${current_path}/cgroup.procs")
for pid in $pids; do
    echo $pid > "/sys/fs/cgroup/cgroup.procs" 2>/dev/null
done

for ((i=N; i>=1; i--)); do
    rmdir ${current_path}
    current_path=$(dirname "$current_path")
done
Results:

schbench thread = 1
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
//the baseline's std% is 35%, the change should not be a problem
Wakeup Latencies 99.0th   15.00(5.29)           17.00(1.00)           -13.33%
Request Latencies 99.0th  3830.67(33.31)        3854.67(25.72)        -0.63%
RPS 50.0th                1598.00(4.00)         1606.00(4.00)         +0.50%
Average RPS               1597.77(5.13)         1606.11(4.75)         +0.52%

schbench thread = 2
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   18.33(0.58)           18.67(0.58)           -1.85%
Request Latencies 99.0th  3868.00(49.96)        3854.67(44.06)        +0.34%
RPS 50.0th                3185.33(4.62)         3204.00(8.00)         +0.59%
Average RPS               3186.49(2.70)         3204.21(11.25)        +0.56%

schbench thread = 4
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   19.33(1.15)           19.33(0.58)           0.00%
Request Latencies 99.0th  35690.67(517.31)      35946.67(517.31)      -0.72%
RPS 50.0th                4418.67(18.48)        4434.67(9.24)         +0.36%
Average RPS               4414.38(16.94)        4436.02(8.77)         +0.49%

schbench thread = 8
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   22.67(0.58)           22.33(0.58)           +1.50%
Request Latencies 99.0th  73002.67(147.80)      72661.33(147.80)      +0.47%
RPS 50.0th                4376.00(16.00)        4392.00(0.00)         +0.37%
Average RPS               4373.89(15.04)        4393.88(6.22)         +0.46%

schbench thread = 16
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   29.00(2.65)           29.00(3.61)           0.00%
Request Latencies 99.0th  88704.00(0.00)        88704.00(0.00)        0.00%
RPS 50.0th                4274.67(24.44)        4290.67(9.24)         +0.37%
Average RPS               4277.27(24.80)        4287.97(9.80)         +0.25%

schbench thread = 32
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   100.00(22.61)         82.00(16.46)          +18.00%
Request Latencies 99.0th  100138.67(295.60)     100053.33(147.80)     +0.09%
RPS 50.0th                3942.67(20.13)        3916.00(42.33)        -0.68%
Average RPS               3919.39(19.01)        3892.39(42.26)        -0.69%

schbench thread = 63
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   94848.00(0.00)        94336.00(0.00)        +0.54%
//the baseline's std% is 19%, the change should not be a problem
Request Latencies 99.0th  264618.67(51582.78)   298154.67(591.21)     -12.67%
RPS 50.0th                2641.33(4.62)         2628.00(8.00)         -0.50%
Average RPS               2659.49(8.88)         2636.17(7.58)         -0.88%
thanks,
Chenyu