Message-ID: <4efdc1a8-b624-4857-93cb-c40da6252983@intel.com>
Date: Sun, 17 Aug 2025 16:50:50 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: <linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, "Florian
Bezdeka" <florian.bezdeka@...mens.com>, Songtang Liu
<liusongtang@...edance.com>, Valentin Schneider <vschneid@...hat.com>, "Ben
Segall" <bsegall@...gle.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Peter Zijlstra" <peterz@...radead.org>, Chengming Zhou
<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
<mingo@...hat.com>, "Vincent Guittot" <vincent.guittot@...aro.org>, Xi Wang
<xii@...gle.com>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
On 7/15/2025 3:16 PM, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@...hat.com>
>
> In the current throttle model, when a cfs_rq is throttled, its entity is
> dequeued from the cpu's rq, making tasks attached to it unable to run,
> thus achieving the throttle target.
>
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken, it cannot run until its task group's
> next period comes, which can be a relatively long time. The waiting writer
> will have to wait longer because of this, and it also makes further readers
> build up, eventually triggering a task hung.
>
> To improve this situation, change the throttle model to a task based one,
> i.e. when a cfs_rq is throttled, record its throttled status but do not
> remove it from the cpu's rq. Instead, for tasks that belong to this cfs_rq,
> add a task work to them when they get picked, so that when they return
> to user space, they can be dequeued there. This way, throttled tasks will
> not hold any kernel resources. And on unthrottle, those tasks are enqueued
> back so they can continue to run.
>
> A throttled cfs_rq's PELT clock is handled differently now: previously the
> cfs_rq's PELT clock was stopped once it entered the throttled state, but
> since tasks (in kernel mode) can now continue to run, the behaviour is
> changed to stop the PELT clock only when the throttled cfs_rq has no tasks
> left.
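
(For reference, my mental model of the pick-time setup above, as a minimal
sketch assuming it lives in kernel/sched/fair.c where the needed headers are
in scope; the field name sched_throttle_work and the guard described in the
comment are my assumptions and may differ from the actual patch:)

static void task_throttle_setup_work(struct task_struct *p)
{
	/*
	 * Assumed: a callback_head embedded in task_struct for this
	 * purpose; the real code would also make sure the work is not
	 * queued twice for the same task.
	 */
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	/* run the callback when the task returns to user space */
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}
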
>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
> Suggested-by: Chengming Zhou <chengming.zhou@...ux.dev> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@...hat.com>
> Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
> ---
[snip]
> @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> {
> struct sched_entity *se;
> struct cfs_rq *cfs_rq;
> + struct task_struct *p;
> + bool throttled;
>
> again:
> cfs_rq = &rq->cfs;
> if (!cfs_rq->nr_queued)
> return NULL;
>
> + throttled = false;
> +
> do {
> /* Might not have done put_prev_entity() */
> if (cfs_rq->curr && cfs_rq->curr->on_rq)
> update_curr(cfs_rq);
>
> - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> - goto again;
> + throttled |= check_cfs_rq_runtime(cfs_rq);
>
> se = pick_next_entity(rq, cfs_rq);
> if (!se)
> @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> cfs_rq = group_cfs_rq(se);
> } while (cfs_rq);
>
> - return task_of(se);
> + p = task_of(se);
> + if (unlikely(throttled))
> + task_throttle_setup_work(p);
> + return p;
> }
>

Previously, I was wondering whether the above change might hurt wakeup
latency in some corner cases: if there are many tasks enqueued on a
throttled cfs_rq, the pick path above may repeatedly return a task that
cannot actually make progress (the picked p gets dequeued and a reschedule
is triggered in throttle_cfs_rq_work() to pick the next p, then the new p
is again found on the same throttled cfs_rq, and so on; see the sketch
below). Before this change, the whole cfs_rq's corresponding sched_entity
was dequeued at once in throttle_cfs_rq() via se = cfs_rq->tg->se[cpu].
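
My rough mental model of that per-task work, as a sketch only (the exact
names, locking and limbo-list handling are my assumptions and may differ
from the actual patch):

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);
	struct rq *rq = this_rq();

	/* the real code holds the rq lock before touching the rq */
	/* remove only this task; its siblings stay queued until picked */
	dequeue_task(rq, p, DEQUEUE_SLEEP);
	/* the series parks p somewhere so unthrottle can enqueue it back */
	resched_curr(rq);
}

So with N runnable tasks on one throttled cfs_rq we go through the
pick/dequeue/resched cycle N times, whereas the old model dequeued the
group se once.
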
So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
384 CPUs: I created 10 levels of cgroups and ran schbench in the leaf
cgroup. The results show no significant impact on wakeup latency
(considering the standard deviation). Based on the data and my
understanding, for this series,

Tested-by: Chen Yu <yu.c.chen@...el.com>

The test script's parameters are borrowed from the one attached
previously:
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "please provide cgroup level"
    exit
fi

N=$1
current_path="/sys/fs/cgroup"

# create an N-level-deep cgroup hierarchy; enable controllers on every
# non-leaf level
for ((i=1; i<=N; i++)); do
    new_dir="${current_path}/${i}"
    mkdir -p "$new_dir"
    if [ "$i" -ne "$N" ]; then
        echo '+cpu +memory +pids' > ${new_dir}/cgroup.subtree_control
    fi
    current_path="$new_dir"
done

echo "current_path:${current_path}"

# 1600000us quota per 100000us period, i.e. roughly 16 CPUs' worth of
# bandwidth on this 384-CPU machine, so the leaf cgroup gets throttled
echo "1600000 100000" > ${current_path}/cpu.max
echo "34G" > ${current_path}/memory.max
echo $$ > ${current_path}/cgroup.procs

#./run-mmtests.sh --no-monitor --config config-schbench baseline
./run-mmtests.sh --no-monitor --config config-schbench sch_throt

# move the remaining tasks back to the root cgroup, then tear down the
# hierarchy
pids=$(cat "${current_path}/cgroup.procs")
for pid in $pids; do
    echo $pid > "/sys/fs/cgroup/cgroup.procs" 2>/dev/null
done

for ((i=N; i>=1; i--)); do
    rmdir ${current_path}
    current_path=$(dirname "$current_path")
done
Results:

schbench thread = 1
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
//the baseline's std% is 35%, the change should not be a problem
Wakeup Latencies 99.0th   15.00(5.29)           17.00(1.00)           -13.33%
Request Latencies 99.0th  3830.67(33.31)        3854.67(25.72)        -0.63%
RPS 50.0th                1598.00(4.00)         1606.00(4.00)         +0.50%
Average RPS               1597.77(5.13)         1606.11(4.75)         +0.52%

schbench thread = 2
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   18.33(0.58)           18.67(0.58)           -1.85%
Request Latencies 99.0th  3868.00(49.96)        3854.67(44.06)        +0.34%
RPS 50.0th                3185.33(4.62)         3204.00(8.00)         +0.59%
Average RPS               3186.49(2.70)         3204.21(11.25)        +0.56%

schbench thread = 4
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   19.33(1.15)           19.33(0.58)           0.00%
Request Latencies 99.0th  35690.67(517.31)      35946.67(517.31)      -0.72%
RPS 50.0th                4418.67(18.48)        4434.67(9.24)         +0.36%
Average RPS               4414.38(16.94)        4436.02(8.77)         +0.49%

schbench thread = 8
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   22.67(0.58)           22.33(0.58)           +1.50%
Request Latencies 99.0th  73002.67(147.80)      72661.33(147.80)      +0.47%
RPS 50.0th                4376.00(16.00)        4392.00(0.00)         +0.37%
Average RPS               4373.89(15.04)        4393.88(6.22)         +0.46%

schbench thread = 16
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   29.00(2.65)           29.00(3.61)           0.00%
Request Latencies 99.0th  88704.00(0.00)        88704.00(0.00)        0.00%
RPS 50.0th                4274.67(24.44)        4290.67(9.24)         +0.37%
Average RPS               4277.27(24.80)        4287.97(9.80)         +0.25%

schbench thread = 32
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   100.00(22.61)         82.00(16.46)          +18.00%
Request Latencies 99.0th  100138.67(295.60)     100053.33(147.80)     +0.09%
RPS 50.0th                3942.67(20.13)        3916.00(42.33)        -0.68%
Average RPS               3919.39(19.01)        3892.39(42.26)        -0.69%

schbench thread = 63
Metric                    Base (mean±std)       Compare (mean±std)    Change
-----------------------------------------------------------------------------
Wakeup Latencies 99.0th   94848.00(0.00)        94336.00(0.00)        +0.54%
//the baseline's std% is 19%, the change should not be a problem
Request Latencies 99.0th  264618.67(51582.78)   298154.67(591.21)     -12.67%
RPS 50.0th                2641.33(4.62)         2628.00(8.00)         -0.50%
Average RPS               2659.49(8.88)         2636.17(7.58)         -0.88%
thanks,
Chenyu