Message-ID: <2b2e91e7-0514-4728-a3a3-96282a0d4286@amd.com>
Date: Wed, 3 Sep 2025 14:41:55 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>, Valentin Schneider
<vschneid@...hat.com>
CC: Ben Segall <bsegall@...gle.com>, Peter Zijlstra <peterz@...radead.org>,
Chengming Zhou <chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>,
Ingo Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, <linux-kernel@...r.kernel.org>, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, Steven
Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, Florian
Bezdeka <florian.bezdeka@...mens.com>, Songtang Liu
<liusongtang@...edance.com>, Sebastian Andrzej Siewior
<bigeasy@...utronix.de>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Hello Aaron,
On 9/3/2025 12:44 PM, Aaron Lu wrote:
> On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
>> With this said, I reduced the task number and retested on this 2nd AMD
>> Genoa:
>> - quota set to 50 cpu for each level1 cgroup;
What exactly are the quota and period when you say 50 CPUs?
>> - using only 1 fd pair, i.e. 2 task for each cgroup:
>> hackbench -p -g 1 -f 1 -l 50000000
>> i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
>> task number is 2 * 200 = 400 tasks.
>>
>> base head diff
>> Time 127.77±2.60% 127.49±2.63% noise
>>
>> In this setup, performance is about the same.
>>
>> Now I'm wondering why on Intel EMR, running that extreme setup (8000
>> tasks), performance of task based throttle didn't see a noticeable drop...
>
> Looks like hackbench doesn't like task migration on this AMD system
> (domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
>
> If I revert patch5, running this 40 * 200 = 8000 hackbench workload
> again, performance is roughly the same now (head~1 is slightly worse but,
> given the 4+% stddev in base, it can be considered within the noise range):
>
> base head~1(patch1-4) diff head(patch1-5)
> Time 82.55±4.82% 84.45±2.70% -2.3% 99.69±6.71%
>
> According to /proc/schedstat, the lb_gained for domain2 is:
>
> NOT_IDLE IDLE NEWLY_IDLE
> base 0 8052 81791
> head~1 0 7197 175096
> head 1 14818 793065
Since these are mostly idle and newidle balance, I wonder if we can run
into a scenario where:
1. All the tasks are throttled.
2. A CPU turning idle does a newidle balance.
3. The CPU pulls a task from the throttled hierarchy and selects it.
4. The task exits to user space and is dequeued.
5. Goto 1.
and when the hierarchy is unthrottled, that CPU has a large number of
tasks on it that'll again require a load balance to even things out.
>
> Other domains show similar numbers: base has the fewest migrations,
> head has the most, and head~1 reduces the number a lot. I suppose
> this is expected: we removed the throttled_lb_pair() restriction
> in patch5, which can cause runnable tasks in a throttled hierarchy to
> be balanced to other cpus, while in base this cannot happen.
>
> I think patch5 still makes sense and is correct; it's just that this
> specific workload doesn't like task migrations. Intel EMR doesn't suffer
> from this, I suppose because EMR has a much larger LLC while AMD Genoa
> has a relatively small LLC, and task migrations across the LLC boundary
> hurt hackbench's performance.
I think we can leave the throttled_lb_pair() condition as is and revisit
it later if this is visible in real-world workloads. I cannot think of
an easy way to avoid the potential pileup without accounting for the
throttled tasks in limbo, except for something like the below on top of
head~1:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdc9bfa0b9ef..3dc807af21ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9385,7 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) throttled_lb_pair, or
+ * 2) throttled_lb_pair unless we migrate load, or
* 3) cannot be migrated to this CPU due to cpus_ptr, or
* 4) running (obviously), or
* 5) are cache-hot on their current CPU, or
@@ -9394,7 +9394,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
- if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
+ if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu) &&
+ env->migration_type != migrate_load)
return 0;
/*
---
Since load_avg moves slowly, it might be enough to avoid a pileup of
tasks. This is similar to the condition for migrating delayed tasks
above, but unlike the hierarchies of delayed tasks, the weight of a
throttled hierarchy does change when throttled tasks are transitioned to
limbo, so this needs some more staring at.
>
> I also tried to apply the below hack to prove this "task migration
> across the LLC boundary hurts hackbench" theory on both base and head:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c2..34c5f6b75e53d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9297,6 +9297,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
> return 0;
>
> + if (!(env->sd->flags & SD_SHARE_LLC))
> + return 0;
> +
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
> With this diff applied, the result is:
>
>
> base' head' diff
> Time 74.78±8.2% 78.87±15.4% -5.47%
>
> base': base + above diff
> head': head + above diff
>
> So both perform better now, but with much larger variance; I guess
> that's because there is no load balancing on domain2 and above now.
> head' is still worse than base', but not as much as before.
>
> To conclude: hackbench doesn't like task migration, especially when
> tasks are migrated across the LLC boundary. patch5 removed the
> restriction against balancing throttled tasks, which caused more
> balancing to happen, and hackbench doesn't like that. But balancing
> has its own merits and could still benefit other workloads, so I think
> patch5 should stay, especially considering that when throttled tasks
> are eventually dequeued, they will not stay on the rq's cfs_tasks
> list, so no special care is needed for them when doing load balance.
Mathieu ran some experiments a couple of years ago where he too
discovered that reducing the number of migrations helps hackbench, but
it wasn't clear whether those strategies would benefit real-world
workloads:
https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
https://lore.kernel.org/lkml/20231018204511.1563390-1-mathieu.desnoyers@efficios.com/
>
> On a side note: should we increase the cost of balancing tasks across
> the LLC boundary? I tried enlarging sysctl_sched_migration_cost 100x in
> task_hot() for domains without SD_SHARE_LLC, but that didn't help.
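For reference, I'm guessing the experiment looked roughly like the sketch
below against task_hot() (my reconstruction from your description, not
your actual diff; the 100x factor and the SD_SHARE_LLC check are taken
from the text above):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;

+	/* sketch: make tasks look "hot" far longer across LLC boundaries */
+	if (!(env->sd->flags & SD_SHARE_LLC))
+		return delta < (s64)(100 * sysctl_sched_migration_cost);
+
 	return delta < (s64)sysctl_sched_migration_cost;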
I'll take a look at sd->imbalance_pct and see if there is any room for
improvements there. Thank you again for the detailed analysis.
--
Thanks and Regards,
Prateek