Message-ID: <20250903101102.GB42@bytedance>
Date: Wed, 3 Sep 2025 18:11:02 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Hi Prateek,
On Wed, Sep 03, 2025 at 02:41:55PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/3/2025 12:44 PM, Aaron Lu wrote:
> > On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> >> With this said, I reduced the task number and retested on this 2nd AMD
> >> Genoa:
> >> - quota set to 50 cpu for each level1 cgroup;
>
> What exactly is the quota and period when you say 50cpu?
The period is the default 100000us and the quota is set to 5000000us.
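Just to spell the knob out: this is the plain cfs bandwidth setting. Assuming
cgroup v2 mounted at /sys/fs/cgroup and a level1 group called "g1" (the mount
point and the group name here are only for illustration), it is the
equivalent of:

  # 5000000us quota per 100000us period == 50 CPUs' worth
  echo "5000000 100000" > /sys/fs/cgroup/g1/cpu.max

done once for each level1 cgroup.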
>
> >> - using only 1 fd pair, i.e. 2 task for each cgroup:
> >> hackbench -p -g 1 -f 1 -l 50000000
> >> i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
> >> task number is 2 * 200 = 400 tasks.
> >>
> >>         base            head            diff
> >> Time    127.77±2.60%    127.49±2.63%    noise
> >>
> >> In this setup, performance is about the same.
> >>
> >> Now I'm wondering why on Intel EMR, running that extreme setup(8000
> >> tasks), performance of task based throttle didn't see noticeable drop...
> >
> > Looks like hackbench doesn't like task migration on this AMD system
> > (domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
> >
> > If I revert patch5, running this 40 * 200 = 8000 hackbench workload
> > again, performance is roughly the same now(head~1 is slightly worse but
> > given the 4+% stddev in base, it can be considered in noise range):
> >
> >         base           head~1(patch1-4)   diff     head(patch1-5)
> > Time    82.55±4.82%    84.45±2.70%        -2.3%    99.69±6.71%
> >
> > According to /proc/schedstat, the lb_gained for domain2 is:
> >
> >         NOT_IDLE    IDLE    NEWLY_IDLE
> > base           0    8052         81791
> > head~1         0    7197        175096
> > head           1   14818        793065
>
> Since these are mostly idle and newidle balance, I wonder if we can run
> into a scenario where,
>
> 1. All the tasks are throttled.
> 2. CPU turning idle does a newidle balance.
> 3. CPU pulls a task from the throttled hierarchy and selects it.
> 4. The task exits to user space and is dequeued.
> 5. Goto 1.
>
> and when the CPU is unthrottled, it has a large number of tasks on it
> that'll again require a load balance to even things out.
>
I think it is because we now allow balancing tasks under a throttled
hierarchy, and that made the balance numbers much larger.
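For anyone not staring at fair.c right now: throttled_lb_pair() is the check
that patch5 dropped. From memory, the mainline helpers look roughly like the
below (a sketch, not necessarily the exact current code); the point is that
the old check refused the migration when either the src or the dst cfs_rq was
inside a throttled hierarchy:

static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
	/* true if cfs_rq itself, or any of its ancestors, is throttled */
	return cfs_bandwidth_used() && cfs_rq->throttle_count;
}

static inline int throttled_lb_pair(struct task_group *tg,
				    int src_cpu, int dest_cpu)
{
	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;

	src_cfs_rq = tg->cfs_rq[src_cpu];
	dest_cfs_rq = tg->cfs_rq[dest_cpu];

	/* skip the migration if either end sits under a throttled cfs_rq */
	return throttled_hierarchy(src_cfs_rq) ||
	       throttled_hierarchy(dest_cfs_rq);
}

With that restriction gone, runnable-but-throttled tasks become fair game for
the load balancer, which matches the jump in lb_gained quoted above.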
> >
> > Other domains have similar numbers: base has the smallest migration
> > count while head has the most, and head~1 reduces the number a lot. I
> > suppose this is expected, because we removed the throttled_lb_pair()
> > restriction in patch5 and that can cause runnable tasks in a throttled
> > hierarchy to be balanced to other cpus, which cannot happen in base.
> >
> > I think patch5 still makes sense and is correct, it's just this specific
> > workload doesn't like task migrations. Intel EMR doesn't suffer from
> > this, I suppose that's because EMR has a much larger LLC while AMD Genoa
> > has a relatively small LLC and task migrations across LLC boundary hurts
> > hackbench's performance.
>
> I think we can leave the throttled_lb_pair() condition as is and revisit
> it later if this is visible in real world workloads. I cannot think of
> any easy way to avoid the case for potential pileup without accounting
> for the throttled tasks in limbo except for something like below at
> head~1:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bdc9bfa0b9ef..3dc807af21ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9385,7 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	/*
>  	 * We do not migrate tasks that are:
>  	 * 1) delayed dequeued unless we migrate load, or
> -	 * 2) throttled_lb_pair, or
> +	 * 2) throttled_lb_pair unless we migrate load, or
>  	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
>  	 * 4) running (obviously), or
>  	 * 5) are cache-hot on their current CPU, or
> @@ -9394,7 +9394,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
>  		return 0;
> 
> -	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> +	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu) &&
> +	    env->migration_type != migrate_load)
>  		return 0;
> 
>  	/*
> ---
>
> Since load_avg moves slowly, it might be enough to avoid pileup of
> tasks. This is similar to the condition for migrating delayed tasks
> above but unlike the hierarchies of delayed tasks, the weight of
> throttled hierarchy does change when throttled tasks are transitioned to
> limbo so this needs some more staring at.
>
I was thinking: should we disallow balancing a task to a throttled target
cfs_rq? In the task based throttle model, if a task is on the rq's cfs_tasks
list, it is allowed to run, so we should not check the src cfs_rq's throttle
status; but we should check whether the target cfs_rq is throttled, and if it
is, doing the balance is probably not very useful. I tried the below diff and
the performance is restored:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e1..3e927b9b7eeb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9370,6 +9370,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
+	if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
+		return 0;
+
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow

        base           head'          diff     head(patch1-5)
Time    82.55±4.82%    83.81±2.89%    -1.5%    99.69±6.71%

head': head + above diff
I also tested netperf on this AMD system, as well as hackbench and
netperf on Intel EMR; no obvious performance difference was observed
after applying the above diff, i.e. base and head' performance is
roughly the same.
Does the above diff make sense? One thing I'm slightly concerned about is
that there may be one case where balancing a task to a throttled target
cfs_rq makes sense: if the task holds some kernel resource and is running
inside the kernel, then even if its target cfs_rq is throttled, we still
want this task to go there and finish its job in kernel mode sooner, which
could help other resource waiters. But this may not be a big deal, and most
of the time balancing a task to a throttled cfs_rq doesn't look like a
meaningful thing to do.
Best regards,
Aaron