[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c15b1322-074d-b400-bc7b-2a0031862fdd@linux.vnet.ibm.com>
Date: Tue, 18 Jul 2023 20:22:24 +0530
From: Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>
To: Tobias Huschle <huschle@...ux.ibm.com>,
Tim Chen <tim.c.chen@...ux.intel.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ricardo Neri <ricardo.neri@...el.com>,
"Ravi V . Shankar" <ravi.v.shankar@...el.com>,
Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Len Brown <len.brown@...el.com>, Mel Gorman <mgorman@...e.de>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
Steven Rostedt <rostedt@...dmis.org>,
Valentin Schneider <vschneid@...hat.com>,
Ionela Voinescu <ionela.voinescu@....com>, x86@...nel.org,
linux-kernel@...r.kernel.org,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
naveen.n.rao@...ux.vnet.ibm.com,
Yicong Yang <yangyicong@...ilicon.com>,
Barry Song <v-songbaohua@...o.com>,
Chen Yu <yu.c.chen@...el.com>, Hillf Danton <hdanton@...a.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [Patch v3 1/6] sched/fair: Determine active load balance for SMT
sched groups
On 7/18/23 11:37 AM, Tobias Huschle wrote:
> On 2023-07-15 01:05, Tim Chen wrote:
>> On Fri, 2023-07-14 at 18:36 +0530, Shrikanth Hegde wrote:
>>
>>>
>>>
>>> If we consider symmetric platforms which have SMT4 such as power10.
>>> we have a topology like below. multiple such MC will form DIE(PKG)
>>>
>>>
>>> [0 2 4 6][1 3 5 7][8 10 12 14][9 11 13 15]
>>> [--SMT--][--SMT--][----SMT---][---SMT----]
>>> [--sg1--][--sg1--][---sg1----][---sg1----]
>>> [--------------MC------------------------]
>>>
>>> In case of SMT4, if there is any group which has 2 or more tasks, that
>>> group will be marked as group_smt_balance. previously, if that group
>>> had 2
>>> or 3 tasks, it would have been marked as group_has_spare. Since all
>>> the groups have
>>> SMT that means behavior would be same fully busy right? That can
>>> cause some
>>> corner cases. No?
>>
>> You raised a good point. I was looking from SMT2
>> perspective so group_smt_balance implies group_fully_busy.
>> That is no longer true for SMT4.
>>
>> I am thinking of the following fix on the current patch
>> to take care of SMT4. Do you think this addresses
>> concerns from you and Tobias?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 294a662c9410..3fc8d3a3bd22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9588,6 +9588,17 @@ static bool update_sd_pick_busiest(struct
>> lb_env *env,
>> break;
>>
>> case group_smt_balance:
>> + /* no idle cpus on both groups handled by
>> group_fully_busy below */
>> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>> + if (sgs->idle_cpus > busiest->idle_cpus)
>> + return false;
>> + if (sgs->idle_cpus < busiest->idle_cpus)
>> + return true;
>> + if (sgs->sum_nr_running <=
>> busiest_sum_nr_running)
>> + return false;
>> + else
>> + return true;
>> + }
>>
>>
>> I will be on vacation next three weeks so my response will be slow.
>>
>> Tim
>>
>
> What if the setup is asymmetric, where SMT2 and SMT4 would mix, e.g.
>
> [0 1][2 3 4 5]
> [SMT][--SMT--]
>
> If now CPUs 0,2,3 have a running task, both groups would be classified as
> smt_balance. But if it comes to the selection of the busiest group, the
> smaller
> group would be selected, as it has less idle CPUs, right? Which could lead
> to the smaller group being left with no tasks.
> Using the absolute numbers of task is what made the prefer_sibling path
> problematic,
Yes. But Not sure how realistic is that configuration. on power10, we typically
have all cores in either SMT1, SMT2 or SMT4. But not mixed configs.
One can offline a CPUs to get into that cases in SMT4.
> I would assume that the same holds true here. Therefore, I would prefer
> avg_load,
> or, similar to prefer_siblings, a ratio over the number of cores.
>
> I can't really test that on s390 as we always have SMT2. But, we can
> have these
> asymmetries on higher levels, e.g
IIUC, on higher levels, group will not have SD_SHARE_CPUCAPACITY, so it shouldn't
run into group_smt_balance.
>
> [0 1][2 3][4 5][6 7][8 9]
> [SMT][SMT][SMT][SMT][SMT]
> [-----core----][--core--]
>
> For large configurations this can be true for even higher levels.
> Therefore, the idea was to move the smt_balance state around and adapt its
> conditions to something like this (which would require to reorder the
> commits):
>
> @@ -8330,6 +8330,11 @@ enum fbq_type { regular, remote, all };
> enum group_type {
> /* The group has spare capacity that can be used to run more
> tasks. */
> group_has_spare = 0,
> + /*
> + * Balance SMT group that's fully busy. Can benefit from migration
> + * a task on SMT with busy sibling to another CPU on idle core.
> + */
> + group_smt_balance,
> /*
> * The group is fully used and the tasks don't compete for more CPU
> * cycles. Nevertheless, some tasks might wait before running.
> @@ -8340,11 +8345,6 @@ enum group_type {
> * more powerful CPU.
> */
> group_misfit_task,
> - /*
> - * Balance SMT group that's fully busy. Can benefit from migration
> - * a task on SMT with busy sibling to another CPU on idle core.
> - */
> - group_smt_balance,
> /*
> * SD_ASYM_PACKING only: One local CPU with higher capacity is
> available,
IIUC, for cluster topology of this patch, busiest group should be a SMT if it has 2
threads compared to an Atom cluster having 4 threads. Atom cluster will be group_fully_busy,
whereas SMT group will be group_smt_balance. For that to happen group_smt_balance should have
higher group_type.
> * and the task should be migrated to it instead of running on the
> @@ -9327,15 +9327,15 @@ group_type group_classify(unsigned int
> imbalance_pct,
> if (sgs->group_asym_packing)
> return group_asym_packing;
>
> - if (sgs->group_smt_balance)
> - return group_smt_balance;
> -
> if (sgs->group_misfit_task_load)
> return group_misfit_task;
>
> if (!group_has_capacity(imbalance_pct, sgs))
> return group_fully_busy;
>
> + if (sgs->group_smt_balance)
> + return group_smt_balance;
> +
> return group_has_spare;
> }
>
> @@ -9457,8 +9457,7 @@ static inline bool smt_balance(struct lb_env *env,
> struct sg_lb_stats *sgs,
> * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
> * will not be on.
> */
> - if (group->flags & SD_SHARE_CPUCAPACITY &&
> - sgs->sum_h_nr_running > 1)
> + if (sgs->sum_h_nr_running > group->cores)
In case of Power10, where we have SMT4, group->cores will be 1. I dont see
a difference here.
> return true;
>
> return false;
>
> The s390 problem is currently solved by changing the prefer_sibling
> path. When
> disabling that flag, we might have an issue, will have to verify that
> though.
>
>>>
>>> One example is Lets say sg1 has 4 tasks. and sg2 has 0 tasks and is
>>> trying to do
>>> load balance. Previously imbalance would have been 2, instead now
>>> imbalance would be 1.
>>> But in subsequent lb it would be balanced.
>>>
>>>
>>>
>>> > + return false;
>>> > +}
>>> > +
>>> > static inline bool
>>> > sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>> > {
>>> > @@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct
>>> lb_env *env,
>>> > sgs->group_asym_packing = 1;
>>> > }
>>> >
>>> > + /* Check for loaded SMT group to be balanced to dst CPU */
>>> > + if (!local_group && smt_balance(env, sgs, group))
>>> > + sgs->group_smt_balance = 1;
>>> > +
>>> > sgs->group_type = group_classify(env->sd->imbalance_pct,
>>> group, sgs);
>>> >
>>> > /* Computing avg_load makes sense only when group is
>>> overloaded */
>>> > @@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>> > return false;
>>> > break;
>>> >
>>> > + case group_smt_balance:
>>> > case group_fully_busy:
>>> > /*
>>> > * Select the fully busy group with highest avg_load. In
>>> > @@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct
>>> lb_env *env,
>>> > break;
>>> >
>>> > case group_has_spare:
>>> > + /*
>>> > + * Do not pick sg with SMT CPUs over sg with pure CPUs,
>>> > + * as we do not want to pull task off SMT core with one task
>>> > + * and make the core idle.
>>> > + */
>>> > + if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
>>> > + if (sg->flags & SD_SHARE_CPUCAPACITY &&
>>> sgs->sum_h_nr_running <= 1)
>>> > + return false;
>>> > + else
>>> > + return true;> + }
>>> > +
>>> > /*
>>> > * Select not overloaded group with lowest number of idle
>>> cpus
>>> > * and highest number of running tasks. We could also compare
>>> > @@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct
>>> sched_group *idlest,
>>> >
>>> > case group_imbalanced:
>>> > case group_asym_packing:
>>> > + case group_smt_balance:
>>> > /* Those types are not used in the slow wakeup path */
>>> > return false;
>>> >
>>> > @@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd,
>>> struct task_struct *p, int this_cpu)
>>> >
>>> > case group_imbalanced:
>>> > case group_asym_packing:
>>> > + case group_smt_balance:
>>> > /* Those type are not used in the slow wakeup path */
>>> > return NULL;
>>> >
>>> > @@ -10118,6 +10176,13 @@ static inline void
>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>> > return;
>>> > }
>>> >
>>> > + if (busiest->group_type == group_smt_balance) {
>>> > + /* Reduce number of tasks sharing CPU capacity */
>>> > + env->migration_type = migrate_task;
>>> > + env->imbalance = 1;
>>> > + return;
>>> > + }
>>> > +
>>> > if (busiest->group_type == group_imbalanced) {
>>> > /*
>>> > * In the group_imb case we cannot rely on group-wide
>>> averages
>>> > @@ -10363,16 +10428,23 @@ static struct sched_group
>>> *find_busiest_group(struct lb_env *env)
>>> > goto force_balance;
>>> >
>>> > if (busiest->group_type != group_overloaded) {
>>> > - if (env->idle == CPU_NOT_IDLE)
>>> > + if (env->idle == CPU_NOT_IDLE) {
>>> > /*
>>> > * If the busiest group is not overloaded (and as a
>>> > * result the local one too) but this CPU is already
>>> > * busy, let another idle CPU try to pull task.
>>> > */
>>> > goto out_balanced;
>>> > + }
>>> > +
>>> > + if (busiest->group_type == group_smt_balance &&
>>> > + smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
>>> > + /* Let non SMT CPU pull from SMT CPU sharing with
>>> sibling */
>>> > + goto force_balance;
>>> > + }
>>> >
>>> > if (busiest->group_weight > 1 &&
>>> > - local->idle_cpus <= (busiest->idle_cpus + 1))
>>> > + local->idle_cpus <= (busiest->idle_cpus + 1)) {
>>> > /*
>>> > * If the busiest group is not overloaded
>>> > * and there is no imbalance between this and busiest
>>> > @@ -10383,12 +10455,14 @@ static struct sched_group
>>> *find_busiest_group(struct lb_env *env)
>>> > * there is more than 1 CPU per group.
>>> > */
>>> > goto out_balanced;
>>> > + }
>>> >
>>> > - if (busiest->sum_h_nr_running == 1)
>>> > + if (busiest->sum_h_nr_running == 1) {
>>> > /*
>>> > * busiest doesn't have any tasks waiting to run
>>> > */
>>> > goto out_balanced;
>>> > + }
>>> > }
>>> >
>>> > force_balance:
Powered by blists - more mailing lists