[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2db9ab21-e1ce-694a-f509-5600f1190d75@bytedance.com>
Date: Tue, 10 May 2022 16:03:56 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Josh Don <joshdon@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel <linux-kernel@...r.kernel.org>,
Abel Wu <wuyun.abel@...edance.com>,
Chen Yu <yu.c.chen@...el.com>
Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS
Hi Josh,
On 5/10/22 9:14 AM, Josh Don Wrote:
> Hi Abel,
>
> Overall this looks good, just a couple of comments.
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..79b4ff24faee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>> {
>> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> - int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> + struct sched_domain_shared *sds = sd->shared;
>> + int nr, nro, weight = sd->span_weight;
>> + int i, cpu, idle_cpu = -1;
>> struct rq *this_rq = this_rq();
>> int this = smp_processor_id();
>> struct sched_domain *this_sd;
>> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> if (!this_sd)
>> return -1;
>>
>> + nro = atomic_read(&sds->nr_overloaded_cpus);
>> + if (nro == weight)
>> + goto out;
>
> This assumes that the sd we're operating on here is the LLC domain
> (true for current use). Perhaps to catch future bugs from changing
> this assumption, we could WARN_ON_ONCE(nro > weight).
The @sds comes from sd->shared, so I don't think the condition will
break once we operate at other level domains. But a quick check on
sds != NULL may be needed then since domains can have no sds attached.
>
>> +
>> + nr = min_t(int, weight, p->nr_cpus_allowed);
>> +
>> + /*
>> + * It's unlikely to find an idle cpu if the system is under
>> + * heavy pressure, so skip searching to save a few cycles
>> + * and relieve cache traffic.
>> + */
>> + if (weight - nro < (nr >> 4) && !has_idle_core)
>> + return -1;
>
> nit: nr / 16 is easier to read and the compiler will do the shifting for you.
Agreed.
>
> Was < intentional vs <= ? With <= you'll be able to skip the search in
> the case where both sides evaluate to 0 (can happen frequently if we
> have no idle cpus, and a task with a small affinity mask).
It's intentional, the idea is to unconditionally pass when there are
less than 16 cpus to search which seems scalability is not an issue.
But I made a mistake that (weight - nro) couldn't be 0 here, so it's
not appropriate to use "<".
BTW, I think Chen Yu's proposal[1] on search depth limitation is a
better idea and more reasonable. And he is doing some benchmark on
the mixture of our work.
[1]
https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@intel.com/
>
> This will also get a bit confused in the case where the task has many
> cpus allowed, but almost all of them on a different LLC than the one
> we're considering here. Apart from caching the per-LLC
> nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
> only do this in the !has_idle_core case to reduce calls to
> cpumask_weight()).
Yes the task might have many cpus allowed on another LLC, the idea is
to use @nr as a worst case boundary. And with Chen's work, I think we
can get rid of nr_cpus_allowed.
>
>> +
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> + if (nro > 1)
>> + cpumask_andnot(cpus, cpus, sdo_mask(sds));
>
> Just
> if (nro)
> ?
I think it's just not worthy to touch sdo_mask(sds) which causes heavy
cache traffic, if it only contains one cpu.
>
>> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>
>> update_avg(&this_sd->avg_scan_cost, time);
>> }
>> +out:
>> + if (has_idle_core)
>> + WRITE_ONCE(sds->has_idle_cores, 0);
>
> nit: use set_idle_cores() instead (or, if you really want to avoid the
> extra sds dereference, add a __set_idle_cores(sds, val) helper you can
> call directly.
OK, will do.
>
>> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>> continue;
>>
>> detach_task(p, env);
>> + update_overloaded_rq(env->src_rq);
>>
>> /*
>> * Right now, this is only the second place where
>> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
>> list_move(&p->se.group_node, tasks);
>> }
>>
>> + if (detached)
>> + update_overloaded_rq(env->src_rq);
>> +
>
> Thinking about this more, I don't see an issue with moving the
> update_overloaded_rq() calls to enqueue/dequeue_task, rather than here
> in the attach/detach_task paths. Overloaded state only changes when we
> pass the boundary of 2 runnable non-idle tasks, so thashing of the
> overloaded mask is a lot less worrisome than if it were updated on the
> boundary of 1 runnable task. The attach/detach_task paths run as part
> of load balancing, which can be on a millisecond time scale.
It's really hard to say which one is better, and I think it's more like
workload-specific. It's common in our cloud servers that a long running
workload co-exists with a short running workload which could flip the
status frequently.
Thanks & BR,
Abel
Powered by blists - more mailing lists