[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <091879a2-c5a0-2512-ce43-ae2efa869f5b@bytedance.com>
Date: Thu, 24 Feb 2022 22:36:39 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: "Gautham R. Shenoy" <gautham.shenoy@....com>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
linux-kernel@...r.kernel.org, srikar@...ux.vnet.ibm.com,
aubrey.li@...el.com, Abel Wu <wuyun.abel@...edance.com>
Subject: Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
Hi Gautham, thanks for your comment!
On 2/24/22 3:10 PM, Gautham R. Shenoy wrote:
> Hello Abel,
>
> (+ Aubrey Li, Srikar)
>
> On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
>> An CFS runqueue is considered overloaded when there are
>> more than one pullable non-idle tasks on it (since sched-
>> idle cpus are treated as idle cpus). And idle tasks are
>> counted towards rq->cfs.idle_h_nr_running, that is either
>> assigned SCHED_IDLE policy or placed under idle cgroups.
>>
>> The overloaded cfs rqs can cause performance issues to
>> both task types:
>>
>> - for latency critical tasks like SCHED_NORMAL,
>> time of waiting in the rq will increase and
>> result in higher pct99 latency, and
>>
>> - batch tasks may not be able to make full use
>> of cpu capacity if sched-idle rq exists, thus
>> presents poorer throughput.
>>
>> The mask of overloaded cpus is updated in periodic tick
>> and the idle path at the LLC domain basis. This cpumask
>> will also be used in SIS as a filter, improving idle cpu
>> searching.
>
> This is an interesting approach to minimise the tail latencies by
> keeping track of the overloaded cpus in the LLC so that
> idle/sched-idle CPUs can pull from them. This approach contrasts with the
> following approaches that were previously tried :
>
> 1. Maintain the idle cpumask at the LLC level by Aubrey Li
> https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
It's a similar approach from different sight in SIS. Both have pros and
cons, and I couldn't tell which one is more appropriate..
While since SIS can fail in finding one idle cpu due to scaling issues,
the sched-idle load balancing might be a valuable supplement to that to
consume the idle/sched-idle cpus.
>
> 2. Maintain the identity of the idle core itself at the LLC level, by Srikar :
> https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/
The efforts done by Srikar seems focused on idle core searching, which
has a different goal from my approach I think.
The case of short running tasks pointed out by Vincent should not be a
problem in updating the overloaded cpu mask/counter, since they are not
updated either when cpu becomes busy, or when cpu frequently goes idle
during a tick period.
>
> There have been concerns in the past about having to update the shared
> mask/counter at regular intervals. Srikar, Aubrey any thoughts on this
> ?
>
I'm afraid I didn't fully catch up with these loops, it is appreciated
if someone can shed some light, thanks!
>
>
>>
>> Signed-off-by: Abel Wu <wuyun.abel@...edance.com>
>> ---
>> include/linux/sched/topology.h | 10 ++++++++++
>> kernel/sched/core.c | 1 +
>> kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 6 ++++++
>> kernel/sched/topology.c | 4 +++-
>> 5 files changed, 63 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..03c9c81dc886 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -81,6 +81,16 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> +
>> + /*
>> + * The above varibles are used in idle path and
>> + * select_task_rq, and the following two are
>> + * mainly updated in tick. They are all hot but
>> + * for different usage, so start a new cacheline
>> + * to avoid false sharing.
>> + */
>> + atomic_t nr_overloaded ____cacheline_aligned;
>> + unsigned long overloaded[]; /* Must be last */
>> };
>>
>> struct sched_domain {
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 1d863d7f6ad7..a6da2998ec49 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9423,6 +9423,7 @@ void __init sched_init(void)
>> rq->wake_stamp = jiffies;
>> rq->wake_avg_idle = rq->avg_idle;
>> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>> + rq->overloaded = 0;
>>
>> INIT_LIST_HEAD(&rq->cfs_tasks);
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5c4bfffe8c2c..0a0438c3319b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>>
>> return newidle_balance(rq, rf) != 0;
>> }
>> +
>> +static inline int cfs_rq_overloaded(struct rq *rq)
>> +{
>> + return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
>> +}
>> +
>> +/* Must be called with rq locked */
>> +static void update_overload_status(struct rq *rq)
>> +{
>> + struct sched_domain_shared *sds;
>> + int overloaded = cfs_rq_overloaded(rq);
>> + int cpu = cpu_of(rq);
>> +
>> + lockdep_assert_rq_held(rq);
>> +
>> + if (rq->overloaded == overloaded)
>> + return;
>> +
>> + rcu_read_lock();
>> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> + if (unlikely(!sds))
>> + goto unlock;
>> +
>> + if (overloaded) {
>> + cpumask_set_cpu(cpu, sdo_mask(sds));
>> + atomic_inc(&sds->nr_overloaded);
>> + } else {
>> + cpumask_clear_cpu(cpu, sdo_mask(sds));
>> + atomic_dec(&sds->nr_overloaded);
>> + }
>> +
>> + rq->overloaded = overloaded;
>> +unlock:
>> + rcu_read_unlock();
>> +}
>> +
>> +#else
>> +
>> +static inline void update_overload_status(struct rq *rq) { }
>> +
>> #endif /* CONFIG_SMP */
>>
>> static unsigned long wakeup_gran(struct sched_entity *se)
>> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
>> if (new_tasks > 0)
>> goto again;
>>
>> + update_overload_status(rq);
>> +
>
> So here, we are calling update_overload_status() after
> newidle_balance(). If we had pulled a single task as a part of
> newidle_balance(), in your current code, we do not update the overload
> status. While this should get remedied in the next tick, should we
> move update_overload_status(rq) prior to the new_tasks > 0 check ?
A single task won't change the overloaded status :)
And I think it would be better not do an update even if pulled several
tasks because that would break the rate limit which is undesired.
Best Regards,
Abel
>
>
> --
> Thanks and Regards
> gautham.
Powered by blists - more mailing lists