Message-ID: <YhcvUV/jW7yr0Sn+@BLR-5CG11610CF.amd.com>
Date: Thu, 24 Feb 2022 12:40:09 +0530
From: "Gautham R. Shenoy" <gautham.shenoy@....com>
To: Abel Wu <wuyun.abel@...edance.com>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
linux-kernel@...r.kernel.org, srikar@...ux.vnet.ibm.com,
aubrey.li@...el.com
Subject: Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
Hello Abel,
(+ Aubrey Li, Srikar)
On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
> A CFS runqueue is considered overloaded when there is
> more than one pullable non-idle task on it (since sched-
> idle cpus are treated as idle cpus). Idle tasks are those
> counted towards rq->cfs.idle_h_nr_running, i.e. tasks that
> are either assigned the SCHED_IDLE policy or placed under
> idle cgroups.
>
> Overloaded cfs rqs can cause performance issues for
> both task types:
>
> - for latency-critical tasks like SCHED_NORMAL ones,
> the time spent waiting in the rq will increase,
> resulting in higher pct99 latency, and
>
> - batch tasks may not be able to make full use of
> cpu capacity if sched-idle rqs exist, thus
> yielding poorer throughput.
>
> The mask of overloaded cpus is updated in the periodic
> tick and in the idle path, on a per-LLC-domain basis.
> This cpumask will also be used in SIS as a filter,
> improving idle cpu searching.
This is an interesting approach to minimise tail latencies by
keeping track of the overloaded cpus in the LLC so that
idle/sched-idle CPUs can pull from them. It contrasts with the
following approaches that were tried previously:
1. Maintain the idle cpumask at the LLC level by Aubrey Li
https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
2. Maintain the identity of the idle core itself at the LLC level, by Srikar:
https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/
There have been concerns in the past about the overhead of having to
update a shared mask/counter at regular intervals. Srikar, Aubrey, any
thoughts on this?
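For context, the commit message says this cpumask will later be used as
a filter in SIS. I imagine the consumer side would look roughly like the
sketch below. This is illustrative only, not code from this series:
select_idle_cpu_filtered() is a made-up name, and sdo_mask() /
nr_overloaded come from the hunks quoted further down.

	/*
	 * Sketch: restrict the idle-CPU scan to CPUs that are not marked
	 * overloaded in the shared LLC mask. The caller is assumed to hold
	 * rcu_read_lock(), as select_task_rq_fair() does today.
	 */
	static int select_idle_cpu_filtered(struct task_struct *p,
					    struct sched_domain *sd, int target)
	{
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
		struct sched_domain_shared *sds;
		int cpu;

		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

		sds = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sds && atomic_read(&sds->nr_overloaded))
			cpumask_andnot(cpus, cpus, sdo_mask(sds));

		for_each_cpu_wrap(cpu, cpus, target) {
			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}

With the nr_overloaded check up front, the extra cpumask_andnot() is
only paid when at least one CPU in the LLC is actually overloaded.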
>
> Signed-off-by: Abel Wu <wuyun.abel@...edance.com>
> ---
> include/linux/sched/topology.h | 10 ++++++++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 6 ++++++
> kernel/sched/topology.c | 4 +++-
> 5 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..03c9c81dc886 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,16 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> +
> + /*
> + * The above variables are used in the idle path and
> + * in select_task_rq(), while the following two are
> + * mainly updated in the tick. They are all hot but
> + * serve different purposes, so start a new cacheline
> + * to avoid false sharing.
> + */
> + atomic_t nr_overloaded ____cacheline_aligned;
> + unsigned long overloaded[]; /* Must be last */
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1d863d7f6ad7..a6da2998ec49 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9423,6 +9423,7 @@ void __init sched_init(void)
> rq->wake_stamp = jiffies;
> rq->wake_avg_idle = rq->avg_idle;
> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> + rq->overloaded = 0;
>
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..0a0438c3319b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> return newidle_balance(rq, rf) != 0;
> }
> +
> +static inline int cfs_rq_overloaded(struct rq *rq)
> +{
> + return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
> +}
> +
> +/* Must be called with rq locked */
> +static void update_overload_status(struct rq *rq)
> +{
> + struct sched_domain_shared *sds;
> + int overloaded = cfs_rq_overloaded(rq);
> + int cpu = cpu_of(rq);
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (rq->overloaded == overloaded)
> + return;
> +
> + rcu_read_lock();
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (unlikely(!sds))
> + goto unlock;
> +
> + if (overloaded) {
> + cpumask_set_cpu(cpu, sdo_mask(sds));
> + atomic_inc(&sds->nr_overloaded);
> + } else {
> + cpumask_clear_cpu(cpu, sdo_mask(sds));
> + atomic_dec(&sds->nr_overloaded);
> + }
> +
> + rq->overloaded = overloaded;
> +unlock:
> + rcu_read_unlock();
> +}
> +
> +#else
> +
> +static inline void update_overload_status(struct rq *rq) { }
> +
> #endif /* CONFIG_SMP */
>
> static unsigned long wakeup_gran(struct sched_entity *se)
> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
> if (new_tasks > 0)
> goto again;
>
> + update_overload_status(rq);
> +
So here, we are calling update_overload_status() after
newidle_balance(). If we had pulled a single task as part of
newidle_balance(), the current code does not update the overload
status. While this should get remedied at the next tick, should we
move update_overload_status(rq) prior to the new_tasks > 0 check?
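Something along these lines, purely as an untested illustration of the
suggested ordering (the surrounding lines are my recollection of the
idle: path in pick_next_task_fair(), not part of this patch):

	new_tasks = newidle_balance(rq, rf);

	/*
	 * Record the overload state right after newidle_balance() so
	 * that it reflects any tasks that were just pulled, even when
	 * we immediately retry the pick via 'goto again'.
	 */
	update_overload_status(rq);

	if (new_tasks < 0)
		return RETRY_TASK;

	if (new_tasks > 0)
		goto again;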
--
Thanks and Regards
gautham.