Message-ID: <YhcvUV/jW7yr0Sn+@BLR-5CG11610CF.amd.com>
Date: Thu, 24 Feb 2022 12:40:09 +0530
From: "Gautham R. Shenoy" <gautham.shenoy@....com>
To: Abel Wu <wuyun.abel@...edance.com>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
linux-kernel@...r.kernel.org, srikar@...ux.vnet.ibm.com,
aubrey.li@...el.com
Subject: Re: [RFC PATCH 1/5] sched/fair: record overloaded cpus
Hello Abel,
(+ Aubrey Li, Srikar)
On Thu, Feb 17, 2022 at 11:43:57PM +0800, Abel Wu wrote:
> A CFS runqueue is considered overloaded when there is
> more than one pullable non-idle task on it (since sched-
> idle cpus are treated as idle cpus). Idle tasks are those
> counted towards rq->cfs.idle_h_nr_running, i.e. tasks that
> are either assigned the SCHED_IDLE policy or placed under
> idle cgroups.
>
> Overloaded cfs rqs can cause performance issues for
> both task types:
>
> - for latency-critical tasks like SCHED_NORMAL ones,
> the time spent waiting in the rq will increase,
> resulting in higher pct99 latency, and
>
> - batch tasks may not be able to make full use of
> cpu capacity if sched-idle rqs exist, thus
> yielding poorer throughput.
>
> The mask of overloaded cpus is updated in the periodic
> tick and in the idle path, on a per-LLC-domain basis.
> This cpumask will also be used in SIS as a filter,
> improving idle cpu searching.
This is an interesting approach to minimise tail latencies by
keeping track of the overloaded cpus in the LLC so that
idle/sched-idle CPUs can pull from them. It contrasts with the
following approaches that were tried previously:
1. Maintain the idle cpumask at the LLC level by Aubrey Li
https://lore.kernel.org/all/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
2. Maintain the identity of the idle core itself at the LLC level, by Srikar:
https://lore.kernel.org/lkml/20210513074027.543926-3-srikar@linux.vnet.ibm.com/
There have been concerns in the past about the overhead of having to
update a shared mask/counter at regular intervals. Srikar, Aubrey, any
thoughts on this?
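For context, the commit message says this cpumask will later be used as
a filter in SIS. I imagine the consumer side would look roughly like the
sketch below. This is illustrative only, not code from this series:
select_idle_cpu_filtered() is a made-up name, and sdo_mask() /
nr_overloaded come from the hunks quoted further down.

	/*
	 * Sketch: restrict the idle-CPU scan to CPUs that are not marked
	 * overloaded in the shared LLC mask. The caller is assumed to hold
	 * rcu_read_lock(), as select_task_rq_fair() does today.
	 */
	static int select_idle_cpu_filtered(struct task_struct *p,
					    struct sched_domain *sd, int target)
	{
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
		struct sched_domain_shared *sds;
		int cpu;

		cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

		sds = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sds && atomic_read(&sds->nr_overloaded))
			cpumask_andnot(cpus, cpus, sdo_mask(sds));

		for_each_cpu_wrap(cpu, cpus, target) {
			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}

With the nr_overloaded check up front, the extra cpumask_andnot() is
only paid when at least one CPU in the LLC is actually overloaded.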
>
> Signed-off-by: Abel Wu <wuyun.abel@...edance.com>
> ---
> include/linux/sched/topology.h | 10 ++++++++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 43 ++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 6 ++++++
> kernel/sched/topology.c | 4 +++-
> 5 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..03c9c81dc886 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,6 +81,16 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> +
> + /*
> + * The above variables are used in the idle path and
> + * in select_task_rq(), while the following two are
> + * mainly updated in the tick. They are all hot but
> + * serve different purposes, so start a new cacheline
> + * to avoid false sharing.
> + */
> + atomic_t nr_overloaded ____cacheline_aligned;
> + unsigned long overloaded[]; /* Must be last */
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1d863d7f6ad7..a6da2998ec49 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9423,6 +9423,7 @@ void __init sched_init(void)
> rq->wake_stamp = jiffies;
> rq->wake_avg_idle = rq->avg_idle;
> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
> + rq->overloaded = 0;
>
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..0a0438c3319b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6968,6 +6968,46 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> return newidle_balance(rq, rf) != 0;
> }
> +
> +static inline int cfs_rq_overloaded(struct rq *rq)
> +{
> + return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
> +}
> +
> +/* Must be called with rq locked */
> +static void update_overload_status(struct rq *rq)
> +{
> + struct sched_domain_shared *sds;
> + int overloaded = cfs_rq_overloaded(rq);
> + int cpu = cpu_of(rq);
> +
> + lockdep_assert_rq_held(rq);
> +
> + if (rq->overloaded == overloaded)
> + return;
> +
> + rcu_read_lock();
> + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> + if (unlikely(!sds))
> + goto unlock;
> +
> + if (overloaded) {
> + cpumask_set_cpu(cpu, sdo_mask(sds));
> + atomic_inc(&sds->nr_overloaded);
> + } else {
> + cpumask_clear_cpu(cpu, sdo_mask(sds));
> + atomic_dec(&sds->nr_overloaded);
> + }
> +
> + rq->overloaded = overloaded;
> +unlock:
> + rcu_read_unlock();
> +}
> +
> +#else
> +
> +static inline void update_overload_status(struct rq *rq) { }
> +
> #endif /* CONFIG_SMP */
>
> static unsigned long wakeup_gran(struct sched_entity *se)
> @@ -7315,6 +7355,8 @@ done: __maybe_unused;
> if (new_tasks > 0)
> goto again;
>
> + update_overload_status(rq);
> +
So here, we are calling update_overload_status() after
newidle_balance(). If we had pulled a single task as part of
newidle_balance(), the current code does not update the overload
status. While this should get remedied at the next tick, should we
move update_overload_status(rq) prior to the new_tasks > 0 check?
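Something along these lines, purely as an untested illustration of the
suggested ordering (the surrounding lines are my recollection of the
idle: path in pick_next_task_fair(), not part of this patch):

	new_tasks = newidle_balance(rq, rf);

	/*
	 * Record the overload state right after newidle_balance() so
	 * that it reflects any tasks that were just pulled, even when
	 * we immediately retry the pick via 'goto again'.
	 */
	update_overload_status(rq);

	if (new_tasks < 0)
		return RETRY_TASK;

	if (new_tasks > 0)
		goto again;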
--
Thanks and Regards
gautham.