lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 19 Oct 2023 19:35:40 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        "Mel Gorman" <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Swapnil Sapkal <Swapnil.Sapkal@....com>,
        Aaron Lu <aaron.lu@...el.com>, Tim Chen <tim.c.chen@...el.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>, <x86@...nel.org>
Subject: Re: [RFC PATCH 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature

On 2023-10-18 at 16:45:10 -0400, Mathieu Desnoyers wrote:
> Introduce the UTIL_FITS_CAPACITY scheduler feature. The runqueue
> selection picks the previous, target, or recent runqueues if they have
> enough remaining capacity to enqueue the task before scanning for an
> idle cpu.
> 
> This feature is introduced in preparation for the SELECT_BIAS_PREV
> scheduler feature. Its performance benefits are noticeable when combined
> with the SELECT_BIAS_PREV feature.
> 
> The following benchmarks only cover the UTIL_FITS_CAPACITY feature.
> Those are performed on a v6.5.5 kernel with mitigations=off.
> 
> The following hackbench workload on a 192 cores AMD EPYC 9654 96-Core
> Processor (over 2 sockets) keeps relatively the same wall time (49s).
> 
> hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
> 
> We can observe that the number of migrations is reduced significantly
> with this patch (improvement):
> 
> Baseline:      117M cpu-migrations  (9.355 K/sec)
> With patch:     67M cpu-migrations  (5.470 K/sec)
> 
> The task-clock utilization is reduced (degradation):
> 
> Baseline:      253.275 CPUs utilized
> With patch:    223.130 CPUs utilized
> 
> The number of context-switches is increased (degradation):
> 
> Baseline:      445M context-switches (35.516 K/sec)
> With patch:    581M context-switches (47.548 K/sec)
> 
> So the improvement due to reduction of migrations is countered by the
> degradation in CPU utilization and context-switches. The following
> SELECT_BIAS_PREV feature will address this.
> 
> Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
> Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
> Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/cover.1695704179.git.yu.c.chen@intel.com/
> Link: https://lore.kernel.org/lkml/20230929183350.239721-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/
> Link: https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Swapnil Sapkal <Swapnil.Sapkal@....com>
> Cc: Aaron Lu <aaron.lu@...el.com>
> Cc: Chen Yu <yu.c.chen@...el.com>
> Cc: Tim Chen <tim.c.chen@...el.com>
> Cc: K Prateek Nayak <kprateek.nayak@....com>
> Cc: Gautham R . Shenoy <gautham.shenoy@....com>
> Cc: x86@...nel.org
> ---
>  kernel/sched/fair.c     | 49 ++++++++++++++++++++++++++++++++++++-----
>  kernel/sched/features.h |  6 +++++
>  kernel/sched/sched.h    |  5 +++++
>  3 files changed, 54 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d9c2482c5a3..8058058afb11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4497,6 +4497,37 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>  	trace_sched_util_est_se_tp(&p->se);
>  }
>  
> +/*
> + * Returns true if adding the task utilization to the estimated
> + * utilization of the runnable tasks on @cpu does not exceed the
> + * capacity of @cpu.
> + *
> + * This considers only the utilization of _runnable_ tasks on the @cpu
> + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> + * using the runqueue util_est.enqueued, and by estimating the capacity
> + * of @cpu based on arch_scale_cpu_capacity and arch_scale_thermal_pressure
> + * rather than capacity_of() because capacity_of() considers
> + * blocked/sleeping tasks in other scheduler classes.
> + *
> + * The utilization vs capacity comparison is done without the margin
> + * provided by fits_capacity(), because fits_capacity() is used to
> + * validate whether the utilization of a task fits within the overall
> + * capacity of a cpu, whereas this function validates whether the task
> + * utilization fits within the _remaining_ capacity of the cpu, which is
> + * more precise.
> + */
> +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> +						    int cpu)
> +{
> +	unsigned long total_util, capacity;
> +
> +	if (!sched_util_fits_capacity_active())
> +		return false;
> +	total_util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_est.enqueued) + task_util;
> +	capacity = arch_scale_cpu_capacity(cpu) - arch_scale_thermal_pressure(cpu);

scale_rt_capacity(cpu) could provide the remaining cpu capacity after substracted by
the side activity(rt tasks/thermal pressure/irq time), maybe it would be more accurate?

> +	return total_util <= capacity;
> +}
> +
>  static inline int util_fits_cpu(unsigned long util,
>  				unsigned long uclamp_min,
>  				unsigned long uclamp_max,
> @@ -7124,12 +7155,15 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	int i, recent_used_cpu;
>  
>  	/*
> -	 * On asymmetric system, update task utilization because we will check
> -	 * that the task fits with cpu's capacity.
> +	 * With the UTIL_FITS_CAPACITY feature and on asymmetric system,
> +	 * update task utilization because we will check that the task
> +	 * fits with cpu's capacity.
>  	 */
> -	if (sched_asym_cpucap_active()) {
> +	if (sched_util_fits_capacity_active() || sched_asym_cpucap_active()) {
>  		sync_entity_load_avg(&p->se);
>  		task_util = task_util_est(p);
> +	}
> +	if (sched_asym_cpucap_active()) {
>  		util_min = uclamp_eff_value(p, UCLAMP_MIN);
>  		util_max = uclamp_eff_value(p, UCLAMP_MAX);
>  	}
> @@ -7139,7 +7173,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	 */
>  	lockdep_assert_irqs_disabled();
>  
> -	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> +	if ((available_idle_cpu(target) || sched_idle_cpu(target) ||
> +	    task_fits_remaining_cpu_capacity(task_util, target)) &&

Compared to the previous version posted here[1], when the cpu's util_est is lower than 25% of CPU
capacity we choose the previous CPU, current version seems to be more aggressive.
it is possible that a short running task is queued on the near 100% busy cpu while there
is still an idle cpu in the system.

https://lore.kernel.org/lkml/20231017221204.1535774-1-mathieu.desnoyers@efficios.com/

thanks,
Chenyu

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ