Message-ID: <xm26d1xbader.fsf@sword-of-the-dawn.mtv.corp.google.com>
Date:	Mon, 21 Sep 2015 10:30:04 -0700
From:	bsegall@...gle.com
To:	Yuyang Du <yuyang.du@...el.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steve Muckle <steve.muckle@...aro.org>,
	"mingo\@redhat.com" <mingo@...hat.com>,
	"daniel.lezcano\@linaro.org" <daniel.lezcano@...aro.org>,
	"mturquette\@baylibre.com" <mturquette@...libre.com>,
	"rjw\@rjwysocki.net" <rjw@...ysocki.net>,
	Juri Lelli <Juri.Lelli@....com>,
	"sgurrappadi\@nvidia.com" <sgurrappadi@...dia.com>,
	"pang.xunlei\@zte.com.cn" <pang.xunlei@....com.cn>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig

Yuyang Du <yuyang.du@...el.com> writes:

> On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
>> 
>> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
>> > mess?
>> 
>> I suspect his patch will fail to compile on ARM which uses
>> SCHED_CAPACITY_* outside of kernel/sched/*.
>> 
>> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
>> things I'm not opposed to it.
>
> Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h.
>
> With this, I think the code still needs some cleanup and, importantly,
> documentation.
>
> But first, I think load_sum and load_avg can accommodate NICE_0_LOAD at either high
> or low resolution. So we have no reason to keep load_avg at low resolution (10 bits)
> when NICE_0_LOAD has high resolution (20 bits), because load_avg = runnable% * load,
> as opposed to the current load_avg = runnable% * scale_load_down(load).
>
> Shall we get rid of all scale_load_down() for the runnable load average?
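
A quick worked example of the resolution point above (an illustration with
hypothetical numbers, not kernel code): a nice-0 task that is 50% runnable,
under the two conventions being compared.

    /* Worked example only: runnable% is tracked with 10-bit resolution
     * (1024 == 100%), so 50% runnable is 512. */
    #define NICE_0_LOAD_HI  (1UL << 20)     /* high-resolution load (20 bits) */
    #define NICE_0_LOAD_LO  (1UL << 10)     /* low-resolution load (10 bits) */
    #define RUNNABLE_HALF   512UL           /* 50% of 1024 */

    /* load_avg = runnable% * load, scaled back down by the runnable% unit */
    unsigned long load_avg_hi = (RUNNABLE_HALF * NICE_0_LOAD_HI) >> 10; /* 524288 */
    unsigned long load_avg_lo = (RUNNABLE_HALF * NICE_0_LOAD_LO) >> 10; /*    512 */

Keeping load_avg at high resolution simply means the extra scale_load_down()
never happens, which is the "load_avg = runnable% * load" form above.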

Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, i.e. barely within a
32-bit unsigned long, but LOAD_AVG_MAX * MAX_SHARES already overflows on
32-bit (even with the old code, in fact). This should probably be
fixed... somehow (dividing load_sum by 4 on 32-bit would work, though it
would be ugly. Reducing MAX_SHARES by 2 bits on 32-bit might have made
sense, but it would be a weird difference between 32-bit and 64-bit, and
could break userspace anyway, so it's presumably too late for that).

64-bit has ~30 bits free, so this would be fine as long as
SCHED_LOAD_RESOLUTION (SLR) is 0 on 32-bit.
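
For reference, a minimal userspace sketch of that arithmetic (assuming the
constants of the time: LOAD_AVG_MAX = 47742, prio_to_weight[0] = 88761, and
MAX_SHARES = 1 << 18; plain C, not kernel code):

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed constants, matching the values discussed in this thread. */
    #define LOAD_AVG_MAX    47742ULL    /* max of the decayed geometric sum */
    #define WEIGHT_NICE_M20 88761ULL    /* prio_to_weight[0], i.e. nice -20 */
    #define MAX_SHARES      (1ULL << 18)

    int main(void)
    {
            uint64_t a = LOAD_AVG_MAX * WEIGHT_NICE_M20;    /* 4237627662 */
            uint64_t b = LOAD_AVG_MAX * MAX_SHARES;         /* 12515278848 */

            printf("LOAD_AVG_MAX * prio_to_weight[0] = %llu (%s)\n",
                   (unsigned long long)a,
                   a <= UINT32_MAX ? "fits in 32 bits" : "overflows 32 bits");
            printf("LOAD_AVG_MAX * MAX_SHARES        = %llu (%s)\n",
                   (unsigned long long)b,
                   b <= UINT32_MAX ? "fits in 32 bits" : "overflows 32 bits");
            return 0;
    }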

>
> --
>
> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
>  definition
>
> The metrics need a certain resolution to determine how much detail we
> can capture (i.e. how much detail is lost to integer rounding), which also
> determines the range of the metrics.
>
> For instance, to increase the resolution of the range [0, 1] (two levels),
> one can multiply by 1024 and get [0, 1024] (1025 levels).
>
> In sched/fair, a few metrics depend on the resolution: load/load_avg,
> util_avg, and capacity (frequency adjustment). In order to reduce the
> risk of making mistakes related to resolution/range, we generalize the
> resolution by defining a basic resolution constant, and then express all
> metrics in terms of that basic resolution. The basic resolution is 1024,
> or (1 << 10). Further, one can recursively apply the basic resolution to
> increase the final resolution.
>
> As pointed out by Ben Segall, NICE_0's weight (visible to the user) and
> its load have independent resolutions, but they must be well calibrated.
>
> Signed-off-by: Yuyang Du <yuyang.du@...el.com>
> ---
>  include/linux/sched.h |  9 ++++++---
>  kernel/sched/fair.c   |  4 ----
>  kernel/sched/sched.h  | 15 ++++++++++-----
>  3 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bd38b3e..9b86f79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -909,10 +909,13 @@ enum cpu_idle_type {
>  	CPU_MAX_IDLE_TYPES
>  };
>  
> +# define SCHED_RESOLUTION_SHIFT	10
> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> +
>  /*
>   * Increase resolution of cpu_capacity calculations
>   */
> -#define SCHED_CAPACITY_SHIFT	10
> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>  
>  /*
> @@ -1180,8 +1183,8 @@ struct load_weight {
>   * 1) load_avg factors frequency scaling into the amount of time that a
>   * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
>   * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency and cpu scaling into the amount of time
> - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
> + * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
> + * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
>   * For cfs_rq, it is the aggregated such times of all runnable and
>   * blocked sched_entities.
>   * The 64 bit load_sum can:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..c61fd8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n)
>  	return contrib + runnable_avg_yN_sum[n];
>  }
>  
> -#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> -#error "load tracking assumes 2^10 as unit"
> -#endif
> -
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3845a71..31b4022 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>   * increased costs.
>   */
>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> -# define SCHED_LOAD_RESOLUTION	10
> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>  #else
> -# define SCHED_LOAD_RESOLUTION	0
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
>  #endif
>  
> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>  
> +/*
> + * NICE_0's weight (visible to user) and its load (invisible to user) have
> + * independent resolution, but they should be well calibrated. We use scale_load()
> + * and scale_load_down(w) to convert between them, the following must be true:
> + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> + */
>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
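
As an aside, the calibration requirement spelled out in the new comment can be
checked at compile time. A minimal standalone sketch (not part of the patch),
taking the 64-bit branch where SCHED_LOAD_SHIFT is 2 * SCHED_RESOLUTION_SHIFT
and using prio_to_weight[20] == 1024 (the nice-0 entry):

    /* Standalone illustration of the invariant; definitions copied from the
     * proposed patch, nice-0 weight hard-coded for the example. */
    #define SCHED_RESOLUTION_SHIFT  10
    #define SCHED_LOAD_SHIFT        (SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
    #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)
    #define NICE_0_LOAD             SCHED_LOAD_SCALE
    #define scale_load(w)           ((w) << SCHED_RESOLUTION_SHIFT)
    #define scale_load_down(w)      ((w) >> SCHED_RESOLUTION_SHIFT)

    #define NICE_0_WEIGHT           1024    /* prio_to_weight[20] */

    _Static_assert(scale_load(NICE_0_WEIGHT) == NICE_0_LOAD,
                   "scale_load(prio_to_weight[20]) must equal NICE_0_LOAD");
    _Static_assert(scale_load_down(NICE_0_LOAD) == NICE_0_WEIGHT,
                   "and the round trip must return the user-visible weight");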

I still think tying the scale_load shift to the SCHED_CAPACITY/etc. shift
is silly, and tying NICE_0_LOAD/SHIFT in as well is worse. Honestly, if I
were going to change anything, it would be to define NICE_0_LOAD/SHIFT
entirely separately from SCHED_LOAD_SCALE/SHIFT.

However, I'm not sure whether calculate_imbalance's use of SCHED_LOAD_SCALE
is actually a separate use of 1024*SLR-as-a-percentage, or whether it is
basically assuming most tasks are nice-0, or something else. It sure
/looks/ like it's comparing values with different units: it computes
(nr_running * CONST - group_capacity) and compares that against load, so
it looks like both (i.e. increasing the load.weight of everything on your
system by X% would change load balancer behavior here).

Given that, it might make sense to make it clear that capacity units and
nice-0-task units have to be the same thing due to load balancer
approximations (though they are still entirely separate from the
SCHED_LOAD_RESOLUTION multiplier).
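
To make the units question concrete, here is a simplified, hypothetical
paraphrase of the pattern in question (illustrative names and arithmetic
only, not the kernel's actual calculate_imbalance()):

    #define SCHED_LOAD_SCALE        (1UL << 10)     /* assuming an SLR == 0 build */

    /*
     * Hypothetical sketch: group_capacity (capacity units) is subtracted from
     * nr_running * SCHED_LOAD_SCALE, and the result is then bounded by
     * group_load (load units). This only behaves consistently if "one CPU of
     * capacity" and "one nice-0 task of load" are calibrated to the same value.
     */
    static unsigned long load_above_capacity(unsigned int nr_running,
                                             unsigned long group_capacity,
                                             unsigned long group_load)
    {
            unsigned long assumed_load = nr_running * SCHED_LOAD_SCALE;

            if (assumed_load <= group_capacity)
                    return 0;

            assumed_load -= group_capacity;

            /*
             * Renicing every task changes group_load but not assumed_load,
             * which is the behavior change described above.
             */
            return assumed_load < group_load ? assumed_load : group_load;
    }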