linux-kernel - Re: newidle balancing in NUMA domain?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20091124063653.GB20981@wotan.suse.de>
Date:	Tue, 24 Nov 2009 07:36:53 +0100
From:	Nick Piggin <npiggin@...e.de>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: newidle balancing in NUMA domain?

On Mon, Nov 23, 2009 at 01:46:15PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@...e.de> wrote:
> 
> > > (I hope i explained my point clearly enough.)
> > > 
> > > No argument that it could be done cleaner - the duality right now of 
> > > both having the fuzzy stats and the rate limiting should be decided 
> > > one way or another.
> > 
> > Well, I would say please keep domain balancing behaviour at least 
> > somewhat close to how it was with O(1) scheduler at least until CFS is 
> > more sorted out. There is no need to knee jerk because BFS is better 
> > at something.
> 
> Well, see the change (commit 0ec9fab3d) attached below - even in 
> hindsight it was well argued by Mike with quite a few numbers: +68% in 
> x264 performance.

Quite a few being one test case, and on a program with a horrible
parallelism design (rapid heavy weight forks to distribute small
units of work).

 
> If you think it's a bad change (combined with newidle-rate-limit) then 
> please show the numbers that counter it and preferably extend 'perf 
> bench sched' with the testcases that you care about most.
> 
> IIRC (and Peter mentioned this too) Mike got into the whole kbuild angle 
> due to comparing mainline scheduler to BFS - but that's really a 
> positive thing IMO, not some "knee-jerk reaction".

It is a knee-jerk "we have to be better than BFS" reaction. It is
quite obvious that vasty increasing inter node balancing and remote
cacheline touches is not a good idea (unless it helps _reasonably_
designed programs without unduely hurting others).

And when you are changing all this desktop interactivity stuff, I
would ask please not to make these kinds of "improvements" to domains
balancing. I don't think that is too much to ask given track record
of scheduler problems.

Finally, I didn't realise I would have to be responsible for shooting
down every "ooh shiny" change that gets pushed into the scheduler. I
would have hoped they are ensured to be rather well thought out and
reviewed and tested first.

I do think it is a bad change, for extensive reasons I explained. It
will obviously cause regressions on workloads with lots of newidle
events that are un-balanceable (which would include _well tuned_ net
work and data base servers). And it does. I thought you'd have been
aware of that.

http://patchwork.kernel.org/patch/58931/

> 
> 	Ingo
> 
> ----------------------->
> >From 0ec9fab3d186d9cbb00c0f694d4a260d07c198d9 Mon Sep 17 00:00:00 2001
> From: Mike Galbraith <efault@....de>
> Date: Tue, 15 Sep 2009 15:07:03 +0200
> Subject: [PATCH] sched: Improve latencies and throughput
> 
> Make the idle balancer more agressive, to improve a
> x264 encoding workload provided by Jason Garrett-Glaser:
> 
>  NEXT_BUDDY NO_LB_BIAS
>  encoded 600 frames, 252.82 fps, 22096.60 kb/s
>  encoded 600 frames, 250.69 fps, 22096.60 kb/s
>  encoded 600 frames, 245.76 fps, 22096.60 kb/s
> 
>  NO_NEXT_BUDDY LB_BIAS
>  encoded 600 frames, 344.44 fps, 22096.60 kb/s
>  encoded 600 frames, 346.66 fps, 22096.60 kb/s
>  encoded 600 frames, 352.59 fps, 22096.60 kb/s
> 
>  NO_NEXT_BUDDY NO_LB_BIAS
>  encoded 600 frames, 425.75 fps, 22096.60 kb/s
>  encoded 600 frames, 425.45 fps, 22096.60 kb/s
>  encoded 600 frames, 422.49 fps, 22096.60 kb/s
> 
> Peter pointed out that this is better done via newidle_idx,
> not via LB_BIAS, newidle balancing should look for where
> there is load _now_, not where there was load 2 ticks ago.
> 
> Worst-case latencies are improved as well as no buddies
> means less vruntime spread. (as per prior lkml discussions)
> 
> This change improves kbuild-peak parallelism as well.
> 
> Reported-by: Jason Garrett-Glaser <darkshikari@...il.com>
> Signed-off-by: Mike Galbraith <efault@....de>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> LKML-Reference: <1253011667.9128.16.camel@...ge.simson.net>
> Signed-off-by: Ingo Molnar <mingo@...e.hu>
> ---
>  arch/ia64/include/asm/topology.h    |    5 +++--
>  arch/powerpc/include/asm/topology.h |    2 +-
>  arch/sh/include/asm/topology.h      |    3 ++-
>  arch/x86/include/asm/topology.h     |    4 +---
>  include/linux/topology.h            |    2 +-
>  kernel/sched_features.h             |    2 +-
>  6 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
> index 47f3c51..42f1673 100644
> --- a/arch/ia64/include/asm/topology.h
> +++ b/arch/ia64/include/asm/topology.h
> @@ -61,7 +61,7 @@ void build_cpu_to_node_map(void);
>  	.cache_nice_tries	= 2,			\
>  	.busy_idx		= 2,			\
>  	.idle_idx		= 1,			\
> -	.newidle_idx		= 2,			\
> +	.newidle_idx		= 0,			\
>  	.wake_idx		= 0,			\
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
> @@ -87,10 +87,11 @@ void build_cpu_to_node_map(void);
>  	.cache_nice_tries	= 2,			\
>  	.busy_idx		= 3,			\
>  	.idle_idx		= 2,			\
> -	.newidle_idx		= 2,			\
> +	.newidle_idx		= 0,			\
>  	.wake_idx		= 0,			\
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
> +				| SD_BALANCE_NEWIDLE	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_WAKE	\
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index a6b220a..1a2c9eb 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -57,7 +57,7 @@ static inline int pcibus_to_node(struct pci_bus *bus)
>  	.cache_nice_tries	= 1,			\
>  	.busy_idx		= 3,			\
>  	.idle_idx		= 1,			\
> -	.newidle_idx		= 2,			\
> +	.newidle_idx		= 0,			\
>  	.wake_idx		= 0,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_EXEC	\
> diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
> index 9054e5c..c843677 100644
> --- a/arch/sh/include/asm/topology.h
> +++ b/arch/sh/include/asm/topology.h
> @@ -15,13 +15,14 @@
>  	.cache_nice_tries	= 2,			\
>  	.busy_idx		= 3,			\
>  	.idle_idx		= 2,			\
> -	.newidle_idx		= 2,			\
> +	.newidle_idx		= 0,			\
>  	.wake_idx		= 0,			\
>  	.forkexec_idx		= 1,			\
>  	.flags			= SD_LOAD_BALANCE	\
>  				| SD_BALANCE_FORK	\
>  				| SD_BALANCE_EXEC	\
>  				| SD_BALANCE_WAKE	\
> +				| SD_BALANCE_NEWIDLE	\
>  				| SD_SERIALIZE,		\
>  	.last_balance		= jiffies,		\
>  	.balance_interval	= 1,			\
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 4b1b335..7fafd1b 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -116,14 +116,12 @@ extern unsigned long node_remap_size[];
>  
>  # define SD_CACHE_NICE_TRIES	1
>  # define SD_IDLE_IDX		1
> -# define SD_NEWIDLE_IDX		2
>  # define SD_FORKEXEC_IDX	0
>  
>  #else
>  
>  # define SD_CACHE_NICE_TRIES	2
>  # define SD_IDLE_IDX		2
> -# define SD_NEWIDLE_IDX		2
>  # define SD_FORKEXEC_IDX	1
>  
>  #endif
> @@ -137,7 +135,7 @@ extern unsigned long node_remap_size[];
>  	.cache_nice_tries	= SD_CACHE_NICE_TRIES,			\
>  	.busy_idx		= 3,					\
>  	.idle_idx		= SD_IDLE_IDX,				\
> -	.newidle_idx		= SD_NEWIDLE_IDX,			\
> +	.newidle_idx		= 0,					\
>  	.wake_idx		= 0,					\
>  	.forkexec_idx		= SD_FORKEXEC_IDX,			\
>  									\
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index c87edcd..4298745 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -151,7 +151,7 @@ int arch_update_cpu_topology(void);
>  	.cache_nice_tries	= 1,					\
>  	.busy_idx		= 2,					\
>  	.idle_idx		= 1,					\
> -	.newidle_idx		= 2,					\
> +	.newidle_idx		= 0,					\
>  	.wake_idx		= 0,					\
>  	.forkexec_idx		= 1,					\
>  									\
> diff --git a/kernel/sched_features.h b/kernel/sched_features.h
> index 891ea0f..e98c2e8 100644
> --- a/kernel/sched_features.h
> +++ b/kernel/sched_features.h
> @@ -67,7 +67,7 @@ SCHED_FEAT(AFFINE_WAKEUPS, 1)
>   * wakeup-preemption), since its likely going to consume data we
>   * touched, increases cache locality.
>   */
> -SCHED_FEAT(NEXT_BUDDY, 1)
> +SCHED_FEAT(NEXT_BUDDY, 0)
>  
>  /*
>   * Prefer to schedule the task that ran last (when we did
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/