Message-Id: <20160819115212.1c5eba20@TP-holzheu>
Date: Fri, 19 Aug 2016 11:52:12 +0200
From: Michael Holzheu <holzheu@...ux.vnet.ibm.com>
To: Tejun Heo <tj@...nel.org>
Cc: Heiko Carstens <heiko.carstens@...ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
Ming Lei <tom.leiming@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
LKML <linux-kernel@...r.kernel.org>,
Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Lai Jiangshan <laijs@...fujitsu.com>,
Martin Schwidefsky <schwidefsky@...ibm.com>
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on
online && !active" causes warning
On Thu, 18 Aug 2016 10:42:08 -0400,
Tejun Heo <tj@...nel.org> wrote:
> Hello, Michael.
>
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" this is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> >
> > Example:
> >
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> > assignment:
> >
> > - N1: cpu 1 on chip 1
> > - N2: cpu 2 on chip 2
> >
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> >
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
>
> I see. Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns. Is it
> just to segment memory of a machine into multiple pieces?
Correct.
> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?
Yes, we hit a scalability issue. Our performance team found out that for
big (> 1 TB) overcommitted (memory / swap ratio > 1 : 2) systems we
see problems:
- Zone locks are highly contended because ZONE_NORMAL is big:
  * zone->lock
  * zone->lru_lock
- One kswapd is not enough for swapping
We hope that those problems are resolved by fake NUMA because for each
node a separate memory subsystem is created with separate zone locks
and kswapd threads.
> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?
IMHO this is a promising idea. To put it in my own words:
- At boot time we already pin all remaining "not configured" logical
  CPUs to nodes. So all possible CPUs are pinned to nodes and
  cpu_to_node() will work.
- If a new physical CPU gets configured, we get the CPU topology
  information from the system and find the best node.
- We get a logical CPU number from the node pool and assign the
  new physical CPU to that number.
If that works we would be as good as before. We will check in the
code whether this is possible.
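To make the three steps concrete, here is a rough user-space sketch. It is
not the s390 implementation: the names (cpu_slot, pin_possible_cpus,
find_best_node, configure_cpu), the fixed NR_NODES/NR_CPUS values and the
round-robin pinning at boot are all assumptions for illustration. The
"best node" rule follows the example quoted above: prefer a node that
already hosts the chip, unless that node already has more CPUs than the
least-loaded one.

```c
#include <assert.h>

#define NR_NODES 2   /* assumed number of fake NUMA nodes */
#define NR_CPUS  8   /* assumed number of possible logical CPUs */
#define NO_CPU   (-1)

/* Hypothetical bookkeeping; none of these names come from the kernel. */
struct cpu_slot {
	int node;    /* node this logical CPU was pinned to at boot */
	int chip;    /* chip of the physical CPU behind it, -1 if vacant */
	int in_use;  /* 1 once a physical CPU has been assigned */
};

static struct cpu_slot slots[NR_CPUS];

/* Step 1: pin every possible logical CPU to a node (round robin here). */
static void pin_possible_cpus(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		slots[cpu].node = cpu % NR_NODES;
		slots[cpu].chip = -1;
		slots[cpu].in_use = 0;
	}
}

/* Number of configured CPUs currently living on a node. */
static int node_count(int node)
{
	int n = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (slots[cpu].in_use && slots[cpu].node == node)
			n++;
	return n;
}

/* Does any configured CPU on this node sit on the given chip? */
static int node_has_chip(int node, int chip)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (slots[cpu].in_use && slots[cpu].node == node &&
		    slots[cpu].chip == chip)
			return 1;
	return 0;
}

/* Step 2: topology match wins only while the nodes stay balanced. */
static int find_best_node(int chip)
{
	int least = 0;

	for (int node = 1; node < NR_NODES; node++)
		if (node_count(node) < node_count(least))
			least = node;

	for (int node = 0; node < NR_NODES; node++)
		if (node_has_chip(node, chip) &&
		    node_count(node) <= node_count(least))
			return node;

	return least;
}

/* Step 3: take a vacant logical CPU number from the chosen node's pool. */
static int configure_cpu(int chip)
{
	int node;

	assert(chip >= 0);
	node = find_best_node(chip);

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!slots[cpu].in_use && slots[cpu].node == node) {
			slots[cpu].in_use = 1;
			slots[cpu].chip = chip;
			return cpu;
		}
	}
	return NO_CPU; /* pool of the chosen node is exhausted */
}
```

The sketch gives up when the chosen node's pool is empty; falling back to
another node's pool would be an obvious alternative, at the cost of
weakening the boot-time pinning.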
Michael