linux-kernel - Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning

Message-Id: <20160819115212.1c5eba20@TP-holzheu>
Date:   Fri, 19 Aug 2016 11:52:12 +0200
From:   Michael Holzheu <holzheu@...ux.vnet.ibm.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Heiko Carstens <heiko.carstens@...ibm.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ming Lei <tom.leiming@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Yasuaki Ishimatsu <isimatu.yasuaki@...fujitsu.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Lai Jiangshan <laijs@...fujitsu.com>,
        Martin Schwidefsky <schwidefsky@...ibm.com>
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on
 online && !active" causes warning

On Thu, 18 Aug 2016 10:42:08 -0400,
Tejun Heo <tj@...nel.org> wrote:

> Hello, Michael.
> 
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" this is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> > 
> > Example:
> > 
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >    assignment:
> > 
> >    - N1: cpu 1 on chip 1
> >    - N2: cpu 2 on chip 2
> > 
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> > 
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
> 
> I see.  Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns.  Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found that on
big (> 1 TB), overcommitted (memory / swap ratio > 1 : 2) systems we
see problems:

 - Zone locks are highly contended because ZONE_NORMAL is big:
   * zone->lock
   * zone->lru_lock
 - One kswapd is not enough for swapping

We hope that those problems are resolved by fake NUMA because for each
node a separate memory subsystem is created with separate zone locks
and kswapd threads.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To put it in my own words:

 - At boot time we already pin all remaining "not configured" logical
   CPUs to nodes. So all possible cpus are pinned to nodes and
   cpu_to_node() will work.

 - If a new physical cpu gets configured, we get the CPU topology
   information from the system and find the best node.

 - We get a logical cpu number from the node pool and assign the
   new physical cpu to that number.

If that works we would be as good as before. We will look into the
code to see whether this is possible.
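
The three steps above could be sketched roughly as follows. This is
only a userspace illustration, not s390 kernel code: the names
(struct logical_cpu, pin_all_cpus, find_best_node, take_from_pool,
configure_cpu), the round-robin pinning at boot and the tie-breaking
are all assumptions made for the example. Nodes are numbered from 0
here, so N1 and N2 from the example correspond to node 0 and node 1.

```c
#define NR_NODES 2
#define NR_CPUS  8   /* number of possible logical CPUs (assumed) */

/* Hypothetical bookkeeping, not the actual s390 data structures. */
struct logical_cpu {
	int node;       /* node pinned at boot, so a cpu_to_node()-style lookup works */
	int configured; /* 1 once a physical CPU is bound to this number */
	int chip;       /* chip of the bound physical CPU, valid if configured */
};

struct logical_cpu cpus[NR_CPUS];

/* Boot time: pin every possible logical CPU to a node (round robin here). */
void pin_all_cpus(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		cpus[cpu].node = cpu % NR_NODES;
		cpus[cpu].configured = 0;
		cpus[cpu].chip = -1;
	}
}

/* Return the first vacant logical CPU number in the node's pool, or -1. */
int take_from_pool(int node)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpus[cpu].node == node && !cpus[cpu].configured)
			return cpu;
	return -1;
}

/* Pick a node for a new physical CPU that lives on 'chip', or -1. */
int find_best_node(int chip)
{
	int total[NR_NODES] = {0}, same_chip[NR_NODES] = {0};
	int best = -1, least = -1;

	/* Count configured CPUs per node, and those sharing the chip. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpus[cpu].configured)
			continue;
		total[cpus[cpu].node]++;
		if (cpus[cpu].chip == chip)
			same_chip[cpus[cpu].node]++;
	}

	for (int node = 0; node < NR_NODES; node++) {
		if (take_from_pool(node) < 0)
			continue; /* this node's pool is exhausted */
		if (least < 0 || total[node] < total[least])
			least = node;
		if (best < 0 || same_chip[node] > same_chip[best])
			best = node;
	}
	if (best < 0)
		return -1;

	/* Topology decides only while the nodes are balanced. */
	return total[best] > total[least] ? least : best;
}

/* Bind a new physical CPU on 'chip' to a logical CPU number, or -1. */
int configure_cpu(int chip)
{
	int node = find_best_node(chip);
	int cpu = node < 0 ? -1 : take_from_pool(node);

	if (cpu >= 0) {
		cpus[cpu].configured = 1;
		cpus[cpu].chip = chip;
	}
	return cpu;
}
```

With this sketch, after pin_all_cpus() and one configured CPU per node
(chip 1 on node 0, chip 2 on node 1), a further configure_cpu(2)
returns a logical CPU pinned to node 1, and the next configure_cpu(2)
falls back to node 0 because node 1 is then one CPU ahead.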

Michael