linux-kernel - Re: pcpu allocator on large NUMA machines

Message-ID: <877eyxz4r8.fsf@concordia.ellerman.id.au>
Date:   Tue, 25 Jul 2017 11:26:03 +1000
From:   Michael Ellerman <mpe@...erman.id.au>
To:     Michal Hocko <mhocko@...nel.org>, Tejun Heo <tj@...nel.org>
Cc:     Jiri Kosina <jkosina@...e.cz>, linux-mm@...ck.org,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: pcpu allocator on large NUMA machines

Michal Hocko <mhocko@...nel.org> writes:

> On Mon 24-07-17 09:57:14, Tejun Heo wrote:
>> On Mon, Jul 24, 2017 at 03:42:40PM +0200, Michal Hocko wrote:
> [...]
>> > My understanding of the pcpu allocator is basically close to zero but it
>> > seems weird to me that we would need many TB of vmalloc address space
>> > just to allocate vmalloc areas that are in range of hundreds of MB. So I
>> > am wondering whether this is an expected behavior of the allocator or
>> > there is a problem somewhere else.
>> 
>> It's not actually using the entire region but the area allocations try
>> to follow the same topology as kernel linear address layouts.  ie. if
>> kernel address for different NUMA nodes are apart by certain amount,
>> the percpu allocator tries to replicate that for dynamic allocations
>> which allows leaving the static and first dynamic area in the kernel
>> linear address which helps reduce TLB pressure.
>> 
>> This optimization can be turned off when vmalloc area isn't spacious
>> enough by using pcpu_page_first_chunk() instead of
>> pcpu_embed_first_chunk() while initializing percpu allocator.
>
> Thanks for the clarification, this is really helpful!
>
>> Can you
>> see whether replacing that in arch/powerpc/kernel/setup_64.c fixes the
>> issue?  If so, all it needs to do is figuring out what conditions we
>> need to check to opt out of embedding the first chunk.  Note that x86
>> 32bit does about the same thing.
>
> Hmm, I will need some help from PPC guys here. I cannot find something
> ready to implement pcpup_populate_pte and I am not familiar with ppc
> memory model to implement one myself.

I don't think we want to stop using embed first chunk unless we have to.

We have code that accesses percpu variables in real mode (with the MMU
off), and that wouldn't work easily if the first chunk wasn't in the
linear mapping. So it's not just an optimisation for us.

We can fairly easily make the vmalloc space 56T, and I'm working on a
patch to make it ~500T on newer machines.

cheers
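
The layout constraint discussed in the quoted text (dynamic percpu areas
replicating the spacing of the per-node linear addresses) can be modelled
with a little arithmetic. What follows is only a userspace sketch, not
kernel code: the function names, the simplified distance calculation and
the 75% threshold are assumptions of this sketch, loosely modelled on the
sanity check performed when the first chunk is embedded.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: each NUMA node's percpu group starts at some
 * offset in the kernel linear mapping. The vmalloc space must be able
 * to host an area with the same spread for dynamic chunks.
 *
 * Returns the span from the lowest group base to the end of the
 * highest group, or 0 when there are no groups.
 */
static uint64_t pcpu_model_max_distance(const uint64_t *group_base,
                                        size_t nr_groups,
                                        uint64_t group_size)
{
    if (nr_groups == 0)
        return 0;

    uint64_t lo = group_base[0];
    uint64_t hi = group_base[0];

    for (size_t i = 1; i < nr_groups; i++) {
        if (group_base[i] < lo)
            lo = group_base[i];
        if (group_base[i] > hi)
            hi = group_base[i];
    }
    return (hi - lo) + group_size;
}

/*
 * Assumed rule of thumb: embedding is considered workable only if the
 * span fits in 75% of the vmalloc space. Dividing before multiplying
 * avoids overflow for very large vmalloc sizes.
 */
static bool pcpu_model_embed_fits(uint64_t max_distance,
                                  uint64_t vmalloc_total)
{
    return max_distance <= vmalloc_total / 4 * 3;
}
```

With two nodes whose linear addresses are, say, 32T apart (a made-up
figure), the span is a little over 32T even though only a few MB are
actually used per node. Under this model that does not fit an 8T vmalloc
space but does fit a 56T one, which is why growing the vmalloc space
lets embedding the first chunk keep working.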