linux-kernel - Re: [RFC 0/6] mm: improve page allocator scalability via splitting zones

Message-ID: <eae68813-4240-4de1-6177-0a44e00bd04d@redhat.com>
Date:   Wed, 17 May 2023 10:09:31 +0200
From:   David Hildenbrand <david@...hat.com>
To:     "Huang, Ying" <ying.huang@...el.com>
Cc:     Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org,
        Arjan Van De Ven <arjan@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Vlastimil Babka <vbabka@...e.cz>,
        Johannes Weiner <jweiner@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Pavel Tatashin <pasha.tatashin@...een.com>,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: [RFC 0/6] mm: improve page allocator scalability via splitting
 zones

>> If we could avoid instantiating more zones and rather improve existing
>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>> it's not easy, but that shouldn't stop us from trying ;)
> 
> I do think improving PCP or adding another level of cache will help
> performance and scalability.
> 
> And, I think that it has value too to improve the performance of zone
> itself.  Because there will be always some cases that the zone lock
> itself is contended.
> 
> That is, PCP and zone works at different level, and both deserve to be
> improved.  Do you agree?

Spoiler: my humble opinion

Well, the zone is kind of your "global" memory provider, and PCPs cache
a fraction of that precisely to avoid touching that global data
structure and contending on its lock.

One benefit I can see of such a "global" memory provider with caches on
top is that it is nicely integrated: for example, the concept of
memory pressure exists for the zone as a whole. All memory is of the
same kind and managed in a single entity, but free memory is cached for
performance.

As soon as you manage the memory in multiple zones of the same kind, you
lose that "global" view of your memory that is of the same kind, but
managed in different buckets. You might end up with a lot of memory
pressure in a single such zone, but still have plenty in another zone.
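
To illustrate the point about losing the "global" view, here is a toy
model in Python. It is only a sketch and not kernel code: the function
names, the watermark value and the free-page counts are all made up for
illustration, and real watermark handling is considerably more involved.

```python
# Toy model (not kernel code): one zone vs. several zones of the same kind.
# All names and numbers below are hypothetical.

def under_pressure(free_pages, low_watermark):
    """A zone counts as 'under pressure' when free pages drop below its watermark."""
    return free_pages < low_watermark

def split_view(free_per_zone, low_watermark_per_zone):
    """Pressure state of each zone instance, evaluated independently."""
    return [under_pressure(f, low_watermark_per_zone) for f in free_per_zone]

def global_view(free_per_zone, low_watermark_per_zone):
    """Pressure state if the same memory were managed as a single zone."""
    total_free = sum(free_per_zone)
    total_watermark = low_watermark_per_zone * len(free_per_zone)
    return under_pressure(total_free, total_watermark)

# Four zone instances of the same kind; the first one is nearly exhausted.
free = [100, 50_000, 48_000, 51_000]
WMARK = 1_000

print(split_view(free, WMARK))   # per-instance pressure
print(global_view(free, WMARK))  # pressure of the combined pool
```

In this made-up scenario the first instance reports pressure while the
combined pool does not, which is the kind of imbalance I am worried
about.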

As one example, with a single zone, hot(un)plug of memory is easy: there
is only one zone to consider. No need to make smart decisions or deal
with having memory we're hotunplugging be stranded in multiple zones.

> 
>> I did not look into the details of this proposal, but seeing the
>> change in include/linux/page-flags-layout.h scares me.
> 
> It's possible for us to use 1 more bit in page->flags.  Do you think
> that will cause severe issue?  Or you think some other stuff isn't
> acceptable?

The issue is, everybody wants to consume more bits in page->flags, so if 
we can get away without it that would be much better :)

The more bits you want to consume, the more people will ask for making
this a compile-time option and eventually compile it out on distro
kernels (e.g., with many NUMA nodes). So we end up with more code and
complexity and eventually don't get the benefits where we really want
them.
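
A rough way to think about the bit budget, as a Python sketch. The field
names and widths are placeholders I picked for illustration; they are
not the actual layout from include/linux/page-flags-layout.h, which
depends on the architecture and config.

```python
# Toy bit-budget calculation for a fixed-width flags word.
# Field names and widths are hypothetical, not the real page->flags layout.

def bits_needed(count):
    """Bits required to encode `count` distinct values (0 for a single value)."""
    return max(count - 1, 0).bit_length()

def spare_bits(word_bits, field_widths):
    """Bits left over after all fields are packed into the word."""
    return word_bits - sum(field_widths.values())

layout = {
    "flag_bits": 24,             # assumed number of individual page flags
    "node": bits_needed(1024),   # assumed: up to 1024 NUMA nodes
    "zone": bits_needed(4),      # assumed: 4 zones per node
    "other": 26,                 # assumed: everything else packed in the word
}
print(spare_bits(64, layout))

# Going from 4 to 8 encodable zones costs one more bit.
layout["zone"] = bits_needed(8)
print(spare_bits(64, layout))
```

With these made-up widths the word goes from two spare bits to one, and
the next feature that wants a bit has to fight for it.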

> 
>> Further, I'm not so sure how that change really interacts with
>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>> hacks the code such that the split works based on the boot
>> memory size ...
> 
> Em..., the zone stuff is kind of static now.  It's hard to add a zone at
> run-time.  So, in this series, we determine the number of zones per zone
> type based on boot memory size.  This may be improved in the future via
> pre-allocate some empty zone instances during boot and hot-add some
> memory to these zones.

Just to give you some idea: with virtio-mem, hyper-v, daxctl, and 
upcoming cxl dynamic memory pooling (some day I'm sure ;) ) you might 
see quite a small boot memory (e.g., 4 GiB) but a significant amount of 
memory getting hotplugged incrementally (e.g., up to 1 TiB) -- well, and 
hotunplugged. With multiple zone instances you really have to be careful 
and might have to re-balance between the multiple zones to keep the 
scalability, to not create imbalances between the zones ...
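
To put rough numbers on that, a small Python sketch. The sizing rule
(one zone instance per 16 GiB) is purely an assumption for illustration;
I have not checked which rule the series actually uses.

```python
import math

# Toy model: the number of zone instances is fixed at boot, then memory
# is hotplugged. The 16 GiB per-instance target is a hypothetical value.

def zone_instances(memory_gib, target_gib_per_zone):
    """Instances a size-based rule would pick for a given amount of memory."""
    return max(1, math.ceil(memory_gib / target_gib_per_zone))

def gib_per_instance(memory_gib, instances):
    """Average memory per instance if memory were spread evenly."""
    return memory_gib / instances

TARGET_GIB = 16
boot_gib, hotplugged_total_gib = 4, 1024

at_boot = zone_instances(boot_gib, TARGET_GIB)
wanted_later = zone_instances(hotplugged_total_gib, TARGET_GIB)

print(at_boot, wanted_later)
print(gib_per_instance(hotplugged_total_gib, at_boot))
```

Under these assumptions the decision made at 4 GiB of boot memory yields
a single instance, while the same rule applied at 1 TiB would ask for
many more, so the boot-time choice no longer matches the machine.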

Something like PCP auto-tuning would be able to handle that mostly 
automatically, as there is only a single memory pool.

> 
>> I agree with Michal that looking into auto-tuning PCP would be
>> preferred. If that can't be done, adding another layer might end up
>> cleaner and eventually cover more use cases.
> 
> I do agree that it's valuable to make PCP etc. cover more use cases.  I
> just think that this should not prevent us from optimizing zone itself
> to cover remaining use cases.

I really don't like the concept of replicating zones of the same kind
for the same NUMA node. But that's just my personal opinion from
maintaining some memory hot(un)plug code :)

That said, some kind of a sub-zone concept (additional layer) as
outlined by Michal IIUC, for example, indexed by core id/hash/whatsoever
could eventually be worth exploring. Yes, such a design raises various
questions ... :)

-- 
Thanks,

David / dhildenb