linux-kernel - Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

Message-ID: <e132fdd9-65af-1cad-8a6e-71844ebfe6a2@redhat.com>
Date:   Wed, 12 May 2021 18:14:06 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Zi Yan <ziy@...dia.com>, Michal Hocko <mhocko@...e.com>
Cc:     Oscar Salvador <osalvador@...e.de>,
        Michael Ellerman <mpe@...erman.id.au>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
        Andy Lutomirski <luto@...nel.org>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mike Rapoport <rppt@...nel.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Dan Williams <dan.j.williams@...el.com>,
        Wei Yang <richard.weiyang@...ux.alibaba.com>,
        linux-ia64@...r.kernel.org, linux-kernel@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

>>
>> As stated somewhere here already, we'll have to look into making alloc_contig_range() (and its main users, CMA and virtio-mem) independent of MAX_ORDER and mainly rely on pageblock_order. The current handling in alloc_contig_range() is far from optimal as we have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains something unmovable although we don't even want to allocate that part. I actually have that on my list (to be able to fully support pageblock_order instead of MAX_ORDER - 1 chunks in virtio-mem), however I didn't have time to look into it.
> 
> So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
> should be used instead of the buddy allocator while pageblock_order is kept at a small
> granularity like 2MB. Is that the case? Isn't it going to have a high failure rate
> when any of the pageblocks within a gigantic page range (like 1GB) becomes unmovable?
> Are you thinking of an additional mechanism/policy to prevent such a thing from happening as
> an additional step for gigantic page allocation? Like your ZONE_PREFER_MOVABLE idea?
> 

I am not fully sure yet where the journey will go; I guess nobody
knows. Ultimately, having buddy support for >= current MAX_ORDER (IOW,
increasing MAX_ORDER) will most probably happen, so it would be worth
investigating what has to be done to get that running as a first step.

Of course, we could temporarily think about wiring it up in the buddy like

if (order < MAX_ORDER)
	__alloc_pages()...
else
	alloc_contig_pages()

but it doesn't really improve the situation IMHO, just an API change.
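
To make the point concrete, here is a minimal userspace sketch of that dispatch. It only models the decision, not the allocation itself: SKETCH_MAX_ORDER, the enum and sketch_pick_path() are hypothetical names for illustration, and the real MAX_ORDER depends on the kernel configuration.

```c
#include <stddef.h>

/* Assumption: a stand-in for MAX_ORDER; the real value is config dependent. */
#define SKETCH_MAX_ORDER 11

/* Hypothetical labels for the two allocation paths discussed above. */
enum sketch_path {
	SKETCH_PATH_BUDDY,	/* would end up in __alloc_pages() */
	SKETCH_PATH_CONTIG,	/* would end up in alloc_contig_pages() */
};

/* Pick the path for a request of 2^order pages. */
static enum sketch_path sketch_pick_path(unsigned int order)
{
	if (order < SKETCH_MAX_ORDER)
		return SKETCH_PATH_BUDDY;
	return SKETCH_PATH_CONTIG;
}
```

Callers would see a single entry point, but everything at or above the threshold still inherits the limitations of the contiguous allocation path, which is why this is only an API change.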

So I think we should look into increasing MAX_ORDER, seeing what needs
to be done to have that part running while keeping the section size and
the pageblock order as is. I know that at least memory
onlining/offlining, cma, alloc_contig_range(), ... need tweaking,
especially when we don't increase the section size (but also if we would,
due to the way page isolation is currently handled). Having a
MAX_ORDER - 1 page being partially in different nodes might be another
thing to look into (I heard that it can already happen right now, but I
don't remember the details).

The next step after that would then be better fragmentation avoidance 
for larger granularity like 1G THP.

>>
>> Further, page onlining / offlining code and early init code most probably also need care if a MAX_ORDER - 1 page crosses sections. Memory holes we might suddenly have in MAX_ORDER - 1 pages might become a problem and will have to be handled. Not sure which other code has to be tweaked (compaction? page isolation?).
> 
> Can you elaborate on it a little more? From what I understand, memory holes mean valid
> PFNs are not contiguous before and after a hole, so pfn++ will not work, but
> struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
> would still work. So when a MAX_ORDER - 1 page crosses sections, additional code would be
> needed instead of simple pfn++. Is there anything I am missing?

I think there are two cases when talking about MAX_ORDER and memory holes:

1. Hole with a valid memmap: the memmap is initialized to PageReserved()
    and the pages are not given to the buddy. pfn_valid() and
    pfn_to_page() work as expected.
2. Hole without a valid memmap: we have that CONFIG_HOLES_IN_ZONE thing
    already, see include/linux/mmzone.h. pfn_valid_within() checks are
    required. Doesn't win a beauty contest, but gets the job done in
    existing setups that seem to care.

"If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we 
need to check pfn validity within that MAX_ORDER_NR_PAGES block. 
pfn_valid_within() should be used in this case; we optimise this away 
when we have no holes within a MAX_ORDER_NR_PAGES block."

CONFIG_HOLES_IN_ZONE is just a bad name for this.

(increasing the section size implies that we waste more memory for the 
memmap in holes. increasing MAX_ORDER means that we might have to deal 
with holes within MAX_ORDER chunks)
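
As a rough illustration of the memmap cost, a small helper can compute how much memmap one section needs. The helper name is hypothetical, and the numbers in the note below assume 4 KiB base pages and a 64-byte struct page, which is typical for x86_64 but not guaranteed.

```c
#include <stdint.h>

/*
 * Bytes of memmap needed to cover one memory section.
 * page_size and struct_page_size are assumptions supplied by the caller.
 */
static uint64_t sketch_memmap_bytes_per_section(uint64_t section_bytes,
						uint64_t page_size,
						uint64_t struct_page_size)
{
	/* One struct page per base page in the section. */
	return (section_bytes / page_size) * struct_page_size;
}
```

Under those assumptions, a 128 MiB section needs 2 MiB of memmap, while a 1 GiB section needs 16 MiB, and that memmap is allocated even if most of the section is a hole.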

We don't have too many pfn_valid_within() checks. I wonder if we could
add something that is optimized for "holes are a power of two and
properly aligned", because pfn_valid_within() right now deals with holes
of any kind, which makes it somewhat inefficient IIRC.
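
One way to picture such an optimization is the userspace sketch below. It is not existing kernel code: struct sketch_zone, SKETCH_HOLE_ORDER and both helpers are hypothetical, and the model simply assumes that every hole covers whole, naturally aligned chunks of 2^SKETCH_HOLE_ORDER pages, so validity can be tracked per chunk instead of per pfn.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: holes are multiples of 2^9 pages and aligned to that size. */
#define SKETCH_HOLE_ORDER 9

/* Hypothetical model: one validity flag per aligned chunk of pages. */
struct sketch_zone {
	const bool *chunk_valid;
	size_t nr_chunks;
};

/* Per-pfn check, comparable in spirit to pfn_valid_within(). */
static bool sketch_pfn_valid(const struct sketch_zone *z, unsigned long pfn)
{
	size_t chunk = pfn >> SKETCH_HOLE_ORDER;

	return chunk < z->nr_chunks && z->chunk_valid[chunk];
}

/* Range check that does one lookup per chunk instead of one per pfn. */
static bool sketch_range_valid(const struct sketch_zone *z,
			       unsigned long start_pfn,
			       unsigned long nr_pages)
{
	unsigned long first, last, c;

	if (nr_pages == 0)
		return true;

	first = start_pfn >> SKETCH_HOLE_ORDER;
	last = (start_pfn + nr_pages - 1) >> SKETCH_HOLE_ORDER;

	for (c = first; c <= last; c++) {
		if (c >= z->nr_chunks || !z->chunk_valid[c])
			return false;
	}
	return true;
}
```

With 512-page chunks, scanning a 1024-page range takes two lookups instead of 1024 per-pfn checks; the trade-off is that holes of arbitrary size or alignment can no longer be represented.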

> 
> BTW, to test a system with memory holes, do you know if there is an easy way of adding
> random memory holes to an x86_64 VM, which can help reveal potential missing pieces
> in the code? Changing the BIOS-e820 table might be one way, but I have no idea
> how to do it on QEMU.

It might not be very easy that way. But I heard that some arm64 systems 
have crazy memory layouts -- maybe there, it's easier to get something 
nasty running? :)

https://lkml.kernel.org/r/YJpEwF2cGjS5mKma@kernel.org

I remember there was a way to define the e820 completely on kernel 
cmdline, but I might be wrong ...

-- 
Thanks,

David / dhildenb