Message-ID: <2750c11a-524d-b248-060c-49e6b3eb8975@redhat.com>
Date: Wed, 26 Jun 2019 10:11:06 +0200
From: David Hildenbrand <david@...hat.com>
To: Oscar Salvador <osalvador@...e.de>
Cc: akpm@...ux-foundation.org, mhocko@...e.com,
dan.j.williams@...el.com, pasha.tatashin@...een.com,
Jonathan.Cameron@...wei.com, anshuman.khandual@....com,
vbabka@...e.cz, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 0/5] Allocate memmap from hotadded memory
On 26.06.19 10:03, Oscar Salvador wrote:
> On Tue, Jun 25, 2019 at 10:25:48AM +0200, David Hildenbrand wrote:
>>> [Coverletter]
>>>
>>> This is another step to make memory hotplug more usable. The primary
>>> goal of this patchset is to reduce memory overhead of the hot-added
>>> memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we currently
>>> populate the memmap (struct page array) has two main drawbacks:
>
> First off, thanks for looking into this :-)
Thanks for working on this ;)
>
>>
>> Mental note: How will it be handled if a caller specifies "Allocate
>> memmap from hotadded memory", but we are running under SPARSEMEM where
>> we can't do this?
>
> In add_memory_resource(), we have a call to mhp_check_correct_flags(), which is
> in charge of checking whether the passed flags are compliant with our
> configuration, among other things.
> It also checks if both flags were passed (_MEMBLOCK|_DEVICE).
>
> If a) any of the flags was specified and we are not on CONFIG_SPARSEMEM_VMEMMAP,
> b) the flags collide with each other, or c) the flags just do not make sense,
> we print out a warning and drop the flags to 0, so we just ignore them.
>
> I just realized that I can adjust the check even more (something for the next
> version).
>
> But to answer your question, flags are ignored under !CONFIG_SPARSEMEM_VMEMMAP.
So it is indeed a hint only.
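Good. Just so I picture the check correctly, I would expect something
roughly like this (a hypothetical sketch based on your description, not
your actual patch):

static void mhp_check_correct_flags(unsigned long *flags)
{
	/* The new flags only make sense with a vmemmap-based memmap. */
	if ((*flags & (MHP_MEMMAP_DEVICE | MHP_MEMMAP_MEMBLOCK)) &&
	    !IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
		WARN(1, "Memmap cannot be allocated from hotadded memory, ignoring flags\n");
		*flags = 0;
	}
	/* Both flags together are contradictory, ignore them as well. */
	if ((*flags & MHP_MEMMAP_DEVICE) && (*flags & MHP_MEMMAP_MEMBLOCK)) {
		WARN(1, "MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK are mutually exclusive, ignoring flags\n");
		*flags = 0;
	}
}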
>
>>
>>>
>>> a) it consumes additional memory until the hotadded memory itself is
>>> onlined, and
>>> b) memmap might end up on a different NUMA node, which is especially true
>>> for the movable_node configuration.
>>>
>>> a) it is a problem especially for memory hotplug based memory "ballooning"
>>> solutions when the delay between physical memory hotplug and the
>>> onlining can lead to OOM and that led to introduction of hacks like auto
>>> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>>> policy for the newly added memory")).
>>>
>>> b) can have performance drawbacks.
>>>
>>> Another minor case is that I have seen hot-add operations failing on archs
>>> because they were running out of order-x pages.
>>> E.g. on powerpc, in certain configurations, we use order-8 pages,
>>> and given a 64KB base page size, that is 16MB.
>>> If we run out of those, we just fail the operation and we cannot add
>>> more memory.
>>
>> At least for SPARSEMEM, we fall back to vmalloc() to work around this
>> issue. I haven't looked into the populate_section_memmap() internals
>> yet. Can you point me at the code that performs this allocation?
>
> Yes, on SPARSEMEM we first try to allocate the pages physically contiguous, and
> then fall back to vmalloc.
> This is because with the CONFIG_SPARSEMEM memory model, the translations pfn_to_page/
> page_to_pfn do not expect the memory to be contiguous.
>
> But that is not the case with CONFIG_SPARSEMEM_VMEMMAP.
> We do expect the memory to be physically contiguous there; that is why a simple
> pfn_to_page/page_to_pfn is just a matter of adding/subtracting vmemmap/pfn.
Yes, I explored that last week but didn't figure out where the actual
vmemmap population code resided - thanks :)
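For reference, under CONFIG_SPARSEMEM_VMEMMAP the translation really is
just pointer arithmetic against the virtually contiguous memmap (quoting
include/asm-generic/memory_model.h from memory):

/* memmap is virtually contiguous.  */
#define __pfn_to_page(pfn)	(vmemmap + (pfn))
#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)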
>
> Powerpc code is at:
>
> https://elixir.bootlin.com/linux/v5.2-rc6/source/arch/powerpc/mm/init_64.c#L175
>
>
>
>> So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
>> remove_memory(128MB) of the added memory, this will work?
>
> No, MHP_MEMMAP_DEVICE is meant to be used when hot-adding and hot-removing work
> at the same granularity.
> This is because all memmap pages will be stored at the beginning of the memory
> range.
> Allowing hot-removing at a different granularity with MHP_MEMMAP_DEVICE would imply
> a lot of extra work.
> For example, we would have to parse the vmemmap-head of the hot-removed range,
> punch a hole in there to clear the vmemmap pages, and then be very careful
> when deleting those page tables.
>
> So I followed Michal's advice and decided to let the caller specify whether
> they want to allocate per memory block or per hot-added range (device).
> Allocating per memory block allows us to do:
>
> add_memory(1GB, MHP_MEMMAP_MEMBLOCK)
> remove_memory(128MB)
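So, if I understand the intended semantics correctly, the contract would
roughly be the following (hypothetical usage, signatures simplified just
like in the cover letter):

	/* Per device: the memmap of the whole range sits at its start. */
	add_memory(1GB, MHP_MEMMAP_DEVICE);
	remove_memory(1GB);	/* only removal of the full range is expected to work */

	/* Per memory block: every 128MB block carries its own memmap. */
	add_memory(1GB, MHP_MEMMAP_MEMBLOCK);
	remove_memory(128MB);	/* block-sized removal remains possible */
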
Back then, I already mentioned that we might have users that
remove_memory() memory they never added, in a granularity different from
the one it was added in. My concerns back then were never fully sorted out.
arch/powerpc/platforms/powernv/memtrace.c
- Will remove memory in memory-block-sized chunks that it never added
- What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?
Will it at least bail out? Or simply break?
IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is safe to be
introduced.
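At the very least I would expect the removal path to detect that case,
something along these lines (purely hypothetical sketch, both helper
names are made up):

static int check_hotremove_granularity(u64 start, u64 size)
{
	/*
	 * Hypothetical check, helper names made up: refuse to remove a
	 * sub-range whose memmap lives in another part of a range that
	 * was added with MHP_MEMMAP_DEVICE.
	 */
	if (range_has_vmemmap_on_device(start, size) &&
	    !range_covers_whole_device(start, size))
		return -EBUSY;

	return 0;
}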
--
Thanks,
David / dhildenb