Message-ID: <Z7UvchoiRUg_cnhh@gourry-fedora-PF4VCD3F>
Date: Tue, 18 Feb 2025 20:10:10 -0500
From: Gregory Price <gourry@...rry.net>
To: David Hildenbrand <david@...hat.com>
Cc: Yang Shi <shy828301@...il.com>, lsf-pc@...ts.linux-foundation.org,
linux-mm@...ck.org, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
On Tue, Feb 18, 2025 at 09:57:06PM +0100, David Hildenbrand wrote:
> >
> > 2) if memmap_on_memory is on, and hotplug capacity (node1) is
> > zone_movable - then each memory block (256MB) should appear
> > as 252MB (-4MB of 64-byte page structs). For 256GB (my system)
> > I should see a total of 252GB of onlined memory (-4GB of page struct)
>
> In memory_block_online(), we have:
>
> /*
> * Account once onlining succeeded. If the zone was unpopulated, it is
> * now already properly populated.
> */
> if (nr_vmemmap_pages)
> adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
> nr_vmemmap_pages);
>
I've validated the behavior on my system; I had simply misread my results.
memmap_on_memory works as suggested.
What's mildly confusing is that the pages used for the altmap are accounted
for in vmstat as if they were an allocation, while that capacity is also
chopped out of the memory block. It "makes sense", it's just subtly
misleading: I thought the system was telling me memory had been allocated
(out of the 'free' capacity), rather than that the capacity had simply been
reduced.
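
As a sanity check for anyone hitting the same confusion, here's roughly how
the numbers fall out (a sketch, not authoritative - it assumes 4K pages, a
64-byte struct page, and that node1 is the hotplugged node):

  # per-block altmap (vmemmap) overhead
  block=$((0x$(cat /sys/devices/system/memory/block_size_bytes)))   # 256MB here
  page=$(getconf PAGESIZE)                                          # 4096
  echo "overhead per block: $(( block / page * 64 >> 20 ))MB"       # -> 4MB

  # with memmap_on_memory=y the overhead shows up as reduced capacity,
  # not as allocated memory:
  grep MemTotal /sys/devices/system/node/node1/meminfo              # ~4MB short per onlined block
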
Thank you for clearing this up.
> >
> > stupid question - it sorta seems like you'd want this as the default
> > setting for driver-managed hotplug memory blocks, but I suppose for
> > very small blocks there's problems (as described in the docs).
>
> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>
That makes sense; I had not considered it. Although it only applies to
small blocks - which is basically an indictment of this suggestion:
https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
So I'll have to think about whether this should be a default - this alone
is probably enough to nak it entirely.
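
(To spell out the 1GiB point for small blocks - a back-of-the-envelope
sketch, assuming 256MB blocks each carving ~4MB of vmemmap out of its own
head:)

  block=$((256 << 20)); vmemmap=$((4 << 20))
  for i in 0 1 2 3; do
      printf 'usable: [%4dMB - %4dMB)\n' \
          $(( (i * block + vmemmap) >> 20 )) $(( (i + 1) * block >> 20 ))
  done
  # usable: [   4MB -  256MB)
  # usable: [ 260MB -  512MB)
  # usable: [ 516MB -  768MB)
  # usable: [ 772MB - 1024MB)
  # -> no 1GB-aligned, contiguous 1GB range survives, so a 1GB gigantic
  #    page can never be assembled from this memory.
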
... that said ....
Interestingly, when I tried allocating 1GiB hugetlb pages on a dax device
in ZONE_MOVABLE (without memmap_on_memory), the allocation failed silently
regardless of block size (I tried both 2GB and 256MB blocks). I can't find
anything in the existing documentation explaining why that would be the case.
(note: hugepage migration is enabled in the build config, so it's not that)
If I online one block (256MB) into ZONE_NORMAL and the remainder into
ZONE_MOVABLE (with memmap_on_memory=n), the allocation still fails, and
node1/vmstat now shows:
  nr_slab_unreclaimable 43
where previously there was nothing.
Onlining the dax devices into ZONE_NORMAL successfully allowed 1GiB huge
pages to allocate.
The above used the /sys/bus/node/devices/node1/hugepages/* interfaces.
Using /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages with an
interleave mempolicy instead, all hugepages end up in ZONE_NORMAL.
(v6.13 base kernel)
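
For reference, a rough sketch of the sequence I ran (dax0.0/node1 are my
device and node; daxctl is from the ndctl tools):

  # expose the dax device as system-ram via kmem (onlined movable by default)
  daxctl reconfigure-device --mode=system-ram dax0.0

  # per-node attempt described above:
  echo 4 > /sys/bus/node/devices/node1/hugepages/hugepages-1048576kB/nr_hugepages
  cat /sys/bus/node/devices/node1/hugepages/hugepages-1048576kB/nr_hugepages
  # -> stays 0 while node1 is ZONE_MOVABLE, nothing useful in dmesg

  # global attempt, which just lands everything in ZONE_NORMAL elsewhere:
  echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  grep HugePages_Total /sys/devices/system/node/node*/meminfo
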
This behavior is *curious* to say the least. Not sure if it's a bug or a
nuance missing from the documentation - but I'm certainly glad I caught it.
> I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
>
> IIRC, the global toggle must be enabled for the driver option to be considered.
Oh, well, that's an extra layer I missed. So there's:
build:
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
global:
/sys/module/memory_hotplug/parameters/memmap_on_memory
device:
/sys/bus/dax/devices/dax0.0/memmap_on_memory
And looking at it - this does seem to be the default for dax.
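
(For completeness, checking all three layers on a live system is just the
following - the config file location is distro-dependent, and dax0.0 is my
device:)

  grep MEMMAP_ON_MEMORY /boot/config-$(uname -r)   # or zgrep /proc/config.gz
  cat /sys/module/memory_hotplug/parameters/memmap_on_memory
  cat /sys/bus/dax/devices/dax0.0/memmap_on_memory
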
So I can drop the existing `nuance movable/memmap` section and just
replace it with the hugetlb subtleties x_x.
I appreciate the clarifications here - sorry for the incorrect info and
the increasing confusion.
~Gregory