Message-ID: <e332391c-30fb-49c3-9c05-574b0c486a81@redhat.com>
Date: Wed, 19 Feb 2025 09:53:07 +0100
From: David Hildenbrand <david@...hat.com>
To: Gregory Price <gourry@...rry.net>
Cc: Yang Shi <shy828301@...il.com>, lsf-pc@...ts.linux-foundation.org,
linux-mm@...ck.org, linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
> What's mildly confusing is for pages used for altmap to be accounted for
> as if it's an allocation in vmstat - but for that capacity to be chopped
> out of the memory-block (it "makes sense" it's just subtly misleading).
Would the following make it better or worse?
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4765f2928725c..17a4432427051 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -237,9 +237,12 @@ static int memory_block_online(struct memory_block *mem)
 	 * Account once onlining succeeded. If the zone was unpopulated, it is
 	 * now already properly populated.
 	 */
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  nr_vmemmap_pages);
+	}
 
 	mem->zone = zone;
 	mem_hotplug_done();
@@ -273,17 +276,23 @@ static int memory_block_offline(struct memory_block *mem)
 		nr_vmemmap_pages = mem->altmap->free;
 
 	mem_hotplug_begin();
-	if (nr_vmemmap_pages)
+	if (nr_vmemmap_pages) {
 		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
+		adjust_managed_page_count(pfn_to_page(start_pfn),
+					  -nr_vmemmap_pages);
+	}
 
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
 			    nr_pages - nr_vmemmap_pages, mem->zone, mem->group);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
-		if (nr_vmemmap_pages)
+		if (nr_vmemmap_pages) {
 			adjust_present_page_count(pfn_to_page(start_pfn),
 						  mem->group, nr_vmemmap_pages);
+			adjust_managed_page_count(pfn_to_page(start_pfn),
+						  nr_vmemmap_pages);
+		}
 		goto out;
 	}
 
Then, it would look "just like allocated memory" from that node/zone.
As if the memmap had been allocated immediately when we onlined the memory
(see below).
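To make the difference concrete (just a sketch; "node1" is the dax/kmem
node from your example, and the numbers obviously depend on the block size
and struct page overhead): today the vmemmap pages are simply missing from
the node's MemTotal, while with the hunks above they would be part of
MemTotal and show up as used memory, e.g., when comparing before/after
onlining a memmap_on_memory block:

  grep -E 'MemTotal|MemFree|MemUsed' /sys/devices/system/node/node1/meminfo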
>
> I thought the system was saying I'd allocated memory (from the 'free'
> capacity) instead of just reducing capacity.
The question is whether you want that memory to be hidden from MemTotal
(carveout?) or treated just like allocated memory (allocation?).
If you treat the memmap as "just a memory allocation after early boot",
with "memmap_on_memory" telling you to allocate that memory from the
hotplugged memory instead of from the buddy, then the "carveout" might be
more of an internal implementation detail to achieve that memory
allocation.
>>> stupid question - it sorta seems like you'd want this as the default
>>> setting for driver-managed hotplug memory blocks, but I suppose for
>>> very small blocks there's problems (as described in the docs).
>>
>> The issue is that it is per-memblock. So you'll never have 1 GiB ranges
>> of consecutive usable memory (e.g., 1 GiB hugetlb page).
>>
>
> That makes sense; I had not considered this. Although it only applies
> for small blocks - which is basically an indictment of this suggestion:
>
> https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/
>
> So I'll have to consider this and whether this should be a default.
> It's probably enough to nak this entirely.
>
>
> ... that said ....
>
> Interestingly, when I tried allocating 1GiB hugetlb pages on a dax device
> in ZONE_MOVABLE (without memmap_on_memory) - the allocation fails silently
> regardless of block size (tried both 2GB and 256MB). I can't find a reason
> why this would be the case in the existing documentation.
Right, it only currently works with ZONE_NORMAL, because 1 GiB pages are
considered unmovable in practice (try reliably finding a 1 GiB area to
migrate the memory to during memory unplug ... when these hugetlb things are
unswappable etc.).
I documented it under https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html
"Gigantic pages are unmovable, resulting in user space consuming a lot of unmovable memory."
If we ever support THP in that size range, we might consider them movable
because we can just split/swap them out when allocating a migration target
fails.
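IOW, the ZONE_NORMAL route you describe below is the expected way to get
there. Roughly (only a sketch; the block and node ids are made up, and the
allocation can still fail if no free, properly aligned 1 GiB range is left
on that node):

  # online an offline dax/kmem memory block into ZONE_NORMAL
  echo online_kernel > /sys/devices/system/memory/memory512/state
  # request one 1 GiB hugetlb page on that node and verify
  echo 1 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
  cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages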
>
> (note: hugepage migration is enabled in build config, so it's not that)
>
> If I enable one block (256MB) into ZONE_NORMAL, and the remainder in
> movable (with memmap_on_memory=n), the allocation still fails, and:
>
> nr_slab_unreclaimable 43
>
> in node1/vmstat - where previously there was nothing.
>
> Onlining the dax devices into ZONE_NORMAL successfully allowed 1GiB huge
> pages to allocate.
>
> This used the /sys/bus/node/devices/node1/hugepages/* interfaces to test
>
> Using the /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages with
> interleave mempolicy - all hugepages end up on ZONE_NORMAL.
>
> (v6.13 base kernel)
>
> This behavior is *curious* to say the least. Not sure if bug, or some
> nuance missing from the documentation - but certainly glad I caught it.
See above :)
>
>
>> I thought we had that? See MHP_MEMMAP_ON_MEMORY set by dax/kmem.
>>
>> IIRC, the global toggle must be enabled for the driver option to be considered.
>
> Oh, well, that's an extra layer I missed. So there's:
>
> build:
> CONFIG_MHP_MEMMAP_ON_MEMORY=y
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
> global:
> /sys/module/memory_hotplug/parameters/memmap_on_memory
> device:
> /sys/bus/dax/devices/dax0.0/memmap_on_memory
>
> And looking at it - this does seem to be the default for dax.
>
> So I can drop the existing `nuance movable/memmap` section and just
> replace it with the hugetlb subtleties x_x.
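Right. All three layers can be double-checked at runtime, roughly like this
(dax0.0 just being the example device from above; the config file location
depends on the distro):

  grep MHP_MEMMAP_ON_MEMORY /boot/config-$(uname -r)
  cat /sys/module/memory_hotplug/parameters/memmap_on_memory
  cat /sys/bus/dax/devices/dax0.0/memmap_on_memory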
>
> I appreciate the clarifications here, sorry for the incorrect info and
> the increasing confusion.
No worries. If you have ideas on what to improve in the memory hotplug
docs, please let me know!
--
Cheers,
David / dhildenb