Message-ID: <Z7OWmDXEYhT0BB0X@gourry-fedora-PF4VCD3F>
Date: Mon, 17 Feb 2025 15:05:44 -0500
From: Gregory Price <gourry@...rry.net>
To: lsf-pc@...ts.linux-foundation.org
Cc: linux-mm@...ck.org, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
The story up to now
-------------------
When we left the driver arena, we had created a dax device - which
connects a Soft Reserved iomem resource to one or more `memory blocks`
via the kmem driver. We also discussed a bit about ZONE selection
and default online behavior.
In this section we'll discuss what actually goes into memory block
creation, how those memory blocks are exposed to kernel allocators
(tl;dr: sparsemem / memmap / struct page), and the implications of
the selected memory zones.
-------------------------------------
Step 7: Hot-(un)plug Memory (Blocks).
-------------------------------------
Memory hotplug refers to surfacing physical memory to kernel
allocators (page, slab, cache, etc) - as opposed to the action of
"physically hotplugging" a device into a system (e.g. USB).
Physical memory is exposed to allocators in the form of memory blocks.
A `memory block` is an abstraction describing a physically contiguous
region of memory, or more explicitly a collection of physically
contiguous page frames which is described by a physically contiguous
set of `struct page` structures in the system memory-map.
The system memmap is what is used for pfn-to-page (struct) and
page(struct)-to-pfn conversions. The system memmap has `flat` and
`sparse` modes (configured at build-time). Memory hotplug requires the
use of `sparsemem`, which aptly makes the memory map sparse.
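To make the pfn-to-page relationship concrete, here is a toy model of a
sparse memory map. It is only a sketch: the constants assume x86_64
defaults (4KB pages, 128MB sections), and the `SparseMemmap` class and
its methods are hypothetical names for illustration, not kernel APIs.

```python
# Toy model of sparsemem lookups. Constants are assumed x86_64 defaults.
PAGE_SHIFT = 12                      # 4KB pages
SECTION_SIZE_BITS = 27               # 128MB sections (assumption)
PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT
PAGES_PER_SECTION = 1 << PFN_SECTION_SHIFT


def phys_to_pfn(phys_addr):
    """Physical address -> page frame number."""
    return phys_addr >> PAGE_SHIFT


def pfn_to_section_nr(pfn):
    """Page frame number -> index of the section that contains it."""
    return pfn >> PFN_SECTION_SHIFT


class SparseMemmap:
    """Hypothetical memory map that only tracks sections that are present."""

    def __init__(self):
        self.present = set()

    def add_section(self, section_nr):
        # Stands in for populating `struct page` entries on hot-add.
        self.present.add(section_nr)

    def pfn_to_page(self, pfn):
        # A "page" here is just (section number, index within section).
        nr = pfn_to_section_nr(pfn)
        if nr not in self.present:
            raise KeyError("no memmap for section %d" % nr)
        return (nr, pfn & (PAGES_PER_SECTION - 1))

    def page_to_pfn(self, page):
        nr, idx = page
        return (nr << PFN_SECTION_SHIFT) | idx
```

The point of the model is that holes in the physical address space cost
nothing: only sections that were added have a memmap to look up.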
Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
an active memory block, the pages in-use must have their data (and
therefore mappings) migrated to another memory block. Hot-remove must
be enabled separately from hot-add (see CONFIG_MEMORY_HOTREMOVE below).
Build configurations affecting memory block hot(un)plug
    CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
    CONFIG_SPARSEMEM
    CONFIG_64BIT
    CONFIG_MEMORY_HOTPLUG
    CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
    CONFIG_MHP_MEMMAP_ON_MEMORY
    CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
    CONFIG_MIGRATION
    CONFIG_MEMORY_HOTREMOVE
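One way to check a running kernel against a list like this is to parse
its build config. The sketch below is illustrative only: the helper
names are hypothetical, the subset of options checked is my own pick
(the CONFIG_MHP_DEFAULT_ONLINE_TYPE_* entries are alternatives, so they
are left out), and where the config text comes from is
distro-dependent.

```python
# Subset of the options above that are plain on/off switches (my choice).
HOTPLUG_OPTIONS = [
    "CONFIG_SPARSEMEM",
    "CONFIG_64BIT",
    "CONFIG_MEMORY_HOTPLUG",
    "CONFIG_MEMORY_HOTREMOVE",
    "CONFIG_MIGRATION",
    "CONFIG_MHP_MEMMAP_ON_MEMORY",
]


def parse_kconfig(text):
    """Parse kernel .config text into {option: value}.

    Lines such as '# CONFIG_FOO is not set' are comments and are skipped.
    """
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
    return opts


def missing_options(text, wanted=HOTPLUG_OPTIONS):
    """Return the wanted options that are not built in (=y)."""
    opts = parse_kconfig(text)
    return [name for name in wanted if opts.get(name) != "y"]
```

Feed it the contents of the kernel config file your distribution ships;
an empty result means every option in the checked subset is enabled.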
During early-boot, the kernel finds all SystemRAM memory regions NOT
marked "Special Purpose" and will create memory blocks for these
regions by default. These blocks are defaulted into ZONE_NORMAL
(more on zones shortly).
Memory regions present at boot marked `EFI_MEMORY_SP` have memory blocks
created and hot-plugged by drivers. The same mechanism is used to
hot-add memory physically hotplugged after system boot (i.e. not present
in the EFI Memory Map at boot time).
The DAX/KMEM driver hotplugs memory blocks via the
`add_memory_driver_managed()` function.
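To get a feel for how a physical range maps onto memory blocks, here is
a small sketch. It assumes the block size is read from
`/sys/devices/system/memory/block_size_bytes` (reported in hex) and that
a block's id is its start address divided by the block size; the helper
names are hypothetical.

```python
def parse_block_size(text):
    """Parse block_size_bytes contents: hexadecimal, no 0x prefix."""
    return int(text.strip(), 16)


def blocks_for_range(start, size, block_size):
    """Return the memory block ids covering [start, start + size).

    Assumption: block id == start address // block size. Ranges that are
    not block-aligned are rejected rather than rounded.
    """
    if start % block_size or size % block_size:
        raise ValueError("range is not memory-block aligned")
    first = start // block_size
    return list(range(first, first + size // block_size))
```

For example, with 128MB blocks a 512MB region starting at 4GB would be
covered by four blocks, which show up as memoryN directories in sysfs.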
-------------------------------
Step 8: Page Struct allocation.
-------------------------------
A `memory block` is made up of a collection of physical memory pages,
which must have entries in the system Memory Map - which is managed by
sparsemem on systems with memory (block) hotplug. Sparsemem fills the
memory map with `struct page` for hot-plugged memory.
Here is a rough trace through the (current) stack on how page structs
are populated into the system Memory Map on hotplug.
```
add_memory_driver_managed
  add_memory_resource
    memblock_add_node
    arch_add_memory
      init_memory_mapping
      add_pages
        __add_pages
          sparse_add_section
            section_activate
              populate_section_memmap
                __populate_section_memmap
                  memmap_alloc
                    memblock_alloc_try_nid_raw
                      memblock_alloc_internal
                        memblock_alloc_range_nid
                          kzalloc_node(..., GFP_KERNEL, ...)
```
All allocatable memory requires `struct page` resources to describe the
physical page state. On a system with regular 4KB pages, a 64-byte
`struct page`, and 256GB of memory - 4GB is required just to
describe/manage the memory.
This is ~1.6% of the new capacity to just surface it (4/256).
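The arithmetic behind that figure, as a quick sketch. It assumes a
64-byte `struct page` and 4KB base pages; `memmap_bytes` is a
hypothetical helper, not a kernel function.

```python
STRUCT_PAGE_SIZE = 64        # bytes per struct page (assumed)
GiB = 1 << 30


def memmap_bytes(capacity_bytes, page_size=4096):
    """Bytes of `struct page` needed to describe capacity_bytes."""
    nr_pages = capacity_bytes // page_size
    return nr_pages * STRUCT_PAGE_SIZE


capacity = 256 * GiB
overhead = memmap_bytes(capacity)    # one struct page per 4KB page
ratio = overhead / capacity          # fraction of capacity spent on memmap
```

The ratio is simply 64/4096, so it is the same for any capacity; only
the absolute number grows.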
This becomes an issue if the memory is not intended for kernel-use,
as `struct page` memory must be allocated in non-movable, kernel memory
`zones`. If hot-plugged capacity is designated for a non-kernel zone
(ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.
Matthew Wilcox has a plan to reduce this cost. Some details:
https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/
---------------------
Step 9: Memory Zones.
---------------------
We've alluded to "Memory Zones" in prior sections, with really the only
detail about these concepts being that there are "Kernel-allocation
compatible" and "Movable" zones, as well as some relationship between
memory blocks and memory zones.
The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
For the purpose of this reading we'll consider two basic use-cases:
- memory block hot-unplug
- kernel resource allocation
You can (for the most part) consider these cases incompatible. If the
kernel allocates `struct page` memory from a block, then that block cannot
be hot-unplugged. This memory is typically unmovable (cannot be migrated),
and its pages are unlikely to be removed from the memory map.
There are other scenarios, such as page pinning, that can block hot-unplug.
The individual mechanisms preventing hot-unplug are less important than
their relationship to memory zones.
ZONE_NORMAL basically allows any allocations, including things like page
tables, struct pages, and pinned memory.
ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
The kernel and privileged users can cause long-term pinning to occur -
even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
hot-unplug-ability under normal conditions.
Here's the take-away:
Any capacity marked SystemRAM but not Special Purpose during early boot
will be onlined into ZONE_NORMAL by default - making it available for
kernel-use during boot. There is no guarantee of being hot-unpluggable.
Any capacity marked Special Purpose at boot, or hot-added (physically),
will be onlined into a user-selected zone (Normal or Movable).
There are (at least) 4 ways to select which zone memory blocks are
onlined into.
    Build Time:
        CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
    Boot Time:
        memhp_default_state (boot parameter)
    udev / daxctl:
        user policy explicitly requesting the zone
    memory sysfs:
        echo online_movable > /sys/bus/memory/devices/memoryN/state
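As a sketch of the sysfs route, the helpers below write an online type
to a memory block's `state` attribute. This assumes the per-block
`state` file described in the kernel's memory-hotplug admin guide; the
function names are hypothetical, and the block directory is passed in
so the logic can be tried against a scratch directory first.

```python
import os

# Values accepted when writing the per-block `state` attribute (assumed).
VALID_STATES = ("online", "online_kernel", "online_movable", "offline")


def set_block_state(block_dir, state):
    """Request a state change, e.g. online a block into ZONE_MOVABLE."""
    if state not in VALID_STATES:
        raise ValueError("unknown state: %r" % (state,))
    with open(os.path.join(block_dir, "state"), "w") as f:
        f.write(state)


def get_block_state(block_dir):
    """Read back the current contents of the `state` attribute."""
    with open(os.path.join(block_dir, "state")) as f:
        return f.read().strip()
```

On a real system this needs root, and the write can fail (for example
when offlining a block whose pages cannot be migrated), so callers
should expect OSError.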
------------------------------------------
Nuance: memmap_on_memory and ZONE_MOVABLE.
------------------------------------------
As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
will consume ZONE_NORMAL capacity for its kernel resources. This can
be problematic if a large amount of ZONE_MOVABLE capacity is added on a
system with limited ZONE_NORMAL capacity.
For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
be consumed to allocate `struct page` resources for the ZONE_MOVABLE
capacity - leaving no working memory for the rest of the kernel.
The `memmap_on_memory` configuration option allows for hotplugged memory
blocks to host their own `struct page` allocations...
if they're placed in ZONE_NORMAL.
To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.
Sparsemem allocation of memory map resources ultimately uses a
`kzalloc_node` call, which simply allocates memory from ZONE_NORMAL with
a *suggested* node.
```
memmap_alloc
  memblock_alloc_try_nid_raw
    memblock_alloc_internal
      memblock_alloc_range_nid
        kzalloc_node(..., GFP_KERNEL, ...)
```
The node ID passed in as an argument is a "preferred node", which means
if insufficient space exists on that node to service the GFP_KERNEL
allocation, it will fall back to another node.
If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
1) A portion of the memory block is carved out to allocate memmap
data (reducing usable size by 64 bytes * nr_pages)
2) The memmap memory is actually allocated from ZONE_NORMAL on another node.
Result: Lost capacity due to the unused carve-out area for no value.
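To put a number on that carve-out, here is a simplified per-block
sketch. It assumes 4KB pages, a 64-byte `struct page`, and a 128MB
memory block, and it ignores any extra alignment the kernel may apply;
`carve_out` is a hypothetical helper.

```python
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64        # bytes per struct page (assumed)
MiB = 1 << 20
GiB = 1 << 30


def carve_out(block_size):
    """Return (memmap_bytes, usable_bytes) for one self-hosting block."""
    nr_pages = block_size // PAGE_SIZE
    memmap = nr_pages * STRUCT_PAGE_SIZE
    # Round the memmap up to whole pages (ceiling division).
    memmap_pages = -(-memmap // PAGE_SIZE)
    reserved = memmap_pages * PAGE_SIZE
    return reserved, block_size - reserved


reserved, usable = carve_out(128 * MiB)
nr_blocks = (256 * GiB) // (128 * MiB)
total_reserved = nr_blocks * reserved    # carve-out across all blocks
```

In the scenario described above, that total is set aside in the
hot-plugged blocks while the same amount is still consumed from
ZONE_NORMAL elsewhere.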
--------------------------------
The Complexity Story up til now.
--------------------------------
Platform and BIOS:
    May configure all the devices prior to kernel hand-off.
    May or may not support reconfiguring / hotplug.
BIOS and EFI:
    EFI_MEMORY_SP - used to defer management to drivers
Kernel Build and Boot:
    CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
    CONFIG_SPARSEMEM
    CONFIG_64BIT
    CONFIG_MEMORY_HOTPLUG
    CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
    CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
    CONFIG_MHP_MEMMAP_ON_MEMORY
    CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
    CONFIG_MIGRATION
    CONFIG_MEMORY_HOTREMOVE
    CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
    nosoftreserve - Will always result in CXL as SystemRAM
    kexec - SystemRAM configs carry over to target
    memory_hotplug.memmap_on_memory
Driver Build Options Required
    CONFIG_CXL_ACPI
    CONFIG_CXL_BUS
    CONFIG_CXL_MEM
    CONFIG_CXL_PCI
    CONFIG_CXL_PORT
    CONFIG_CXL_REGION
    CONFIG_DEV_DAX
    CONFIG_DEV_DAX_CXL
    CONFIG_DEV_DAX_KMEM
User Policy
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
    CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
    memhp_default_state (boot param)
    daxctl online-memory daxN.Y (userland)
Nuances
    Early-boot resource re-use
    Memory Block Alignment
    memmap_on_memory + ZONE_MOVABLE
----------------------------------------------------
Next up:
RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
Interleave - RAS and Region Management
~Gregory