Message-ID: <CAHbLzkq6Me6nRaL6b09YxJ_nFkxb+n+M3-q_aJwOs2ZO4q8VCg@mail.gmail.com>
Date: Tue, 18 Feb 2025 09:49:28 -0800
From: Yang Shi <shy828301@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: lsf-pc@...ts.linux-foundation.org, linux-mm@...ck.org,
linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: CXL Boot to Bash - Section 3: Memory (block) Hotplug
On Mon, Feb 17, 2025 at 12:05 PM Gregory Price <gourry@...rry.net> wrote:
>
>
> The story up to now
> -------------------
> When we left the driver arena, we had created a dax device - which
> connects a Soft Reserved iomem resource to one or more `memory blocks`
> via the kmem driver. We also discussed a bit about ZONE selection
> and default online behavior.
>
> In this section we'll discuss what actually goes into memory block
> creation, how those memory blocks are exposed to kernel allocators
> (tl;dr: sparsemem / memmap / struct page), and the implications of
> the selected memory zones.
>
>
> -------------------------------------
> Step 7: Hot-(un)plug Memory (Blocks).
> -------------------------------------
> Memory hotplug refers to surfacing physical memory to kernel
> allocators (page, slab, cache, etc) - as opposed to the action of
> "physically hotplugging" a device into a system (e.g. USB).
>
> Physical memory is exposed to allocators in the form of memory blocks.
>
> A `memory block` is an abstraction that describes a physically
> contiguous region of memory, or more explicitly a collection of physically
> contiguous page frames described by a physically contiguous
> set of `struct page` structures in the system memory-map.
>
> The system memmap is what is used for pfn-to-page (struct) and
> page(struct)-to-pfn conversions. The system memmap has `flat` and
> `sparse` modes (configured at build-time). Memory hotplug requires the
> use of `sparsemem`, which aptly makes the memory map sparse.
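
To make the sparse memory map concrete, here is a toy Python model of the
pfn-to-page lookup. It is only a sketch, not kernel code: the constants
assume x86_64 defaults (4KB pages, 128MB sections), and `SparseMemmap`,
`add_section`, and this `pfn_to_page` are hypothetical names chosen for
illustration.

```python
PAGE_SHIFT = 12            # 4KB pages (assumed)
SECTION_SIZE_BITS = 27     # 128MB sections (assumed, x86_64 default)
PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT
PAGES_PER_SECTION = 1 << PFN_SECTION_SHIFT


class SparseMemmap:
    """Toy model: one memmap array per section, present only when added."""

    def __init__(self):
        # section number -> list of per-page descriptors ("struct page")
        self.sections = {}

    def add_section(self, section_nr):
        # Hot-add: allocate descriptors for this one section only.
        base_pfn = section_nr << PFN_SECTION_SHIFT
        self.sections[section_nr] = [
            {"pfn": base_pfn + i} for i in range(PAGES_PER_SECTION)
        ]

    def pfn_to_page(self, pfn):
        # The high bits of the pfn select the section, the low bits index into it.
        memmap = self.sections.get(pfn >> PFN_SECTION_SHIFT)
        if memmap is None:
            return None  # hole: no descriptor backs this pfn
        return memmap[pfn & (PAGES_PER_SECTION - 1)]
```

The point of the model is that sections which were never added cost
nothing, and a lookup into such a hole simply finds no descriptor.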
>
> Hot *remove* (un-plug) is distinct from Hot add (plug). To hot-remove
> an active memory block, the pages in-use must have their data (and
> therefore mappings) migrated to another memory block. Hot-remove must
> be specifically enabled, separately from hot-add.
>
>
> Build configurations affecting memory block hot(un)plug
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
> CONFIG_SPARSEMEM
> CONFIG_64BIT
> CONFIG_MEMORY_HOTPLUG
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
> CONFIG_MHP_MEMMAP_ON_MEMORY
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> CONFIG_MIGRATION
> CONFIG_MEMORY_HOTREMOVE
>
> During early-boot, the kernel finds all SystemRAM memory regions NOT
> marked "Special Purpose" and will create memory blocks for these
> regions by default. These blocks are defaulted into ZONE_NORMAL
> (more on zones shortly).
>
> Memory regions present at boot marked `EFI_MEMORY_SP` have memory blocks
> created and hot-plugged by drivers. The same mechanism is used to
> hot-add memory physically hotplugged after system boot (i.e. not present
> in the EFI Memory Map at boot time).
>
> The DAX/KMEM driver hotplugs memory blocks via the
> `add_memory_driver_managed()` function.
>
>
> -------------------------------
> Step 8: Page Struct allocation.
> -------------------------------
> A `memory block` is made up of a collection of physical memory pages,
> which must have entries in the system Memory Map - which is managed by
> sparsemem on systems with memory (block) hotplug. Sparsemem fills the
> memory map with `struct page` for hot-plugged memory.
>
> Here is a rough trace through the (current) stack on how page structs
> are populated into the system Memory Map on hotplug.
>
> ```
> add_memory_driver_managed
> add_memory_resource
> memblock_add_node
> arch_add_memory
> init_memory_mapping
> add_pages
> __add_pages
> sparse_add_section
> section_activate
> populate_section_memmap
> __populate_section_memmap
> memmap_alloc
> memblock_alloc_try_nid_raw
> memblock_alloc_internal
> memblock_alloc_range_nid
> kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> All allocatable memory requires `struct page` resources to describe the
> physical page state. On a system with regular 4KB pages and 256GB
> of memory, 4GB is required just to describe/manage the memory
> (64M pages * 64 bytes per `struct page`).
>
> This is ~1.6% of the new capacity to just surface it (4/256).
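
The arithmetic behind that figure can be checked with a few lines of
Python. This is a sketch under stated assumptions: 4KB base pages and a
64-byte `struct page` (typical on 64-bit builds, but not guaranteed);
`memmap_bytes` is a hypothetical helper name.

```python
PAGE_SIZE = 4096          # 4KB base pages (assumed)
STRUCT_PAGE_SIZE = 64     # bytes per struct page (assumed, typical on 64-bit)
GIB = 1 << 30


def memmap_bytes(capacity_bytes, page_size=PAGE_SIZE,
                 struct_page_size=STRUCT_PAGE_SIZE):
    """Bytes of struct page needed to describe capacity_bytes of memory."""
    nr_pages = capacity_bytes // page_size
    return nr_pages * struct_page_size


capacity = 256 * GIB
overhead = memmap_bytes(capacity)
print(overhead // GIB, "GiB")                # 4 GiB
print(f"{100 * overhead / capacity:.2f}%")   # 1.56%
```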
>
> This becomes an issue if the memory is not intended for kernel-use,
> as `struct page` memory must be allocated in non-movable, kernel memory
> `zones`. If hot-plugged capacity is designated for a non-kernel zone
> (ZONE_MOVABLE, ZONE_DEVICE, etc), then there must be sufficient
> ZONE_NORMAL (or similar kernel-compatible zone) to allocate from.
>
> Matthew Wilcox has a plan to reduce this cost, some details of his plan:
> https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/
> https://lore.kernel.org/all/Z37pxbkHPbLYnDKn@casper.infradead.org/
>
>
> ---------------------
> Step 9: Memory Zones.
> ---------------------
> We've alluded to "Memory Zones" in prior sections, with really the only
> detail about these concepts being that there are "Kernel-allocation
> compatible" and "Movable" zones, as well as some relationship between
> memory blocks and memory zones.
>
> The two zones we really care about are `ZONE_NORMAL` and `ZONE_MOVABLE`.
>
> For the purpose of this reading we'll consider two basic use-cases:
> - memory block hot-unplug
> - kernel resource allocation
>
> You can (for the most part) consider these cases incompatible. If the
> kernel allocates `struct page` memory from a block, then that block cannot
> be hot-unplugged. This memory is typically unmovable (cannot be migrated),
> and its pages are unlikely to be removed from the memory map.
>
> There are other scenarios, such as page pinning, that can block hot-unplug.
> The individual mechanisms preventing hot-unplug are less important than
> their relationship to memory zones.
>
> ZONE_NORMAL basically allows any allocations, including things like page
> tables, struct pages, and pinned memory.
>
> ZONE_MOVABLE, under normal conditions, disallows most kernel allocations.
>
> ZONE_MOVABLE does NOT make a *strong* guarantee of hot-unplug-ability.
> The kernel and privileged users can cause long-term pinning to occur -
> even in ZONE_MOVABLE. It should be seen as a best-attempt at providing
> hot-unplug-ability under normal conditions.
>
>
> Here's the take-away:
>
> Any capacity marked SystemRAM but not Special Purpose during early boot
> will be onlined into ZONE_NORMAL by default - making it available for
> kernel-use during boot. There is no guarantee of being hot-unpluggable.
>
> Any capacity marked Special Purpose at boot, or hot-added (physically),
> will be onlined into a user-selected zone (Normal or Movable).
>
> There are (at least) 4 ways to select which zone memory blocks are onlined into.
>
> Build Time:
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
> Boot Time:
> memhp_default_state (boot parameter)
> udev / daxctl:
> user policy explicitly requesting the zone
> memory sysfs:
> echo online_movable > /sys/bus/memory/devices/memoryN/state
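
For the memory sysfs route, a minimal Python sketch of the same request
follows. It writes to the block's `state` attribute, which is the one
that accepts zone-specific strings such as `online_movable`. The helper
name `online_memory_block` and its `sysfs_root` parameter are
hypothetical; the parameter exists only so the function can be pointed
at a scratch directory instead of a live system. Writing to the real
sysfs tree requires root.

```python
import os

VALID_STATES = {"online", "online_kernel", "online_movable", "offline"}


def online_memory_block(block_nr, online_type="online_movable",
                        sysfs_root="/sys/devices/system/memory"):
    """Write online_type to memory<block_nr>/state and return the path used."""
    if online_type not in VALID_STATES:
        raise ValueError(f"unsupported state: {online_type!r}")
    path = os.path.join(sysfs_root, f"memory{block_nr}", "state")
    with open(path, "w") as f:
        f.write(online_type)
    return path
```

On a real system the write can still fail, for example when the block is
already online in a different zone, so callers should expect `OSError`.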
>
>
> ------------------------------------------
> Nuance: memmap_on_memory and ZONE_MOVABLE.
> ------------------------------------------
> As alluded to in the prior sections - hot-added ZONE_MOVABLE capacity
> will consume ZONE_NORMAL capacity for its kernel resources. This can
> be problematic if vast amounts of ZONE_MOVABLE capacity are added on a
> system with limited ZONE_NORMAL capacity.
>
> For example, consider a system with 4GB of ZONE_NORMAL and 256GB of
> ZONE_MOVABLE. This wouldn't work, as the entirety of ZONE_NORMAL would
> be consumed to allocate `struct page` resources for the ZONE_MOVABLE
> capacity - leaving no working memory for the rest of the kernel.
>
> The `memmap_on_memory` configuration option allows for hotplugged memory
> blocks to host their own `struct page` allocations...
>
> if they're placed in ZONE_NORMAL.
>
> To enable, use the boot param: `memory_hotplug.memmap_on_memory=1`.
>
> Sparsemem allocation of memory map resources ultimately uses a
> `kzalloc_node` call, which simply allocates memory from ZONE_NORMAL with
> a *suggested* node.
>
> ```
> memmap_alloc
> memblock_alloc_try_nid_raw
> memblock_alloc_internal
> memblock_alloc_range_nid
> kzalloc_node(..., GFP_KERNEL, ...)
> ```
>
> The node ID passed in as an argument is a "preferred node", which means
> that if insufficient space exists on that node to service the GFP_KERNEL
> allocation, it will fall back to another node.
>
> If all hot-plugged memory is added to ZONE_MOVABLE, two things occur:
>
> 1) A portion of the memory block is carved out to hold memmap
> data (reducing usable size by 64B * nr_pages)
>
> 2) The memory is allocated on ZONE_NORMAL on another node.
Nice write-up, thanks for putting everything together. A follow-up
question on this: do you mean the memmap memory will show up as a new
node with ZONE_NORMAL only, alongside the other hot-plugged memory blocks?
So will we actually see two nodes hot-plugged?
Thanks,
Yang
>
> Result: Lost capacity due to the unused carve-out area for no value.
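
To put a number on the carve-out described above, here is a short
sketch of the per-block arithmetic. It assumes 4KB pages, a 64-byte
`struct page`, and a 128MB memory block (block size varies by platform);
`memmap_carveout` is a hypothetical helper name, and the rounding up to
whole pages is an illustrative simplification.

```python
PAGE_SIZE = 4096          # 4KB base pages (assumed)
STRUCT_PAGE_SIZE = 64     # bytes per struct page (assumed)
MIB = 1 << 20


def memmap_carveout(block_bytes, page_size=PAGE_SIZE,
                    struct_page_size=STRUCT_PAGE_SIZE):
    """Bytes reserved inside a block to hold its own memmap, in whole pages."""
    nr_pages = block_bytes // page_size
    memmap = nr_pages * struct_page_size
    carve_pages = -(-memmap // page_size)  # ceiling division
    return carve_pages * page_size


block = 128 * MIB         # assumed memory block size
carve = memmap_carveout(block)
print(carve // MIB, "MiB carved out of", block // MIB, "MiB")
```

Under these assumptions each 128MB block gives up 2MB, the same ratio as
the 4GB per 256GB figure quoted earlier.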
>
> --------------------------------
> The Complexity Story up til now.
> --------------------------------
> Platform and BIOS:
> May configure all the devices prior to kernel hand-off.
> May or may not support reconfiguring / hotplug.
>
> BIOS and EFI:
> EFI_MEMORY_SP - used to defer management to drivers
>
> Kernel Build and Boot:
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG
> CONFIG_SPARSEMEM
> CONFIG_64BIT
> CONFIG_MEMORY_HOTPLUG
> CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_AUTO
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE
> CONFIG_MHP_MEMMAP_ON_MEMORY
> CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> CONFIG_MIGRATION
> CONFIG_MEMORY_HOTREMOVE
> CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
> nosoftreserve - Will always result in CXL as SystemRAM
> kexec - SystemRAM configs carry over to target
> memory_hotplug.memmap_on_memory
>
> Driver Build Options Required
> CONFIG_CXL_ACPI
> CONFIG_CXL_BUS
> CONFIG_CXL_MEM
> CONFIG_CXL_PCI
> CONFIG_CXL_PORT
> CONFIG_CXL_REGION
> CONFIG_DEV_DAX
> CONFIG_DEV_DAX_CXL
> CONFIG_DEV_DAX_KMEM
>
> User Policy
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
> CONFIG_MHP_DEFAULT_ONLINE_TYPE (>=v6.14)
> memhp_default_state (boot param)
> daxctl online-memory daxN.Y (userland)
>
> Nuances
> Early-boot resource re-use
> Memory Block Alignment
> memmap_on_memory + ZONE_MOVABLE
>
> ----------------------------------------------------
> Next up:
> RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
> Interleave - RAS and Region Management
>
> ~Gregory
>