Message-ID: <aXErSP1zKnv4koKG@gourry-fedora-PF4VCD3F>
Date: Wed, 21 Jan 2026 14:38:48 -0500
From: Gregory Price <gourry@...rry.net>
To: linux-cxl@...r.kernel.org
Cc: dan.j.williams@...el.com, dave.jiang@...el.com,
jonathan.cameron@...wei.com, alison.schofield@...el.com,
ira.weiny@...el.com, dave@...olabs.net,
linux-kernel@...r.kernel.org, gourry@...rry.net,
kernel-team@...a.com, vishal.l.verma@...el.com, david@...nel.org,
benjamin.cheatham@....com
Subject: cxl/region.c improvements and DAX/Hotplug plumbing
Jonathan asked me to summarize my roadmap/thoughts, so below is the
gist of it: observations, high level design details, some patches.
@David (Hildenbrand): I CC'd you due to DAX and MHP discussion.
My larger motivation is identifying and solving friction between
different use-cases trying to leverage a single backend: DAX
My personal motivation is to drive towards a defensible abstraction
for CXL-backed N_PRIVATE_MEMORY nodes.
~Gregory
<great wall of text>
=================================================================
TL;DR:
The current CXL-DAX glue is not flexible enough for everyone.
Let's treat dax_region as a specific mode of operation, and
offer additional x_region modes as "region drivers" with
their own policies on how to handle the memory capacity.
=================================================================
Overview
========
To me - this appears to all intersect at drivers/cxl/core/region.c
Today: regionN/ has a static "backend interface" when it's created.
for ram_region - DAX (or NONE for BIOS config SysRAM)
for pmem_region - NVDIMM
for dc_region - DAX? (not upstream/settled)
The DAX plumbing (as-is) lacks some flexibility to handle multiple
use-cases - especially DCD and some accelerator features.
For current-use, the DAX glue (dax_region) has some rough edges:
- per-region dax-driver preference not plumbed (kmem, devdax, fsdax)
- per-region auto online preference not plumbed (kmem)
- per-region hotplug protection not plumbed (memblock races)
- DCD implies sparseness (runtime allocations) - but no sparse-DAX.
- DCD Tags may be needed by user software, but no ABI (dax/uuid?).
- DCD Tags imply consumption policy, but no infrastructure.
- Onlining as NUMA is all or nothing:
Whole system gets access or driver has to write own mm services
Some policy above can be mutually exclusive.
Example: Can't have driver-wide auto-probe() auto-online policy on
systems with multiple devices using the dax glue.
This drives userland complexity - tools have to understand
multiple subsystems.
Example: Tag-consumption policies may differ between use cases.
sysram - might ignore
famfs - might use as filesystem information
virtio - might use as routing info for target VM
So putting auto-configured regions aside (tl;dr: BIOS people pls no),
the core proposal is to formalize:
cxl_region.region_driver
that encodes some of this common policy.
Some of the simple backends might be userland exposed, for example:
cxl create-region --driver=sysram --auto-online=online_movable
cxl create-region --driver=dax --daxdev=fsdax
Some of these backends might be intended for building device drivers
my_famfs_driver.c
cxl_create_dcd_region(..., my_dcd_hotplug_callbacks);
/* May use tags as filesystem information */
(Don't worry John - just an example, not prescriptive)
my_virtio_driver.c
cxl_create_dcd_region(..., my_dcd_hotplug_callbacks);
/* May use tags for routing capacity to VMs */
my_accelerator_driver.c
cxl_create_private_region(..., my_callbacks, NODE_TYPE_ACCEL);
/* Wants memory as a NUMA node, but isolated from allocations */
This also encourages some amount of code re-use:
the core sysram driver can be the same for static-regions and
dcd, but dcd calls hotplug()/unplug() functions at runtime.
It also encourages upstreaming/specification of some operations.
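A minimal userspace sketch of the re-use idea above, under stated assumptions: `struct region_driver_ops`, `sysram_add()`, `sysram_remove()`, `static_region_setup()` and `dcd_region_extent_added()` are hypothetical names for illustration, not existing kernel symbols. The same sysram ops serve a static region (called once at setup) and a DCD region (called per extent at runtime).

```c
#include <stdint.h>

/* Hypothetical: one capacity range handed to a region driver. */
struct extent {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical ops table; not an existing kernel structure. */
struct region_driver_ops {
	const char *name;
	int (*add)(void *priv, const struct extent *ext);
	int (*remove)(void *priv, const struct extent *ext);
};

/* Toy sysram backend: only tracks how many bytes are plugged. */
struct sysram_state {
	uint64_t plugged_bytes;
};

static int sysram_add(void *priv, const struct extent *ext)
{
	struct sysram_state *st = priv;

	st->plugged_bytes += ext->len;
	return 0;
}

static int sysram_remove(void *priv, const struct extent *ext)
{
	struct sysram_state *st = priv;

	if (ext->len > st->plugged_bytes)
		return -1; /* more than was ever plugged */
	st->plugged_bytes -= ext->len;
	return 0;
}

static const struct region_driver_ops sysram_ops = {
	.name = "sysram",
	.add = sysram_add,
	.remove = sysram_remove,
};

/* Static region: the whole range is added once, at setup. */
static int static_region_setup(const struct region_driver_ops *ops,
			       void *priv, uint64_t base, uint64_t size)
{
	struct extent whole = { .start = base, .len = size };

	return ops->add(priv, &whole);
}

/* DCD region: every extent event calls the same op at runtime. */
static int dcd_region_extent_added(const struct region_driver_ops *ops,
				   void *priv, const struct extent *ext)
{
	return ops->add(priv, ext);
}
```

The only difference between the two region types in this sketch is who calls `add()` and when; the backend itself is shared.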
My list of current discrete steps (some serial, some parallel):
1) Internally formalize cxl_region.region_driver (no ABI exposure)
2) Plumb additional information through to DAX based on driver
- dax-driver mode preference
- uuid for tagged capacity
3) Create explicit sysram_driver
- Write in terms of DCD
- Tagged Extents: use DAX glue to manage set of tagged extents
- Untagged Extents: Hotplug and manage directly
- new ABI: `region0/region_driver` - switch between [dax,sysram]
4) Plumb additional hotplug policy from CXL into DAX and MHP
- dax0.0/hotplug (atomic operation on all blocks)
- cxl region auto-online policy (region0/rctl/auto-online)
- block-protection policy? (memory_notifier controls)
- hiding memory blocks? (discussed in last meeting)
- ABI: `region0/rctl/*` controls
5) Formalize DCD dax_region driver use
- each extent list = new dax device in devdax mode
- tags enforced to be globally unique
- dax_region.add_extents(tag, extent_list)
-> create new daxN.0
-> expose daxN.0/uuid
- dax_region.remove_extents(extent_list)
- dax_region.remove_tagged_extents(tag)
6) Formalize DCD sysram_region driver use
- sysram_region.add_extents(tag, extent_list)
-> untagged capacity managed as individual memory blocks
-> tagged capacity managed with DAX glue
- sysram_region.remove_extents(extent_list) (untagged)
- sysram_region.remove_tagged_extents(tag) (tagged)
7) Add private_region infrastructure
- private_region driver design
- N_PRIVATE_MEMORY infrastructure
- derivative driver (in my case compressed memory)
- Probably wants memory_blocks hiding and/or restricted operations
========================================================
Specific problem descriptions and ABI/NDCTL implications
========================================================
--------------------------------
Problem: Per-region usage policy
--------------------------------
Use-case-driven requirements are testing the limits of the existing
region driver and dax integration designs, and encoding the policies
related to them in region.c is going to get cumbersome.
Use-case 1: Static Volatile RAM (none, dax_region w/ single dax dev)
Use-case 2: Static PMEM (pmem_region/NVDIMM)
Use-case 3: DCD SysRAM (sysram_region w/ hotplug)
Use-case 4: SP Anon Memory (compressed_region - private_region)
Use-case 5: Static FAMFS Region (dax_region w/ single daxdev)
Use-case 6: DCD FAMFS Region (dax_region w/ multi-daxdev)
Use-case 7: Accelerator Memory (private_region)
"Private" here means exposure to rest of the system is driver-defined
but there may be re-usable infrastructure.
The CXL driver is the right place to expose the region driver choice.
- Users use common memory region types (sysram, dax) from ABI/CLI.
- Device drivers can register region types w/ default operations.
- Special devices implement advanced usage policy w/ private region.
Solution: Discrete regionN backend drivers, I list some above.
(none) - Static SYSRAM Region setup by BIOS
DAX - has multiple modes (devdax, kmem, fsdax)
sysram - Dynamic SYSRAM region w/ more functionality
private - integrate w/ N_PRIVATE_MEMORY infrastructure
Region drivers re-used for multiple region types (e.g. ram vs dcd).
- ram_region w/ sysram driver calls add()/remove() at setup/teardown.
- dc_region w/ sysram driver calls add()/remove() at runtime.
ABI: (RW) regionN/region_driver
Read: Displays what region driver is assigned
Write: Changing an uncommitted region's underlying driver
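A rough sketch of the read/write semantics described for `regionN/region_driver`, modeled in plain userspace C. All names (`struct region`, `region_driver_show()`, `region_driver_store()`) are hypothetical, and returning `-EBUSY` for a committed region is an assumption, not something settled in this proposal.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

enum region_driver_type { REGION_DRV_DAX, REGION_DRV_SYSRAM };

/* Hypothetical, stripped-down region: only what the attribute needs. */
struct region {
	bool committed;
	enum region_driver_type driver;
};

static const char *const region_driver_names[] = {
	[REGION_DRV_DAX] = "dax",
	[REGION_DRV_SYSRAM] = "sysram",
};

/* Read: display which region driver is assigned. */
static const char *region_driver_show(const struct region *r)
{
	return region_driver_names[r->driver];
}

/* Write: only an uncommitted region may change its driver. */
static int region_driver_store(struct region *r, const char *buf)
{
	size_t i;
	size_t nr = sizeof(region_driver_names) / sizeof(region_driver_names[0]);

	if (r->committed)
		return -EBUSY; /* assumption: error code not specified */

	for (i = 0; i < nr; i++) {
		if (strcmp(buf, region_driver_names[i]) == 0) {
			r->driver = (enum region_driver_type)i;
			return 0;
		}
	}
	return -EINVAL; /* unknown driver name */
}
```

A real sysfs store handler would also need to tolerate the trailing newline that `echo` appends; that is left out here.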
ABI: regionN/rctl/*
Exposes region_driver specific controls / information
example: auto-online policy for sysram_region
ndctl extension:
cxl create-region --driver=_____
Starting Patch Link:
https://lore.kernel.org/linux-cxl/20260113202138.3021093-1-gourry@gourry.net/
---------------------------------------------------------
Problem: SysRAM Auto-Hotplug policy is too broadly scoped
---------------------------------------------------------
Hotplug SYSRAM indirection through DAX leads to complex auto-online
interactions and/or current policy options are too broad in scope.
(e.g. MHP_AUTO_ONLINE build option is bad cross-platform)
Solution 1: Plumb auto-online policy from cxl_region into dax_kmem
Build Options:
Default auto-online policy for auto-regions?
Moves scope from MHP-Global to CXL-local
ABI: dax_region - regionN/rctl/auto-online
Gives the region creator a chance to define before probe()
Solution 2: Make a dedicated sysram_region with policy
May want both solutions longer term (for tagged DCD capacity)
ndctl extension:
cxl create-region --driver=sysram --auto-online=movable ?
---------------------------------------------
Annoyance: DAX driver binding could be easier
---------------------------------------------
dax_region encodes a default dax device type
- RAM wants kmem
- other users might want fsdax, devdax
- Other tools can bind the wrong driver
If your DAX use-case is not the default, more setup steps required.
Solution:
Plumb dax driver default / restriction from cxl_region through to
DAX. Disallow bind-operation (-ENOSUPP) based on that policy.
We can't prevent unbind, but we can prevent bad-bind.
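To illustrate the "prevent bad-bind" idea, here is a small sketch assuming the region passes a mask of permitted dax modes down to DAX. `struct dax_bind_policy` and `dax_check_bind()` are hypothetical names, and the sketch uses `-EOPNOTSUPP` purely because it is available in userspace `errno.h`; the exact error code is an open choice.

```c
#include <errno.h>

/* Hypothetical mode bits for the dax drivers named above. */
enum dax_mode {
	DAX_MODE_DEVDAX = 1 << 0,
	DAX_MODE_KMEM   = 1 << 1,
	DAX_MODE_FSDAX  = 1 << 2,
};

/* Hypothetical policy a cxl_region would hand to its dax_region. */
struct dax_bind_policy {
	unsigned int allowed_modes; /* bitmask of enum dax_mode */
	enum dax_mode default_mode; /* bound automatically at probe */
};

/* Called on a bind attempt: reject any mode the region did not allow. */
static int dax_check_bind(const struct dax_bind_policy *policy,
			  enum dax_mode requested)
{
	if (!(policy->allowed_modes & requested))
		return -EOPNOTSUPP;
	return 0;
}
```

With a policy of `{ .allowed_modes = DAX_MODE_KMEM, .default_mode = DAX_MODE_KMEM }`, a tool trying to bind devdax to a RAM region would get an error instead of silently succeeding.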
ndctl extension:
cxl create-region --driver=dax --daxmode=[devdax,kmem,...]
Backward Compatibility:
The current ndctl w/o new args would essentially be
cxl create-region --driver=dax --daxmode=devdax
And all the follow up operations would work as-is.
---------------------------------------------------------------
Problem/Annoyance: DAX kmem per-block operation race conditions
---------------------------------------------------------------
DAX exposes SYSRAM regions as individual memory blocks, which
creates race conditions when trying to manage a set of blocks.
Example: udev can have an auto-onlining policy that twiddles
memory_block bits while cxl driver is trying to unplug.
Affects: DCD, SysRAM, potentially N_PRIVATE_MEMORY
Solution 1: [unplug, online, online_movable] > dax0.0/hotplug
Does operation on all blocks under the hotplug lock.
Solution 2: dedicated sysram_region driver w/ or w/o DAX.
Can support sparseness w/o DAX (see DCD problem)
Could use DAX for tagged DCD regions.
Tradeoff: May duplicate some DAX logic.
Solution 3: Hide nodeN/memory_block's w/ MHP Flag.
Issue: Possibly userland breaking.
Solution 4: Prevent non-driver actions from changing state.
Also solves hotplug protection problem (see next)
Patch: Implements solution 1
https://lore.kernel.org/linux-cxl/20260114235022.3437787-5-gourry@gourry.net/
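The shape of Solution 1 can be sketched in userspace as follows. This is not the linked patch; `struct dax_dev_model`, `dax_hotplug_all()` and `dax_block_set_state()` are hypothetical, and a pthread mutex stands in for the kernel's hotplug lock. The point is only that the whole-device operation and the per-block path serialize on the same lock, so a per-block change cannot land in the middle of a set-wide operation.

```c
#include <pthread.h>
#include <stddef.h>

enum block_state { BLOCK_OFFLINE, BLOCK_ONLINE, BLOCK_ONLINE_MOVABLE };

#define MAX_BLOCKS 8 /* arbitrary limit for the sketch */

struct dax_dev_model {
	pthread_mutex_t hotplug_lock; /* stand-in for the hotplug lock */
	enum block_state blocks[MAX_BLOCKS];
	size_t nr_blocks;
};

/* Whole-device operation: every block changes under one lock hold. */
static void dax_hotplug_all(struct dax_dev_model *dev, enum block_state target)
{
	size_t i;

	pthread_mutex_lock(&dev->hotplug_lock);
	for (i = 0; i < dev->nr_blocks; i++)
		dev->blocks[i] = target;
	pthread_mutex_unlock(&dev->hotplug_lock);
}

/* Per-block path (e.g. a udev rule) has to take the same lock. */
static void dax_block_set_state(struct dax_dev_model *dev, size_t idx,
				enum block_state target)
{
	pthread_mutex_lock(&dev->hotplug_lock);
	if (idx < dev->nr_blocks)
		dev->blocks[idx] = target;
	pthread_mutex_unlock(&dev->hotplug_lock);
}
```

Real onlining/offlining can fail per block, so an actual implementation would also need error handling and rollback, which this sketch omits.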
--------------------------------------------------------------
Problem: SYSRAM or N_PRIVATE want memory_block policy controls
--------------------------------------------------------------
A SYSRAM or N_PRIVATE region may have an implied zone-policy to
protect - or N_PRIVATE blocks may want to restrict any operation.
Privileged userspace action could do this:
cat memoryN/state => online_movable
cat memoryN/valid_zones => movable
echo offline > memoryN/state => offline
echo online > memoryN/state => online
cat memoryN/valid_zones => normal
- A DCD driver wants to try to protect hotpluggability.
- userspace has no business twiddling private_region blocks.
Solution: Prevent non-driver actions from changing state.
Essentially, add memory_notifier to region_driver or DAX
that rejects operations according to driver-defined policy.
May not require explicit configuration; could be encoded in default
region driver policy (e.g. dcd implies protection).
Example Patch:
https://lore.kernel.org/linux-cxl/20260114235022.3437787-6-gourry@gourry.net/
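A standalone sketch of the "reject non-driver actions" policy, not the linked patch. The kernel's memory notifier chain lets a callback veto a pending online/offline; here that is modeled with a local `enum verdict` rather than the kernel's notifier return codes. `struct block_policy` and `region_memory_notify()` are hypothetical names.

```c
#include <stdbool.h>

/* Modeled events: a block is about to go online or offline. */
enum mem_action { MEM_ACT_GOING_ONLINE, MEM_ACT_GOING_OFFLINE };

/* Local stand-in for a notifier's allow/veto result. */
enum verdict { VERDICT_ALLOW, VERDICT_REJECT };

/* Hypothetical per-region policy consulted by the notifier. */
struct block_policy {
	bool driver_in_progress; /* set by the region driver around its own ops */
	bool allow_user_online;
	bool allow_user_offline;
};

/* Driver-initiated changes pass; everything else follows the policy. */
static enum verdict region_memory_notify(const struct block_policy *policy,
					 enum mem_action action)
{
	if (policy->driver_in_progress)
		return VERDICT_ALLOW;

	switch (action) {
	case MEM_ACT_GOING_ONLINE:
		return policy->allow_user_online ? VERDICT_ALLOW : VERDICT_REJECT;
	case MEM_ACT_GOING_OFFLINE:
		return policy->allow_user_offline ? VERDICT_ALLOW : VERDICT_REJECT;
	}
	return VERDICT_REJECT;
}
```

A DCD default could leave both `allow_user_*` fields false, which blocks the offline/online sequence shown above that moves a block from movable to normal.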
-----------------------------------------------------
Problem: DCD Tags are confusing and make people angry
-----------------------------------------------------
DCD untagged extent sets are confusing and make people angry.
DCD tagged extent sets are confusing and make people angry.
Solution: Per region_driver policy
Example 1: SysRAM
Linux cares about memory-block aligned contiguous chunks.
Everything else is basically an opini... policy.
My opinion
----------
Untagged extents:
Managed individually, and don't need a DAX device to
online (hotplug directly from sysram_region.c).
May be sparse.
Even if they arrive together, they may be released separately.
Tagged extents have two options:
Manage set of extents as a collective block: dax0.0/hotplug
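As a worked example of "memory-block aligned contiguous chunks": a sketch that trims an extent down to the part that can be hotplugged as whole memory blocks. `struct range` and `extent_hotpluggable_range()` are hypothetical, the block size is a parameter (it is platform dependent), and it is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Compute the largest memory-block aligned sub-range of an extent.
 * Returns false when the extent does not cover a single whole block.
 * Assumes block_size is a power of two and start + len does not overflow.
 */
static bool extent_hotpluggable_range(struct range ext, uint64_t block_size,
				      struct range *out)
{
	uint64_t mask = block_size - 1;
	uint64_t start = (ext.start + mask) & ~mask;  /* round start up */
	uint64_t end = (ext.start + ext.len) & ~mask; /* round end down */

	if (end <= start)
		return false;

	out->start = start;
	out->len = end - start;
	return true;
}
```

For example, with 128 MiB blocks, an extent starting at 0x1000 with a length of 384 MiB yields a usable range of 256 MiB starting at 128 MiB; the unaligned head and tail are left unplugged.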
Example 2: DAX (FAMFS)
Tags may actually mean something.
Linux should enforce globally unique tags per set of extents.
Each tagged set of extents comes/goes collectively.
Sparseness not allowed
set(A) and set(B) have unique tags
set(N) arrives together w/ MORE=1 set in logs.
Each tagged set is exposed as a separate dax device.
DAX likely requires a dax0.0/uuid to provide consumers info.
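The DAX (FAMFS) policy above could be modeled like this. It is a sketch under assumptions: tags are treated as 16-byte UUID-sized values, each tagged set maps to one slot standing in for a daxN.0 device, and `dax_region_add_extents()` / `dax_region_remove_tagged_extents()` only mirror the operation names from the roadmap list rather than any existing function.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define TAG_LEN      16 /* assumption: UUID-sized tag */
#define MAX_DAX_DEVS 4  /* arbitrary limit for the sketch */

struct extent {
	uint64_t start;
	uint64_t len;
};

/* One slot per tagged extent set; stands in for a daxN.0 device. */
struct tagged_dax_dev {
	int in_use;
	uint8_t tag[TAG_LEN]; /* what a daxN.0/uuid attribute would expose */
	uint64_t total_len;
};

struct dax_region_model {
	struct tagged_dax_dev devs[MAX_DAX_DEVS];
};

/* Returns the new device index, -EEXIST on a duplicate tag, -ENOSPC if full. */
static int dax_region_add_extents(struct dax_region_model *r,
				  const uint8_t *tag,
				  const struct extent *exts, size_t nr)
{
	int slot = -1;
	int i;
	size_t n;

	for (i = 0; i < MAX_DAX_DEVS; i++) {
		if (!r->devs[i].in_use) {
			if (slot < 0)
				slot = i;
			continue;
		}
		if (memcmp(r->devs[i].tag, tag, TAG_LEN) == 0)
			return -EEXIST; /* tags must be globally unique */
	}
	if (slot < 0)
		return -ENOSPC;

	r->devs[slot].in_use = 1;
	memcpy(r->devs[slot].tag, tag, TAG_LEN);
	r->devs[slot].total_len = 0;
	for (n = 0; n < nr; n++)
		r->devs[slot].total_len += exts[n].len;
	return slot;
}

/* The whole tagged set goes away collectively. */
static int dax_region_remove_tagged_extents(struct dax_region_model *r,
					    const uint8_t *tag)
{
	int i;

	for (i = 0; i < MAX_DAX_DEVS; i++) {
		if (r->devs[i].in_use &&
		    memcmp(r->devs[i].tag, tag, TAG_LEN) == 0) {
			memset(&r->devs[i], 0, sizeof(r->devs[i]));
			return 0;
		}
	}
	return -ENOENT;
}
```

Because a set arrives and leaves as a unit, there is no per-extent removal path in this model, which matches the "sparseness not allowed" rule for this use case.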
Example 3: virtio
Tags may imply destination VM capacity
In this case, a tag is essentially just routing data.
TL;DR:
Implementing region_drivers lets us break up the tag debates
into discrete use-case silos.
---------------------------------------------
Problem: "Special" Device memory usage policy
---------------------------------------------
Memory devices may have special features that dictate use patterns.
They may also prefer using core mm/ services for basic operation.
(page_alloc, reclaim, migration, etc)
But: This memory shouldn't be exposed as "Normal System RAM".
Solution: N_PRIVATE_MEMORY node_state
CXL Driver Piece: private_region driver
These drivers would know how to register N_PRIVATE_MEMORY
Would also allow device-specific usage behavior to be written.
Would likely be used by upper layer drivers rather than uapi.
Example: Compressed Memory
general service can use page_alloc() for get_page_from_freelist()
region_driver registers memory on a compressed memory node
vmscan.c/memory-tiers.c calls back to driver to handle migration
Example: Accelerator Memory Region
Accel library/driver does node-based allocs.
Driver callbacks might include write-faults (ZONE_DEVICE-esque
pattern that passes page ownership between CPU/GPU)
Either way, driver applies mapping policy w/o accounting cargo
Example: Slow(er) memory
Some memory is "just memory", but might be particularly slow and
intended for use as a filesystem backend or as only a demotion
target. Otherwise it's allocated / mapped like any other memory,
but it still requires isolation: it is restricted to the demotion
path and is not a fallback allocation target.
Driver basically says: kernel should prefer reclaim over fallback.
Benefits:
Simplifies driver design.
Encourages upstreaming common operations as new spec extensions.
Keeps device policy out of mm/
ABI: region/rdrv/* (maybe?)
More likely something like vendors just build derivative drivers:
driver/[common_use]/[vendor]/my_driver.c
#include <linux/cxl.h>
If cxl decoders are involved, the common driver can program them and
make the private_memory region; the device driver provides relevant
callbacks for the N_PRIVATE_MEMORY infrastructure.
If decoder programming is not involved, the device can call the private
node infrastructure directly and omit cxl-patterns.
RFC:
https://lore.kernel.org/linux-cxl/20260108203755.1163107-1-gourry@gourry.net/
================================================
</great wall of text>