Message-ID: <20260123045733.6954-1-ravis.opensrc@gmail.com>
Date: Thu, 22 Jan 2026 20:57:23 -0800
From: Ravi Jonnalagadda <ravis.opensrc@...il.com>
To: damon@...ts.linux.dev,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org
Cc: sj@...nel.org,
akpm@...ux-foundation.org,
corbet@....net,
bijan311@...il.com,
ajayjoshi@...ron.com,
Ravi Jonnalagadda <ravis.opensrc@...il.com>
Subject: [RFC PATCH 0/5] mm/damon: Add node_sys_bp quota goal metric for PA-based migration control
This series introduces a new DAMON quota goal metric, `node_sys_bp`, designed
for controlling memory migration in heterogeneous memory systems (e.g.,
DRAM↔CXL tiering). These patches are provided as an initial RFC and have
not been tested on actual hardware.
Background and Motivation
=========================
A previous patch series [1] by Bijan Tabatabai and me added weighted
interleave support for DAMON migrate_{hot,cold} actions. That series
implemented the feature for vaddr (virtual address) schemes because the
weight-based approach requires VMA offset information to determine target
nodes:
target_node = (vma_offset % total_weight) → node_from_weights
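As a rough illustration of the mapping above, the following userspace C sketch walks a weight table to pick a node. The helper name `weighted_target_node` and the choice to express the offset in pages are assumptions made for this example; it is not the code from [1].

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel implementation: map a page's offset
 * inside its VMA (counted in pages) to a node using per-node weights.
 */
static int weighted_target_node(unsigned long vma_pgoff,
                                const unsigned int *weights, size_t nr_nodes)
{
    unsigned long total = 0, slot;
    size_t nid;

    for (nid = 0; nid < nr_nodes; nid++)
        total += weights[nid];
    assert(total > 0);

    /* Position inside one interleave round of 'total' pages. */
    slot = vma_pgoff % total;

    /* Walk the weights until the slot falls inside a node's share. */
    for (nid = 0; nid < nr_nodes; nid++) {
        if (slot < weights[nid])
            return (int)nid;
        slot -= weights[nid];
    }
    return -1; /* not reached when total > 0 */
}
```

With weights {2, 3}, offsets 0-1 map to node 0 and offsets 2-4 map to node 1, and the pattern repeats every five pages. The result depends only on the offset, never on where the page currently resides.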
For paddr (physical address) schemes, obtaining the VMA offset requires
costly rmap (reverse mapping) walks. As noted in that series:
"However, finding out how a folio is mapped inside of a VMA requires
a costly rmap walk when using a paddr scheme. As such, we have
decided that this functionality makes more sense as a vaddr scheme."
This new series takes a different approach that enables PA-based migration
WITHOUT requiring rmap walks, by using basis points (bp) target-state goals
instead of weight-based action rates.
The rmap Cost vs Oscillation Trade-off
======================================
For PA-based migration with weights, there are two possibilities:
1. Weight-based with rmap:
- Use rmap to find VMA offset for each physical page
- Apply weights based on VMA offset (same algorithm as VA)
- Works correctly: VMA offset provides stable identity
- Problem: rmap walks are expensive for every migration candidate
2. Weight-based without rmap:
- Attempt to apply weights using only physical address information
- No stable identity across migrations
- Results in page oscillation (see below)
- Not viable
The Oscillation Problem (Weights Without rmap)
==============================================
Weight-based migration relies on a stable identifier to determine which node
a page "belongs to". For VA, this is the VMA offset - it remains constant
regardless of which physical node backs the page. For PA without rmap, no
such stable identifier exists.
Consider a two-node system with weights {Node 0: 40, Node 1: 60}:
Initial state:
Hot pages on Node 0: PFN 0x1000, 0x1001, 0x1002, 0x1003, 0x1004
Node 1: empty
Iteration 1 - Weight-based selection (no rmap):
System tries to achieve 40/60 distribution
Selects pages at PFN 0x1002, 0x1003, 0x1004 (60%) for Node 1
After migration, these pages get NEW PFNs on Node 1:
PFN 0x1002 → PFN 0x5000 (Node 1)
PFN 0x1003 → PFN 0x5001 (Node 1)
PFN 0x1004 → PFN 0x5002 (Node 1)
State after Iteration 1:
Node 0: PFN 0x1000, 0x1001 (40%)
Node 1: PFN 0x5000, 0x5001, 0x5002 (60%)
Iteration 2 - Weight-based selection runs again:
System sees pages at PFN 0x5000, 0x5001, 0x5002 on Node 1
These are "new" pages from the system's perspective
NO MEMORY that these were just migrated FROM Node 0
Weight-based logic may select some for migration back to Node 0
Iteration 3, 4, 5...:
Same pages continue bouncing between nodes
Each migration changes the PFN, erasing any "history"
System never converges to stable state
The fundamental issue: weights define an ACTION RATE ("migrate X% of
candidate pages to each node") rather than a TARGET STATE. Without stable
page identity (which rmap provides via VMA offset), the system cannot
determine which pages have already been "handled" and continues to
reprocess the same logical pages indefinitely.
With rmap, the VMA offset provides stable identity - a page at VMA offset
0x1000 always hashes to the same target node regardless of its current PFN.
Without rmap, we have no such anchor, and weights become meaningless.
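The identity argument can be made concrete with a small userspace model. Everything here is hypothetical: `struct model_page`, `target_by_key()` and `migrate_to_pfn()` exist only for this sketch, and keying the weights on the PFN is just one assumed way of "applying weights using only physical address information".

```c
#include <assert.h>

/* Illustrative page model: migration changes pfn, never vma_pgoff. */
struct model_page {
    unsigned long pfn;       /* physical frame number, changes on migration */
    unsigned long vma_pgoff; /* page offset inside its VMA, stable */
};

/* Two-node weighted choice keyed on an arbitrary identifier. */
static int target_by_key(unsigned long key, unsigned int w0, unsigned int w1)
{
    assert(w0 + w1 > 0);
    return (key % (w0 + w1)) < w0 ? 0 : 1;
}

/* Model a migration: the page gets a new frame, its VMA offset is kept. */
static struct model_page migrate_to_pfn(struct model_page page,
                                        unsigned long new_pfn)
{
    page.pfn = new_pfn;
    return page;
}
```

Using weights 2:3 (the 40/60 ratio from the example) and the move PFN 0x1002 → 0x5000, the offset-keyed choice is the same before and after the migration. The PFN-keyed choice flips from Node 1 to Node 0, so the page just moved to Node 1 immediately looks like a candidate to move back.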
Solution: bp-Based Target State Goals
=====================================
Instead of specifying action rates, `node_sys_bp` specifies a TARGET STATE:
"Node N should contain X basis points (X/10000) of system memory"
The migration control loop:
1. Calculate current_bp: sum bytes per node across monitored regions
2. Compare in the migration direction: when draining a node the goal is
satisfied once current_bp <= target_bp, when filling a node once
current_bp >= target_bp. If satisfied, STOP
3. Otherwise: continue migrating toward target
Example with target: "Node 0 should have 4000 bp (40%)"
Iteration 1:
current_bp = 10000 (100% on Node 0)
target_bp = 4000 (40%)
current > target → migrate cold pages away from Node 0
After Iteration 1:
current_bp = 4000 (40% on Node 0)
Iteration 2:
current_bp = 4000
target_bp = 4000
current <= target → STOP, goal satisfied
No oscillation - migration stops when target state is reached.
No page identity tracking needed because we measure the END STATE, not
which specific pages were moved. The early-exit prevents oscillation by
stopping when the goal is met.
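A minimal userspace sketch of this control loop follows. It assumes the stop condition depends on the migration direction, as the "directional early-exit" in patch 4/5 suggests. The names `migrate_dir`, `bytes_to_bp()`, `goal_satisfied()` and `drain_node_to_target()` are hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stdbool.h>

enum migrate_dir { MIGRATE_AWAY_FROM_NODE, MIGRATE_TOWARD_NODE };

/* Share of system memory on a node, in basis points (1 bp = 0.01%).
 * The multiplication can overflow for very large byte counts; this is
 * acceptable for a model, not for production code. */
static unsigned long bytes_to_bp(unsigned long long node_bytes,
                                 unsigned long long system_bytes)
{
    assert(system_bytes > 0);
    return (unsigned long)(node_bytes * 10000ULL / system_bytes);
}

/* True when no further migration is needed in the given direction. */
static bool goal_satisfied(unsigned long current_bp, unsigned long target_bp,
                           enum migrate_dir dir)
{
    if (dir == MIGRATE_AWAY_FROM_NODE)
        return current_bp <= target_bp;
    return current_bp >= target_bp;
}

/* Move up to step_bytes off the node per iteration until the goal holds.
 * Returns the number of iterations that actually migrated something. */
static unsigned int drain_node_to_target(unsigned long long *node_bytes,
                                         unsigned long long system_bytes,
                                         unsigned long target_bp,
                                         unsigned long long step_bytes)
{
    unsigned int migrations = 0;

    assert(step_bytes > 0);
    while (!goal_satisfied(bytes_to_bp(*node_bytes, system_bytes),
                           target_bp, MIGRATE_AWAY_FROM_NODE)) {
        unsigned long long moved =
            step_bytes < *node_bytes ? step_bytes : *node_bytes;

        *node_bytes -= moved;
        migrations++;
    }
    return migrations;
}
```

Starting with all memory on the node, a 4000 bp target and steps of 20% of system memory, the loop migrates three times and a second call does nothing. The decision uses only the measured end state, so no per-page history is needed.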
Why get_goal_metric() Ops Callback
==================================
The bp calculation requires iterating over monitored PA regions:
    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
        if (page_to_nid(pfn_to_page(pfn)) == nid)
            node_bytes += PAGE_SIZE;
    }
    bp = node_bytes * 10000 / system_total;
This requires address-space knowledge that only the ops provider has.
Existing goal metrics (PSI, node_mem_*, node_memcg_*) are computed in
core using system-wide data that doesn't require iterating monitored
regions.
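For readers who want to experiment outside the kernel, here is a userspace model of the same calculation. It substitutes a plain array (`pfn_nid`) for `page_to_nid(pfn_to_page(pfn))` and a hypothetical `struct pfn_range` for a monitored region. The 4 KiB page size and all names are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumed page size for the model */

/* Hypothetical stand-in for one monitored physical-address region. */
struct pfn_range {
    unsigned long start_pfn; /* inclusive */
    unsigned long end_pfn;   /* exclusive */
};

/*
 * Sum the bytes that reside on 'nid' across all monitored ranges and
 * express them as basis points of 'system_bytes'. pfn_nid[pfn] plays the
 * role of page_to_nid(pfn_to_page(pfn)) in this model.
 */
static unsigned long model_node_sys_bp(const struct pfn_range *ranges,
                                       size_t nr_ranges,
                                       const int *pfn_nid, size_t nr_pfns,
                                       int nid,
                                       unsigned long long system_bytes)
{
    unsigned long long node_bytes = 0;
    unsigned long pfn;
    size_t i;

    assert(system_bytes > 0);
    for (i = 0; i < nr_ranges; i++) {
        assert(ranges[i].end_pfn <= nr_pfns);
        for (pfn = ranges[i].start_pfn; pfn < ranges[i].end_pfn; pfn++) {
            if (pfn_nid[pfn] == nid)
                node_bytes += MODEL_PAGE_SIZE;
        }
    }
    return (unsigned long)(node_bytes * 10000ULL / system_bytes);
}
```

Note that only pages inside the monitored ranges are counted while the denominator is the whole system, so pages on the node that lie outside the monitored regions do not contribute to the result.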
The new `get_goal_metric()` callback in `damon_operations` allows:
1. Core to remain generic - handles all common metrics
2. Ops providers to implement metrics requiring region iteration
3. Clean separation - PA implements node_sys_bp, VA could add
different metrics in future
4. Optional implementation - ops return 0 if metric unsupported
This design ensures node_sys_bp is only computed when using PA contexts
where it makes sense, while keeping the core quota goal infrastructure
unchanged for other metrics and ops providers.
The callback pattern allows each ops provider to implement metrics
specific to its address space model without burdening the core with
ops-specific knowledge.
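The delegation pattern can be sketched in userspace as below. The struct layouts, enum values and the callback signature are invented for illustration and should not be read as the actual `damon_operations` interface. The fixed values returned by the model PA provider are placeholders for a real region walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical metric identifiers for this model only. */
enum model_metric { MODEL_METRIC_USER_INPUT, MODEL_METRIC_NODE_SYS_BP };

struct model_goal {
    enum model_metric metric;
    unsigned long user_value; /* used by MODEL_METRIC_USER_INPUT only */
    int nid;                  /* used by MODEL_METRIC_NODE_SYS_BP only */
};

struct model_ops {
    /* Optional: NULL when the provider has no ops-specific metrics. */
    unsigned long (*get_goal_metric)(const struct model_goal *goal);
};

/* Model PA provider: placeholder numbers instead of a real region walk. */
static unsigned long model_pa_get_goal_metric(const struct model_goal *goal)
{
    if (goal->metric == MODEL_METRIC_NODE_SYS_BP)
        return goal->nid == 0 ? 4000 : 6000;
    return 0; /* metric not supported by this provider */
}

/* Model core: handles generic metrics itself, delegates the rest. */
static unsigned long model_core_goal_value(const struct model_ops *ops,
                                           const struct model_goal *goal)
{
    assert(ops && goal);
    switch (goal->metric) {
    case MODEL_METRIC_USER_INPUT:
        return goal->user_value;
    case MODEL_METRIC_NODE_SYS_BP:
        return ops->get_goal_metric ? ops->get_goal_metric(goal) : 0;
    }
    return 0;
}
```

A provider that leaves the callback unset simply reports 0 for the ops-specific metric, which mirrors the "return 0 if metric unsupported" convention described above.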
Advantages of PA-Based Migration
================================
PA-based migration with DAMON enables integration of multiple hotness
sources for migration decisions:
1. DAMON's native access pattern monitoring
2. Fault-based information (similar to NUMA Balancing)
3. Future: Hardware monitoring units (e.g., CXL CHMU)
4. Future: Hardware sampling facilities (AMD IBS, Intel PEBS)
Unlike VA-based approaches tied to individual process address spaces, PA
monitoring can aggregate hotness information from multiple sources to make
system-wide migration decisions across the entire physical memory space.
Complementary to Existing vaddr Migration
=========================================
This series complements rather than replaces the vaddr weighted interleave
migration merged in 6.18:
vaddr migration (weight-based):
- Per-process control
- Fine-grained interleave patterns via VMA offset
- Deterministic placement based on weights
paddr migration (bp-based, this series):
- System-wide control
- Target-state goals for node capacity management
- No rmap overhead
- Aggregates multiple hotness sources
Capacity Clamping
=================
The series also implements capacity clamping for `node_sys_bp` goals. In a
system where Node 0 has 40% of total RAM, a target of 50% for Node 0 can
never be reached. The implementation clamps:
effective_target = min(user_target, node_capacity_bp)
This prevents the quota auto-tuning from chasing impossible targets and
avoids thrashing in two-context DRAM↔CXL setups.
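The clamp itself is simple arithmetic. A hedged userspace sketch, with hypothetical helper names `node_capacity_bp()` and `effective_target_bp()`:

```c
#include <assert.h>

/* Capacity of a node as basis points of total system memory. */
static unsigned long node_capacity_bp(unsigned long long node_bytes,
                                      unsigned long long system_bytes)
{
    assert(system_bytes > 0 && node_bytes <= system_bytes);
    return (unsigned long)(node_bytes * 10000ULL / system_bytes);
}

/* effective_target = min(user_target, node_capacity_bp) */
static unsigned long effective_target_bp(unsigned long user_target_bp,
                                         unsigned long capacity_bp)
{
    return user_target_bp < capacity_bp ? user_target_bp : capacity_bp;
}
```

For a node holding 40 GiB of a 100 GiB system, the capacity is 4000 bp, so a user target of 5000 bp is reduced to 4000 bp while a target of 3000 bp is left unchanged.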
Patches
=======
1/5: mm/damon/core: add DAMOS_QUOTA_NODE_SYS_BP metric
Adds the enum value and documentation.
2/5: mm/damon: add get_goal_metric() op and PA provider
Introduces the ops callback and PA implementation that iterates
monitored regions to calculate node_sys_bp without rmap.
3/5: mm/damon/core: add new ops-specific goal metric
Wires the new metric into core's quota goal evaluation, delegating
to ops.get_goal_metric() for DAMOS_QUOTA_NODE_SYS_BP.
4/5: mm/damon/paddr: capacity clamp and directional early-exit
Adds capacity clamping and early-exit logic to prevent migration
when goal is already satisfied.
5/5: mm/damon/sysfs-schemes: accept "node_sys_bp" in goal's target_metric
Exposes the new metric to userspace via sysfs.
Status
======
This is an early RFC for design review. The patches:
- Compile successfully with no errors or warnings
- Have NOT been tested on actual hardware
Feedback on the overall approach and design is appreciated.
References
==========
[1] mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions (v4)
https://lore.kernel.org/linux-mm/20250709005952.17776-1-bijan311@gmail.com/
Merged in Linux 6.18
Ravi Jonnalagadda (5):
mm/damon/core: add DAMOS_QUOTA_NODE_SYS_BP metric
mm/damon: add get_goal_metric() op and PA provider
mm/damon/core: add new ops-specific goal metric
mm/damon/paddr: capacity clamp and directional early-exit for
node_sys_bp
mm/damon/sysfs-schemes: accept "node_sys_bp" in goal's target_metric
include/linux/damon.h | 5 ++
mm/damon/core.c | 34 ++++++++++---
mm/damon/paddr.c | 102 +++++++++++++++++++++++++++++++++++++++
mm/damon/sysfs-schemes.c | 7 +++
4 files changed, 141 insertions(+), 7 deletions(-)
--
2.43.0