Message-ID: <20260124015045.78075-1-sj@kernel.org>
Date: Fri, 23 Jan 2026 17:50:43 -0800
From: SeongJae Park <sj@...nel.org>
To: Ravi Jonnalagadda <ravis.opensrc@...il.com>
Cc: SeongJae Park <sj@...nel.org>,
	damon@...ts.linux.dev,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	linux-doc@...r.kernel.org,
	akpm@...ux-foundation.org,
	corbet@....net,
	bijan311@...il.com,
	ajayjoshi@...ron.com,
	Honggyu Kim <honggyu.kim@...com>,
	Yunjeong Mun <yunjeong.mun@...com>
Subject: Re: [RFC PATCH 0/5] mm/damon: Add node_sys_bp quota goal metric for PA-based migration control

Cc-ing SK hynix folks (Honggyu and Yunjeong) for the quota auto-tuning behavior
confusion (it does not stop immediately after the goal is satisfied) that I
discuss below.

Hello Ravi,


Thank you for sharing this great RFC!

I had a chance to be involved in the high level design of this series, off the
mailing list.  I'm disclosing that fact for others' context, since a few of my
comments below are based on the previous off-list discussion.

On Thu, 22 Jan 2026 20:57:23 -0800 Ravi Jonnalagadda <ravis.opensrc@...il.com> wrote:

> This series introduces a new DAMON quota goal metric,

As I discussed off the list, I believe the new goal can be useful.  Thank you
again for making this!

> `node_sys_bp`, designed
> for controlling memory migration in heterogeneous memory systems (e.g.,
> DRAM↔CXL tiering). These patches are provided as an initial RFC and have
> not been tested on actual hardware.

I feel like the name doesn't explain very well what it is, though.  I'm also a
bit confused about whether what it does is what we discussed off list.  I'll
add more details below, with a more detailed explanation of the new goal.

> 
> Background and Motivation
> =========================
> 
> A previous patch series [1] by Bijan Tabatabai and myself added weighted
> interleave support for DAMON migrate_{hot,cold} actions. That series
> implemented the feature for vaddr (virtual address) schemes because the
> weight-based approach requires VMA offset information to determine target
> nodes:
> 
>     target_node = (vma_offset % total_weight) → node_from_weights
> 
> For paddr (physical address) schemes, obtaining VMA offset requires costly
> rmap (reverse mapping) walks. As noted in that series:
> 
>     "However, finding out how a folio is mapped inside of a VMA requires
>      a costly rmap walk when using a paddr scheme. As such, we have
>      decided that this functionality makes more sense as a vaddr scheme."

Indeed, the rmap overhead was one main concern.  I'd like to also note,
however, that another major concern of mine was that the behavior did not
sound easy to understand intuitively.
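
For other readers' context, below is a minimal sketch of the weight-based
selection that the above formula describes.  The names (node_from_weights(),
node_weights[]) are illustrative only, not the merged vaddr code:

    /*
     * Weighted target node selection keyed on the VMA offset.  The
     * same offset always maps to the same node regardless of the
     * current PFN, which is what gives the vaddr scheme a stable
     * page identity.
     */
    static int node_from_weights(unsigned long vma_offset,
            const unsigned int *node_weights, int nr_nodes)
    {
        unsigned int total_weight = 0, cursor;
        int nid;

        for (nid = 0; nid < nr_nodes; nid++)
            total_weight += node_weights[nid];
        cursor = vma_offset % total_weight;
        for (nid = 0; nid < nr_nodes; nid++) {
            if (cursor < node_weights[nid])
                return nid;
            cursor -= node_weights[nid];
        }
        return 0;    /* not reached while total_weight > 0 */
    }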

> 
> This new series takes a different approach that enables PA-based migration
> WITHOUT requiring rmap walks, by using basis points (bp) target-state goals
> instead of weight-based action rates.
> 
> The rmap Cost vs Oscillation Trade-off
> ======================================
> 
> For PA-based migration with weights, there are two possibilities:
> 
>   1. Weight-based with rmap:
>      - Use rmap to find VMA offset for each physical page
>      - Apply weights based on VMA offset (same algorithm as VA)
>      - Works correctly: VMA offset provides stable identity
>      - Problem: rmap walks are expensive for every migration candidate
> 
>   2. Weight-based without rmap:
>      - Attempt to apply weights using only physical address information
>      - No stable identity across migrations
>      - Results in page oscillation (see below)
>      - Not viable
> 
> The Oscillation Problem (Weights Without rmap)
> ==============================================
> 
> Weight-based migration relies on a stable identifier to determine which node
> a page "belongs to". For VA, this is the VMA offset - it remains constant
> regardless of which physical node backs the page. For PA without rmap, no
> such stable identifier exists.
> 
> Consider a two-node system with weights {Node 0: 40, Node 1: 60}:
> 
>   Initial state:
>     Hot pages on Node 0: PFN 0x1000, 0x1001, 0x1002, 0x1003, 0x1004
>     Node 1: empty
> 
>   Iteration 1 - Weight-based selection (no rmap):
>     System tries to achieve 40/60 distribution
>     Selects pages at PFN 0x1002, 0x1003, 0x1004 (60%) for Node 1
>     After migration, these pages get NEW PFNs on Node 1:
>       PFN 0x1002 → PFN 0x5000 (Node 1)
>       PFN 0x1003 → PFN 0x5001 (Node 1)
>       PFN 0x1004 → PFN 0x5002 (Node 1)
> 
>   State after Iteration 1:
>     Node 0: PFN 0x1000, 0x1001 (40%)
>     Node 1: PFN 0x5000, 0x5001, 0x5002 (60%)
> 
>   Iteration 2 - Weight-based selection runs again:
>     System sees pages at PFN 0x5000, 0x5001, 0x5002 on Node 1
>     These are "new" pages from the system's perspective
>     NO MEMORY that these were just migrated FROM Node 0
>     Weight-based logic may select some for migration back to Node 0
> 
>   Iteration 3, 4, 5...:
>     Same pages continue bouncing between nodes
>     Each migration changes the PFN, erasing any "history"
>     System never converges to stable state
> 
> The fundamental issue: weights define an ACTION RATE ("migrate X% of
> candidate pages to each node") rather than a TARGET STATE. Without stable
> page identity (which rmap provides via VMA offset), the system cannot
> determine which pages have already been "handled" and continues to
> reprocess the same logical pages indefinitely.
> 
> With rmap, the VMA offset provides stable identity - a page at file offset
> 0x1000 always hashes to the same target node regardless of its current PFN.
> Without rmap, we have no such anchor, and weights become meaningless.
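
To make the identity point concrete with the sketch from above: if the
selection key is the page's current PFN instead of the VMA offset, the key
itself changes on every migration.  Illustration only, with a hypothetical
name:

    /*
     * Same weighted walk as node_from_weights() above, but keyed on
     * a transient value.  After a migration the folio gets a new
     * PFN, so the same logical page can resolve to a different
     * target node on the next pass -- the oscillation described
     * above.
     */
    static int target_node_without_rmap(unsigned long pfn,
            const unsigned int *node_weights, int nr_nodes)
    {
        return node_from_weights(pfn, node_weights, nr_nodes);
    }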
> 
> Solution: bp-Based Target State Goals
> =====================================
> 
> Instead of specifying action rates, `node_sys_bp` specifies a TARGET STATE:
> 
>     "Node N should contain X basis points (X/10000) of system memory"

I'm not sure if this is exactly what we discussed off list.  What I expected
the goal to be, based on our discussion, is what the first patch is saying.
To quote the relevant part,

    @@ -155,6 +155,7 @@ enum damos_action {
      * @DAMOS_QUOTA_NODE_MEM_FREE_BP:      MemFree ratio of a node.
      * @DAMOS_QUOTA_NODE_MEMCG_USED_BP:    MemUsed ratio of a node for a cgroup.
      * @DAMOS_QUOTA_NODE_MEMCG_FREE_BP:    MemFree ratio of a node for a cgroup.
    + * @DAMOS_QUOTA_NODE_SYS_BP:           Scheme-eligible bytes ratio of a node.
      * @NR_DAMOS_QUOTA_GOAL_METRICS:       Number of DAMOS quota goal metrics.

That is, my understanding of what we want to achieve with the new goal is
letting users ask DAMON to "migrate hot pages of node A to node B, aiming for
X % of node B to become hot as a result of the migrations".

But your above description is not saying about the "scheme-eligibility".  Also,
the goal metric is the ratio of the memory to the entire system memory, not
just a given node.  My quick read of damon_pa_get_node_sys_bp() on the second
patch of this series confirms the implementation is following your description,
not what I imagined.

Have you found something we missed during the off-list discussion, and
therefore changed the definition of the goal to the current one?  If so, could
you share the details?  Or, did you also want to implement what I expected,
but something got in the way of your code and documentation?

> 
> The migration control loop:
> 
>   1. Calculate current_bp: sum bytes per node across monitored regions
>   2. Compare: if current_bp >= target_bp, STOP (goal satisfied)
>   3. Otherwise: continue migrating toward target
> 
>   Example with target: "Node 0 should have 4000 bp (40%)"

Regardless of why my expectation was different from this one, the goal I was
expecting might still be usable for the case you describe.  That is, the
scheme's target access pattern could be "any access pattern", and the target
metric value could be adjusted based on the ratio between the node 0 memory
and the total memory.  For example, suppose there is 100 GiB of memory in
total, and 80 GiB of memory on node 0 (80 % of total memory on node 0).  Then
40 % of total memory, which is the target of the above example, is 40 GiB,
which is 50 % of node 0.  So, you could use the goal I expected, with 50 % as
the target value.  An imaginable usage would look like,

    # --damos_quota_goal receives <metric name>, <target metric value>, and
    # <node id of the goal>
    damo tune --damos_action migrate_hot --damos_quota_goal new_goal_metric 50% 0

I'm sure you know that you could s/migrate_hot/migrate_cold/ if you prefer to
migrate cold pages first.

So, the metric with my expected definition can still be useful for your new
use case.  I'd prefer my expected definition if there is no concern, since it
seems useful in more cases.
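
In case it helps, below is a sketch of the conversion I'm describing above.
sys_bp_to_node_bp() is a purely illustrative helper, not existing code:

    /*
     * Convert a system-wide target (bp of total memory) into the
     * node-relative target (bp of the node) for the goal I expected.
     * E.g., 4000 bp of 100 GiB total == 40 GiB == 5000 bp of an
     * 80 GiB node.
     */
    static unsigned long sys_bp_to_node_bp(unsigned long sys_target_bp,
            unsigned long node_bytes, unsigned long total_bytes)
    {
        return sys_target_bp * total_bytes / node_bytes;
    }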

> 
>     Iteration 1:
>       current_bp = 10000 (100% on Node 0)

How could 100 % of system memory be on node 0?  Are you saying 100 % of "hot"
memory?  This makes me suspect you _might_ actually have wanted to implement
what I was expecting and what the first patch of this series is documenting
(scheme-eligible bytes ratio of a node), but somehow something made the ball
drop in the wrong place.

>       target_bp  = 4000 (40%)
>       current > target → migrate cold pages away from Node 0
> 
>     After Iteration 1:
>       current_bp = 4000 (40% on Node 0)
> 
>     Iteration 2:
>       current_bp = 4000
>       target_bp  = 4000
>       current >= target → STOP, goal satisfied

Please note that the goal-based quota auto-tuning works in a proportional way,
preferring small steps and "eventual" goal convergence.  As a result,
migration will occur a few more times before it completely stops after the
goal is satisfied.  Unless there is another scheme that migrates pages into
node 0, you may end up with node 0 having a bit less than 40 % of the memory.
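
A simplified illustration of that behavior, which is not DAMON's actual code,
could look like below.  The point is only that the quota decays over a few
iterations after the goal is met, instead of dropping to zero at once:

    /*
     * Illustration only: the effective quota is adjusted gradually
     * from feedback.  The trailing decay iterations are the extra
     * migrations I mention above.
     */
    static unsigned long next_quota(unsigned long quota,
            unsigned long current_bp, unsigned long target_bp)
    {
        if (current_bp < target_bp)
            return quota + quota / 2;    /* under the goal: grow gently */
        return quota / 2;                /* over the goal: decay gently */
    }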

> 
>     No oscillation - migration stops when target state is reached.

So, a little bit of oscillation could still happen.  Hopefully it shouldn't be
significant, though.

IIRC, SK hynix people were also confused by this behavior when they
experimented with migrate_{hot,cold} actions and NODE_MEM_USED_BP goal based
quota auto-tuning, using only a single scheme that migrates in a single
direction.  Because this is at least the second time it has caused confusion,
maybe I can try to add a feature that makes DAMOS stop immediately after the
goal is satisfied, if you need it.  Let me know if such a new feature would be
useful for you.  Cc-ing SK hynix people (Honggyu and Yunjeong) so that they
can correct me if my memory is wrong, or answer whether the new feature I
described here would be useful for them.

> 
> No page identity tracking needed because we measure the END STATE, not
> which specific pages were moved. The early-exit prevents oscillation by
> stopping when the goal is met.
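
For readers, the described loop in miniature.  node_bytes(), goal_satisfied(),
and do_bounded_migration() are hypothetical names for illustration, not the
actual code of the fourth patch:

    static void tune_toward_target(int nid, unsigned long target_bp,
            unsigned long total_bytes)
    {
        for (;;) {
            unsigned long current_bp =
                    node_bytes(nid) * 10000 / total_bytes;

            /*
             * Direction-dependent: the example above empties an
             * over-full node, so whether "satisfied" means
             * current >= target or <= target flips with the action.
             */
            if (goal_satisfied(current_bp, target_bp))
                break;                  /* early-exit: migrate nothing */
            do_bounded_migration();     /* a small step within quota */
        }
    }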
> 
> Why get_goal_metric() Ops Callback
> ==================================
> 
> The bp calculation requires iterating over monitored PA regions:
> 
>     for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>         if (page_to_nid(pfn_to_page(pfn)) == nid)
>             node_bytes += PAGE_SIZE;
>     }
>     bp = node_bytes * 10000 / system_total;
> 
> This requires address-space knowledge that only the ops provider has.
> Existing goal metrics (PSI, node_mem_*, node_memcg_*) are computed in
> core using system-wide data that doesn't require iterating monitored
> regions.
> 
> The new `get_goal_metric()` callback in `damon_operations` allows:
> 
>   1. Core to remain generic - handles all common metrics

I agree this indeed makes the design clean.  But we already have such an
exception, like the core.c code using damon_target_has_pid().  Having just one
more exception seems ok to me, unless it makes the code too ugly.

>   2. Ops providers to implement metrics requiring region iteration
>   3. Clean separation - PA implements node_sys_bp, VA could add
>      different metrics in future

I agree it could be useful for cleanly supporting the virtual address mode in
the future.  But I'd prefer to make this as simple and small as possible for
the support we will actually use at the moment.

>   4. Optional implementation - ops return 0 if metric unsupported

Again, letting the core logic have a bit of ops layer information is not a big
problem from my humble perspective.

So, I'd prefer not adding the ops callback, unless you have some other
concerns.

> 
> This design ensures node_sys_bp is only computed when using PA contexts
> where it makes sense, while keeping the core quota goal infrastructure
> unchanged for other metrics and ops providers.
> 
> The callback pattern allows each ops provider to implement metrics
> specific to its address space model without burdening the core with
> ops-specific knowledge.
> 
> Advantages of PA-Based Migration
> ================================
> 
> PA-based migration with DAMON enables integration of multiple hotness
> sources for migration decisions:
> 
>   1. DAMON's native access pattern monitoring
>   2. Fault-based information (similar to NUMA Balancing)
>   3. Future: Hardware monitoring units (e.g., CXL CHMU)
>   4. Future: Instruction-Based Sampling (AMD IBS, Intel PEBS)
> 
> Unlike VA-based approaches tied to individual process address spaces, PA
> monitoring can aggregate hotness information from multiple sources to make
> system-wide migration decisions across the entire physical memory space.

Maybe you are talking about the damon_report_access() based DAMON extension
project [1]?  Since it is not yet upstreamed, and the long term plan is to
support not only the physical address space but also virtual address spaces, I
think this section would better be removed, unless the DAMON extension project
is merged before this patch series.  I expect this patch series to be merged
much earlier than the extension project.

> 
> Complementary to Existing vaddr Migration
> =========================================
> 
> This series complements rather than replaces the vaddr weighted interleave
> migration merged in 6.18:
> 
>   vaddr migration (weight-based):
>     - Per-process control
>     - Fine-grained interleave patterns via VMA offset
>     - Deterministic placement based on weights
> 
>   paddr migration (bp-based, this series):
>     - System-wide control
>     - Target-state goals for node capacity management
>     - No rmap overhead
>     - Aggregates multiple hotness sources

As I mentioned above, the last item could be dropped until the extension
project [1] is upstreamed.

> 
> Capacity Clamping
> =================
> 
> The series also implements capacity clamping for `node_sys_bp` goals. In a
> system where Node 0 has 40% of total RAM, setting a target of 50% is
> impossible. The implementation clamps:
> 
>     effective_target = min(user_target, node_capacity_bp)
> 
> This prevents the quota auto-tuning from chasing impossible targets and
> avoids thrashing in two-context DRAM↔CXL setups.
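
For other readers, a minimal sketch of the described clamp, assuming
node_capacity_bp is computed from the node/total byte ratio (names are
illustrative; min() as in <linux/minmax.h>):

    unsigned long node_capacity_bp = node_bytes * 10000 / total_bytes;
    /* never chase a target the node cannot physically hold */
    unsigned long effective_target = min(user_target_bp, node_capacity_bp);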
> 
> Patches
> =======
> 
>   1/5: mm/damon/core: add DAMOS_QUOTA_NODE_SYS_BP metric
>        Adds the enum value and documentation.
> 
>   2/5: mm/damon: add get_goal_metric() op and PA provider
>        Introduces the ops callback and PA implementation that iterates
>        monitored regions to calculate node_sys_bp without rmap.
> 
>   3/5: mm/damon/core: add new ops-specific goal metric
>        Wires the new metric into core's quota goal evaluation, delegating
>        to ops.get_goal_metric() for DAMOS_QUOTA_NODE_SYS_BP.
> 
>   4/5: mm/damon/paddr: capacity clamp and directional early-exit
>        Adds capacity clamping and early-exit logic to prevent migration
>        when goal is already satisfied.
> 
>   5/5: mm/damon/sysfs-schemes: accept "node_sys_bp" in goal's target_metric
>        Exposes the new metric to userspace via sysfs.
> 
> Status
> ======
> 
> This is an early RFC for design review. The patches:
>   - Compile successfully with no errors or warnings
>   - Have NOT been tested on actual hardware

I understand you plan to run tests, and that would be useful and appreciated.
Note that I don't think having no test results is a blocker for an RFC patch
series.  So, again, I appreciate you sharing this RFC! :)

> 
> Feedback on the overall approach and design is appreciated.

I hope my comments above help the forward progress of this nice series.

> 
> References
> ==========
> 
> [1] mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions (v4)
>     https://lore.kernel.org/linux-mm/20250709005952.17776-1-bijan311@gmail.com/
>     Merged in Linux 6.18
> 
> 
> Ravi Jonnalagadda (5):
>   mm/damon/core: add DAMOS_QUOTA_NODE_SYS_BP metric
>   mm/damon: add get_goal_metric() op and PA provider
>   mm/damon/core: add new ops-specific goal metric
>   mm/damon/paddr: capacity clamp and directional early-exit for
>     node_sys_bp
>   mm/damon/sysfs-schemes: accept "node_sys_bp" in goal's target_metric
> 
>  include/linux/damon.h    |   5 ++
>  mm/damon/core.c          |  34 ++++++++++---
>  mm/damon/paddr.c         | 102 +++++++++++++++++++++++++++++++++++++++
>  mm/damon/sysfs-schemes.c |   7 +++
>  4 files changed, 141 insertions(+), 7 deletions(-)
> 
> -- 
> 2.43.0

[1] https://lore.kernel.org/damon/20251208062943.68824-1-sj@kernel.org/


Thanks,
SJ
