[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20260128012539.72119-1-sj@kernel.org>
Date: Tue, 27 Jan 2026 17:25:37 -0800
From: SeongJae Park <sj@...nel.org>
To: Ravi Jonnalagadda <ravis.opensrc@...il.com>
Cc: SeongJae Park <sj@...nel.org>,
damon@...ts.linux.dev,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org,
akpm@...ux-foundation.org,
corbet@....net,
bijan311@...il.com,
ajayjoshi@...ron.com,
Honggyu Kim <honggyu.kim@...com>,
Yunjeong Mun <yunjeong.mun@...com>
Subject: Re: [RFC PATCH 0/5] mm/damon: Add node_sys_bp quota goal metric for PA-based migration control
On Tue, 27 Jan 2026 10:52:53 -0800 Ravi Jonnalagadda <ravis.opensrc@...il.com> wrote:
> On Fri, Jan 23, 2026 at 5:50 PM SeongJae Park <sj@...nel.org> wrote:
> >
> > Cc-ing SK hynix folks (Honggyu and Yunjeong) for quota auto-tuning behavior
> > confusion (not stop immediately after satisfying the goal) I discuss below.
> >
> > Hello Ravi,
> >
>
> Hi SJ,
>
> Thank you for the detailed review and for Cc'ing SK hynix folks.
>
> >
> > Thank you for sharing this great RFC!
> >
> > I had a chance to be involved in a high level design of this series, off the
> > mailing list. I'm disclosing the fact for others' context, since a few of my
> > comments below are based on the previous off-list discussion.
> >
> > On Thu, 22 Jan 2026 20:57:23 -0800 Ravi Jonnalagadda <ravis.opensrc@...il.com> wrote:
> >
> > > This series introduces a new DAMON quota goal metric,
> >
> > As I discussed off the list, I believe the new goal can be useful. Thank you
> > again for making this!
> >
> > > `node_sys_bp`, designed
> > > for controlling memory migration in heterogeneous memory systems (e.g.,
> > > DRAM↔CXL tiering). These patches are provided as an initial RFC and have
> > > not been tested on actual hardware.
> >
> > I feel like the name is not very self explaining what it is, though. I'm also
> > bit confused about what it does is what we discussed off list. I'll add more
> > details below, on more detailed explanation about the new goal.
> >
>
> Will change the name to damos_target_mem_node_sys_bp in v2.
Sounds good name. But, could it be more consistent with existing node-related
quota goal metrics such as 'node_mem_used_bp' and 'node_mem_free_bp'? What
about... say... 'node_target_mem_bp'? Please feel free to suggest a better
one.
[...]
> > > Solution: bp-Based Target State Goals
> > > =====================================
> > >
> > > Instead of specifying action rates, `node_sys_bp` specifies a TARGET STATE:
> > >
> > > "Node N should contain X basis points (X/10000) of system memory"
> >
> > I'm not sure if this is exactly what we discussed off list. What I expected
> > the goal would be based on our discussion is, what the first patch is saying.
> > To quote the part,
> >
> > @@ -155,6 +155,7 @@ enum damos_action {
> > * @DAMOS_QUOTA_NODE_MEM_FREE_BP: MemFree ratio of a node.
> > * @DAMOS_QUOTA_NODE_MEMCG_USED_BP: MemUsed ratio of a node for a cgroup.
> > * @DAMOS_QUOTA_NODE_MEMCG_FREE_BP: MemFree ratio of a node for a cgroup.
> > + * @DAMOS_QUOTA_NODE_SYS_BP: Scheme-eligible bytes ratio of a node.
> > * @NR_DAMOS_QUOTA_GOAL_METRICS: Number of DAMOS quota goal metrics.
> >
> > That is, my understanding of what we want to achive with the new goal is,
> > letting users to ask DAMON "migrate hot pages of node A to node B, aiming
> > X % of node B becomes hot, as a result of the migrations".
> >
>
> Yes, this is the intent. Looking back at my implementation, I see the
> mismatch:
>
> 1. **Numerator**: Should count only scheme-eligible bytes.
>
> 2. **Denominator**: Should use node capacity, but I used total system
> memory.
>
> > But your above description is not saying about the "scheme-eligibility". Also,
> > the goal metric is the ratio of the memory to the entire system memory, not
> > just a given node. My quick read of damon_pa_get_node_sys_bp() on the second
> > patch of this series confirms the implementation is following your description,
> > not what I imagined.
> >
>
> You're right. The implementation diverged from what we discussed. I'll
> fix both the numerator and denominator in v2.
Thank you for clarifying it and sharing the next plan, all sounds good!
[...]
> > Please note that the goal-based quota auto-tuning works in proportional way,
> > preferring small steps and "eventual" goal convergence. As a result, migration
> > will occur a few more times until it is completely stopped after the goal is
> > satisfied. Unless there is another scheme that migrates pages into node 0, you
> > may end up having node 0 having a bit less than the 40% memory.
> >
> > >
> > > No oscillation - migration stops when target state is reached.
> >
> > So, little bit of oscillation could still happen. Hopefully that shouldn't be
> > significant, though.
> >
>
> Yes, As we discussed offline, for a two-context approach:
>
> Context 0: monitors node 0, migrate_hot → node 1
> goal: damos_target_mem_node_sys_bp, nid=1, target=4000
> "Stop when node 1 is 40% hot"
>
> Context 1: monitors node 1, migrate_hot → node 0
> goal: damos_target_mem_node_sys_bp, nid=0, target=6000
> "Stop when node 0 is 60% hot"
>
> Each context eventually stops when its target node reaches the desired
> threshold,
> and the reverse direction is handled by the other context. For my use
> case, eventual convergence with this setup could be acceptable.
Thank you for clarifying this! I'm relieved we don't have a concern here.
> An immediate-stop feature could still be useful for the broader community.
Thank you for the feedback. I will take more time on thinking about how to
implement this.
> Will test and post results after the next iteration.
>
> > IIRC, SK hynix people also confused with the behavior when they experimented
> > migrate_{hot,cold} action with NODE_MEM_USED_BP goal based quota auto-tuning,
> > but using only a single scheme that does migration in a single direction.
> > Because this is at least second time it made confusion, if you need, maybe I
> > can try to add a feature for making DAMOS immediately stops after the goal is
> > satisfied. Let me know if such new feature can be useful for you. Cc-ing SK
> > hynix people (Honggyu and Yunjeong) so that they can correct me if my memory is
> > broken, or answer if the new feature I described here can be useful for them.
[...]
> > > Why get_goal_metric() Ops Callback
> > > ==================================
> > >
> > > The bp calculation requires iterating over monitored PA regions:
> > >
> > > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > > if (page_to_nid(pfn_to_page(pfn)) == nid)
> > > node_bytes += PAGE_SIZE;
> > > }
> > > bp = node_bytes * 10000 / system_total;
> > >
> > > This requires address-space knowledge that only the ops provider has.
> > > Existing goal metrics (PSI, node_mem_*, node_memcg_*) are computed in
> > > core using system-wide data that doesn't require iterating monitored
> > > regions.
> > >
> > > The new `get_goal_metric()` callback in `damon_operations` allows:
> > >
> > > 1. Core to remain generic - handles all common metrics
> >
> > I agree this is indeed making the design clean. But, we already having such
> > exception, like core.c code using 'damon_target_has_pid(). Having just one
> > more exception seems ok to me, unless it makes code too ugly.
> >
> > > 2. Ops providers to implement metrics requiring region iteration
> > > 3. Clean separation - PA implements node_sys_bp, VA could add
> > > different metrics in future
> >
> > I agree it could be useful for clean support of virtual address mode in future.
> > But, I'd prefer making this as simple and small as possible for the support we
> > will use at the moment.
> >
> > > 4. Optional implementation - ops return 0 if metric unsupported
> >
> > Again, letting core logic having a bit of ops layer information is not a big
> > problem to my humble perspective.
> >
> > So, I'd more prefer not adding the Ops callback, unless you have some other
> > concerns.
> >
>
> Agreed. I'll remove the get_goal_metric() ops callback in v2
Thank you for flexibly accepting my suggestion.
[...]
> > > Advantages of PA-Based Migration
> > > ================================
> > >
> > > PA-based migration with DAMON enables integration of multiple hotness
> > > sources for migration decisions:
> > >
> > > 1. DAMON's native access pattern monitoring
> > > 2. Fault-based information (similar to NUMA Balancing)
> > > 3. Future: Hardware monitoring units (e.g., CXL CHMU)
> > > 4. Future: Instruction-Based Sampling (AMD IBS, Intel PEBS)
> > >
> > > Unlike VA-based approaches tied to individual process address spaces, PA
> > > monitoring can aggregate hotness information from multiple sources to make
> > > system-wide migration decisions across the entire physical memory space.
> >
> > Maybe you are saying about the damon_report_access() based DAMON extension
> > project [1]? Since it is not yet upstreamed, and the long term plan is to
> > support not only physical address but also virtual address space, I think this
> > section is better to be removed, unless the DAMON extension project is merged
> > before this patch series. I expect this patch series will be merged much
> > earlier than the extension project.
> >
>
> Agreed. I'll remove the references to future hotness sources.
Thank you!
[...]
> > > Feedback on the overall approach and design is appreciated.
> >
> > I hope my above comments helps the forward progress of this nice series.
> >
>
> They are very helpful. Thanks a lot.
The pleasure is mine, looking forward to the next version of this patch series
:)
Thanks,
SJ
[...]
Powered by blists - more mailing lists