Message-ID: <20250613152517.225529-1-joshua.hahnjy@gmail.com>
Date: Fri, 13 Jun 2025 08:25:09 -0700
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Bijan Tabatabai <bijan311@...il.com>
Cc: damon@...ts.linux.com,
linux-mm@...ck.org,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
sj@...nel.org,
akpm@...ux-foundation.org,
corbet@....net,
david@...hat.com,
ziy@...dia.com,
matthew.brost@...el.com,
joshua.hahnjy@...il.com,
rakie.kim@...com,
byungchul@...com,
gourry@...rry.net,
ying.huang@...ux.alibaba.com,
apopple@...dia.com,
bijantabatab@...ron.com,
venkataravis@...ron.com,
emirakhur@...ron.com,
ajayjoshi@...ron.com,
vtavarespetr@...ron.com
Subject: Re: [RFC PATCH 0/4] mm/damon: Add DAMOS action to interleave data across nodes
On Thu, 12 Jun 2025 13:13:26 -0500 Bijan Tabatabai <bijan311@...il.com> wrote:
> From: Bijan Tabatabai <bijantabatab@...ron.com>
>
> A recent patch set automatically set the interleave weight for each node
> according to the node's maximum bandwidth [1]. In another thread, the patch
> set's author, Joshua Hahn, wondered if/how these weights should be changed
> if the bandwidth utilization of the system changes [2].
Hi Bijan,
Thank you for this patchset, and thank you for taking an interest in my
question!
> This patch set adds the mechanism for dynamically changing how application
> data is interleaved across nodes while leaving the policy of what the
> interleave weights should be to userspace. It does this by adding a new
> DAMOS action: DAMOS_INTERLEAVE. We implement DAMOS_INTERLEAVE with both
> paddr and vaddr operations sets. Using the paddr version is useful for
> managing page placement globally. Using the vaddr version limits tracking
> to one process per kdamond instance, but the vaddr-based tracking better
> captures spatial locality.
>
> DAMOS_INTERLEAVE interleaves pages within a region across nodes using the
> interleave weights at /sys/kernel/mm/mempolicy/weighted_interleave/node<N>
> and the page placement algorithm in weighted_interleave_nid via
> policy_nodemask. We chose to reuse the mempolicy weighted interleave
> infrastructure to avoid reimplementing code. However, this has the awkward
> side effect that only pages that are mapped to processes using
> MPOL_WEIGHTED_INTERLEAVE will be migrated according to new interleave
> weights. This might be fine because workloads that want their data to be
> dynamically interleaved will want their newly allocated data to be
> interleaved at the same ratio.
I think this is generally true. Until a user reports a use case where they
allocate pages with a non-weighted-interleave policy but still want them
placed according to a set weight, we can leave support for other
mempolicies out for now.
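For other reviewers' benefit, here is a minimal sketch (not from the patch
set; the helper name is invented, and I am assuming get_vma_policy() and
policy_nodemask() are reachable from DAMON) of how the placement decision
could lean on the mapping's weighted-interleave policy:

#include <linux/mempolicy.h>
#include <linux/mm.h>

/*
 * Hypothetical helper, for illustration only: pick a destination node
 * for a folio by reusing the mapping's weighted-interleave policy.
 * Returns NUMA_NO_NODE to leave the folio where it is.
 */
static int damos_interleave_target_nid(struct folio *folio,
				       struct vm_area_struct *vma,
				       unsigned long addr)
{
	struct mempolicy *pol;
	pgoff_t ilx;
	int nid = NUMA_NO_NODE;

	/* Resolve the effective policy for this mapping. */
	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
	if (!pol)
		return NUMA_NO_NODE;

	/* Only weighted interleave carries the weights we want to apply. */
	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
		policy_nodemask(GFP_KERNEL, pol, ilx, &nid);

	mpol_cond_put(pol);
	return nid;
}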
> If exposing policy_nodemask is undesirable, we have two alternative methods
> for having DAMON access the interleave weights it should use. We would
> appreciate feedback on which method is preferred.
> 1. Use mpol_misplaced instead
> pros: mpol_misplaced is already exposed publicly
> cons: Would require refactoring mpol_misplaced to take a struct
> vm_area_struct instead of a struct vm_fault, and refactoring both
> mpol_misplaced and get_vma_policy to take a struct task_struct rather
> than just using current. Also requires processes to use
> MPOL_WEIGHTED_INTERLEAVE.
> 2. Add a new field to struct damos, similar to target_nid for the
> MIGRATE_HOT/COLD schemes.
> pros: Keeps changes contained inside DAMON. Would not require processes
> to use MPOL_WEIGHTED_INTERLEAVE.
> cons: Duplicates page placement code. Requires discussion on the sysfs
> interface to use for users to pass in the interleave weights.
Here I agree with SJ's sentiment -- mpol_misplaced is written around the
current task and fault contexts, like you pointed out. Perhaps it is best
to keep the scope of the changes as local as possible :-)
As for duplicating page placement code, I think that is something we can
refine over iterations of this patchset, and maybe SJ will have some great
ideas on how this can best be done as well.
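To make the trade-off concrete, here is a rough sketch of what option 2
might look like. Everything below (the il_weights field, the
damos_interleave_pick_nid helper) is invented for illustration and merely
duplicates the spirit of weighted_interleave_nid:

#include <linux/nodemask.h>
#include <linux/numa.h>

/* Hypothetical extension of struct damos; target_nid already exists. */
struct damos_sketch {
	/* ... existing struct damos fields ... */
	int target_nid;			/* DAMOS_MIGRATE_{HOT,COLD} */
	u8 il_weights[MAX_NUMNODES];	/* invented: per-node weights */
};

/*
 * Weighted round-robin over memory nodes. 'offset' would advance once
 * per folio placed, so consecutive folios spread at the set ratio.
 */
static int damos_interleave_pick_nid(struct damos_sketch *s,
				     unsigned int offset)
{
	unsigned int total = 0;
	int nid;

	for_each_node_state(nid, N_MEMORY)
		total += s->il_weights[nid];
	if (!total)
		return NUMA_NO_NODE;

	offset %= total;
	for_each_node_state(nid, N_MEMORY) {
		if (offset < s->il_weights[nid])
			return nid;
		offset -= s->il_weights[nid];
	}
	return NUMA_NO_NODE;
}

Compared to option 1, nothing here depends on the task's mempolicy, which
matches the "would not require MPOL_WEIGHTED_INTERLEAVE" pro you listed.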
> This patchset was tested on an AMD machine with a NUMA node with CPUs
> attached to DDR memory and a cpu-less NUMA node attached to CXL memory.
> However, this patch set should generalize to other architectures and
> numbers of NUMA nodes.
I think moving the test results to the cover letter will help reviewers
better understand the intent of the work. It would also be very helpful to
include some potential use cases here. That is, what workloads would
benefit from placing pages according to a set ratio, rather than relying on
existing migration policies that adjust placement based on hotness /
coldness?
One such use case that comes to mind is combining this patchset with
weighted interleave auto-tuning. Together they would help alleviate
bandwidth limitations by ensuring that, past the allocation stage, pages
are accessed in a way that maximizes the bandwidth usage of the system (at
the cost of latency, which may not even be a real cost depending on how
bandwidth-bound the workload is).
Thank you again for the amazing patchset! Have a great day :-)
Joshua
Sent using hkml (https://github.com/sjp38/hackermail)