Message-ID: <3b7ff190-4efe-47d0-82fb-68135a031b0f@kernel.org>
Date: Mon, 8 Dec 2025 12:19:41 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: SeongJae Park <sj@...nel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>, Jann Horn <jannh@...gle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Michal Hocko
<mhocko@...e.com>, Mike Rapoport <rppt@...nel.org>,
Pedro Falcato <pfalcato@...e.de>, Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [RFC PATCH v3 05/37] mm/{mprotect,memory}: (no upstream-aimed
hack) implement MM_CP_DAMON
On 12/8/25 07:29, SeongJae Park wrote:
> Note that this is not upstreamable as-is. It is only meant to help the
> discussion of the other changes in this series.
>
> DAMON uses the Accessed bits of page table entries as its major source
> of access information. That source lacks some additional details, such
> as which CPU made the access. Page faults could provide such
> additional information.
>
> Implement another change_protection() flag for such use cases, namely
> MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
> To avoid interfering with NUMA_BALANCING, which also uses PAGE_NONE
> protections, pass the faults to DAMON only when NUMA_BALANCING is
> disabled.
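
(Side note for context, not part of the patch: the DAMON-side caller is
not in this hunk. I assume it would mirror change_prot_numa() in
mm/mempolicy.c, roughly like the completely untested sketch below;
damon_protect_region() is a made-up name.)

static long damon_protect_region(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end)
{
	struct mmu_gather tlb;
	long nr_updated;

	/* Caller must hold the mmap lock, as for change_prot_numa(). */
	mmap_assert_locked(vma->vm_mm);

	/* Install PAGE_NONE so the next access faults into do_damon_page(). */
	tlb_gather_mmu(&tlb, vma->vm_mm);
	nr_updated = change_protection(&tlb, vma, start, end, MM_CP_DAMON);
	tlb_finish_mmu(&tlb);

	return nr_updated;
}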
>
> Again, this is not upstreamable as-is. There were comments on the
> previous version, and I was unable to find time to address them, so
> this version does not address any of those previous comments. I'm
> sending it anyway to help the discussion of the other patches in this
> series. Please forgive me for adding this to your inbox without
> addressing your comments, and feel free to ignore it. I will start a
> separate discussion for this part later.
>
> Signed-off-by: SeongJae Park <sj@...nel.org>
> ---
> include/linux/mm.h | 1 +
> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++--
> mm/mprotect.c | 5 ++++
> 3 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 553cf9f438f1..2cba5a0196da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2848,6 +2848,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> MM_CP_UFFD_WP_RESOLVE)
> +#define MM_CP_DAMON (1UL << 4)
>
> bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> diff --git a/mm/memory.c b/mm/memory.c
> index 6675e87eb7dd..5dc85adb1e59 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -78,6 +78,7 @@
> #include <linux/sched/sysctl.h>
> #include <linux/pgalloc.h>
> #include <linux/uaccess.h>
> +#include <linux/damon.h>
>
> #include <trace/events/kmem.h>
>
> @@ -6172,6 +6173,54 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> return VM_FAULT_FALLBACK;
> }
>
> +/*
> + * NOTE: This is only a PoC-purpose "hack" that will not be upstreamed as-is.
> + * More discussion between all stakeholders, including the maintainers of MM
> + * core, NUMA balancing, and DAMON, is needed to make this upstreamable.
> + * (https://lore.kernel.org/20251128193947.80866-1-sj@kernel.org)
> + *
> + * This function is called from the page fault handler, for page faults on
> + * P{TE,MD}-protected but vma-accessible pages. DAMON installs the fake
> + * protection for access sampling purposes. This function simply clears the
> + * protection and reports the access to DAMON, by calling
> + * damon_report_page_fault().
> + *
> + * The protection-clearing code is copied from the NUMA fault handling code
> + * for PTEs. Again, this is only a PoC-purpose "hack" to show what information
> + * DAMON wants from page fault events, rather than an upstream-aimed version.
> + */
> +static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct folio *folio;
> + pte_t pte, old_pte;
> + bool writable = false, ignore_writable = false;
> + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> +
> + spin_lock(vmf->ptl);
> + old_pte = ptep_get(vmf->pte);
> + if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return 0;
> + }
> + pte = pte_modify(old_pte, vma->vm_page_prot);
> + writable = pte_write(pte);
> + if (!writable && pte_write_upgrade &&
> + can_change_pte_writable(vma, vmf->address, pte))
> + writable = true;
> + folio = vm_normal_folio(vma, vmf->address, pte);
> + if (folio && folio_test_large(folio))
> + numa_rebuild_large_mapping(vmf, vma, folio, pte,
> + ignore_writable, pte_write_upgrade);
> + else
> + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
> + writable);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> + damon_report_page_fault(vmf, huge_pmd);
> + return 0;
> +}
> +
> /*
> * These routines also need to handle stuff like marking pages dirty
> * and/or accessed for architectures that don't do it in hardware (most
> @@ -6236,8 +6285,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> if (!pte_present(vmf->orig_pte))
> return do_swap_page(vmf);
>
> - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
> + if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED)
> + return do_damon_page(vmf, false);
> return do_numa_page(vmf);
> + }
>
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> @@ -6363,8 +6415,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return 0;
> }
> if (pmd_trans_huge(vmf.orig_pmd)) {
> - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
> + if (sysctl_numa_balancing_mode ==
> + NUMA_BALANCING_DISABLED)
> + return do_damon_page(&vmf, true);
> return do_huge_pmd_numa_page(&vmf);
> + }
I recall that we had a similar discussion already. Ah, it was around
the arm64 MTE tag storage reuse work [1].
The idea was to let do_*_numa_page() handle the restoring so we don't
end up with such duplicated code.
[1]
https://lore.kernel.org/all/20240125164256.4147-1-alexandru.elisei@arm.com/
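
Roughly what I mean (completely untested sketch; the PMD side would get
the same treatment): keep routing those faults through do_numa_page(),
and have it hand the event to DAMON once the mapping has been restored,
instead of duplicating the restore logic in a separate do_damon_page().
Something like a small helper, called right after the mapping rebuild;
damon_hinting_fault() is just a name I made up:

/*
 * Called from do_numa_page()/do_huge_pmd_numa_page() after the protnone
 * mapping has been restored.  Returns true when the fault was a DAMON
 * sampling fault and the NUMA hinting logic should be skipped.
 */
static bool damon_hinting_fault(struct vm_fault *vmf, bool huge_pmd)
{
	if (sysctl_numa_balancing_mode != NUMA_BALANCING_DISABLED)
		return false;

	/* protnone came from MM_CP_DAMON, not from NUMA hinting. */
	damon_report_page_fault(vmf, huge_pmd);
	return true;
}
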
--
Cheers
David