Message-ID: <804015e3-78f7-47d8-b887-434f0f7edf8a@lucifer.local>
Date: Mon, 28 Jul 2025 06:19:35 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: SeongJae Park <sj@...nel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, Jann Horn <jannh@...gle.com>,
Michal Hocko <mhocko@...e.com>, Mike Rapoport <rppt@...nel.org>,
Pedro Falcato <pfalcato@...e.de>,
Suren Baghdasaryan <surenb@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>, damon@...ts.linux.dev,
kernel-team@...a.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC v2 5/7] mm/memory: implement MM_CP_DAMON
On Sun, Jul 27, 2025 at 01:18:11PM -0700, SeongJae Park wrote:
> DAMON is using the Accessed bits of page table entries as the major
> source of access information. Those bits lack some additional detail,
> such as which CPU was making the access. Page faults could be another
> source for such additional information.
>
> Implement another change_protection() flag for such a use case, namely
> MM_CP_DAMON. DAMON will install PAGE_NONE protections using the flag.
> To avoid interfering with NUMA_BALANCING, which is also using PAGE_NONE
> protection, pass the faults to DAMON only when NUMA_BALANCING is
> disabled.
>
> Signed-off-by: SeongJae Park <sj@...nel.org>
This does not seem to be an upstreamable series right now, unless I'm
missing something.
Firstly, you're making a change in behaviour even when CONFIG_DAMON is not
specified, and Linus already told you we can't have that default-on.
I'm very very much not happy with us doing something for 'damon'
unconditionally when !CONFIG_DAMON, on the assumption that a PROT_NONE
entry in an accessible mapping must belong to DAMON.
This change makes me nervous in general ANYWAY as you are now changing core
mm and introducing a new faulting mechanism specifically for DAMON.
And we are assuming that NUMA balancing being disabled is not racy in a
way that will cause things to break.
I really also dislike the idea of an _implicit_ assumption that we are good
to use the NUMA balancing faulting mechanism to 'tack on' DAMON stuff.
Is it really all that useful to provide a method that requires NUMA
balancing to be disabled?
Finally, you're making it so DAMON can mprotect() something to use the
DAMON/NUMA balancing fault handler, which doesn't appear to check
whether NUMA balancing is disabled, and anyway it could be re-enabled?
And then DAMON is incorrectly making stuff fault as NUMA balancing
faults?
This just seems broken.
Unless there's a really good justification I'm really not inclined for
us to merge this as-is right now, unfortunately.
> ---
> include/linux/mm.h | 1 +
> mm/memory.c | 53 ++++++++++++++++++++++++++++++++++++++++++++--
> mm/mprotect.c | 5 +++++
> 3 files changed, 57 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21270f1664a4..ad92b77bf782 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2567,6 +2567,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
> MM_CP_UFFD_WP_RESOLVE)
/* Comment here :) */
> +#define MM_CP_DAMON (1UL << 4)
Shouldn't this be something more specific like MM_CP_DAMON_REPORT_FAULT?
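Something like this, say (purely illustrative - both the comment wording
and the name are just a sketch):

	/* Report page faults on the covered range to DAMON. */
	#define MM_CP_DAMON_REPORT_FAULT	(1UL << 4)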
>
> bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> diff --git a/mm/memory.c b/mm/memory.c
> index 92fd18a5d8d1..656e610867b0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -75,6 +75,7 @@
> #include <linux/ptrace.h>
> #include <linux/vmalloc.h>
> #include <linux/sched/sysctl.h>
> +#include <linux/damon.h>
>
> #include <trace/events/kmem.h>
>
> @@ -5972,6 +5973,47 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> return VM_FAULT_FALLBACK;
> }
>
> +static vm_fault_t do_damon_page(struct vm_fault *vmf, bool huge_pmd)
You need to explain what this function is doing at least.
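Even just a short comment above it, e.g. (wording is mine, a sketch of
the semantics as I understand them from this patch):

	/*
	 * A PROT_NONE fault on an accessible VMA while NUMA balancing is
	 * disabled is assumed to be a DAMON-installed protection. Rebuild
	 * the mapping as do_numa_page() would and report the access to
	 * DAMON.
	 */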
> +{
> + struct damon_access_report access_report = {
> + .addr = vmf->address,
> + .size = 1,
> + };
> + struct vm_area_struct *vma = vmf->vma;
> + struct folio *folio;
> + pte_t pte, old_pte;
> + bool writable = false, ignore_writable = false;
> + bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> +
> + if (huge_pmd)
> + access_report.addr = PFN_PHYS(pmd_pfn(vmf->orig_pmd));
> + else
> + access_report.addr = PFN_PHYS(pte_pfn(vmf->orig_pte));
> +
> + spin_lock(vmf->ptl);
> + old_pte = ptep_get(vmf->pte);
> + if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return 0;
> + }
> + pte = pte_modify(old_pte, vma->vm_page_prot);
This is a little weird, you're applying the VMA protection bits to the
PTE then later asking about the PTE's protection bits? Can't you just
tell from the VMA flags? Seems like do_numa_page() copy/pasta.
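Roughly what I mean (a sketch only - this ignores the corner cases that
can_change_pte_writable() exists to handle, so may well not fly as-is):

	/* The VMA tells us whether the mapping is meant to be writable. */
	writable = vma->vm_flags & VM_WRITE;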
> + writable = pte_write(pte);
> + if (!writable && pte_write_upgrade &&
> + can_change_pte_writable(vma, vmf->address, pte))
> + writable = true;
> + folio = vm_normal_folio(vma, vmf->address, pte);
> + if (folio && folio_test_large(folio))
> + numa_rebuild_large_mapping(vmf, vma, folio, pte,
> + ignore_writable, pte_write_upgrade);
> + else
> + numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
> + writable);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
All of this seems to be duplicating aspects of do_numa_page().
If it makes sense to accept this change at all (I'm still a bit dubious
about changing the faulting mechanism here), it seems more sensible to
implement a 'do_non_present_accessible()' function that avoids
duplication + bit-rot, with #ifdef CONFIG_DAMON's in place, or even
better, just a hook into damon code that's a static inline no-op when
DAMON is not enabled.
Note that the NUMA fault handling logic is called _even if NUMA
balancing is disabled_. So we'd avoid all the other changes too.
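I.e. something along these lines (a sketch only - damon_report_fault()
is a name I've just made up here):

	/* include/linux/damon.h */
	#ifdef CONFIG_DAMON
	void damon_report_fault(struct vm_fault *vmf, bool huge_pmd);
	#else
	static inline void damon_report_fault(struct vm_fault *vmf,
					      bool huge_pmd)
	{
	}
	#endif

Then the common fault path can call it unconditionally and the whole
thing compiles away when !CONFIG_DAMON.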
> + damon_report_access(&access_report);
> + return 0;
> +}
> +
> /*
> * These routines also need to handle stuff like marking pages dirty
> * and/or accessed for architectures that don't do it in hardware (most
> @@ -6036,8 +6078,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> if (!pte_present(vmf->orig_pte))
> return do_swap_page(vmf);
>
> - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
> + if (sysctl_numa_balancing_mode == NUMA_BALANCING_DISABLED)
> + return do_damon_page(vmf, false);
Making an assumption here that, even if damon is disabled, we should handle
a 'damon' fault is just icky, sorry.
Also are we 100% sure that there are no races here? When we disable
NUMA balancing do we correctly ensure that absolutely no racing NUMA
balancing faults can happen whatsoever at this juncture?
Have you checked that and ensured that to be the case?
You really have to be 100% certain you're not going to wrongly handle NUMA
page faults this way, on a potentially non-CONFIG_DAMON kernel to boot.
Keep in mind fault handling is verrrry racy (purposefully) and can be
done under the VMA read lock alone (and that's only very loosely a
lock!).
This is just, yeah, concerning.
> return do_numa_page(vmf);
> + }
>
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> @@ -6159,8 +6204,12 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return 0;
> }
> if (pmd_trans_huge(vmf.orig_pmd)) {
> - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) {
> + if (sysctl_numa_balancing_mode ==
> + NUMA_BALANCING_DISABLED)
> + return do_damon_page(&vmf, true);
Now we have _more_ weirdness around what CONFIG_TRANSPARENT_HUGEPAGE
enables/disables.
> return do_huge_pmd_numa_page(&vmf);
This function will be called only if THP is enabled, but now damon
overrides that...
Let's try and make this consistent.
> + }
>
> if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
> !pmd_write(vmf.orig_pmd)) {
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 78bded7acf79..e8a76114e4f9 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -714,6 +714,11 @@ long change_protection(struct mmu_gather *tlb,
> WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA);
> #endif
>
> +#ifdef CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
> + if (cp_flags & MM_CP_DAMON)
> + newprot = PAGE_NONE;
> +#endif
OK this just seems broken.
Firstly, again you're introducing behaviour that applies even if DAMON is
disabled. Let's please not.
And predicating on -the architecture even supporting NUMA balancing- seems
rather bizarre?
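If this were to stay at all, at an absolute minimum it would have to be
predicated on DAMON itself, i.e. something like (a sketch):

	#ifdef CONFIG_DAMON
		if (cp_flags & MM_CP_DAMON)
			newprot = PAGE_NONE;
	#endif

Though even then the runtime NUMA balancing interaction below remains.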
Secondly, somebody can disable/enable NUMA balancing at runtime, and
thus when somebody specifies 'DAMON' you are now allowing this function
to install mappings that get NUMA balancing fault handling??
If you don't bother checking whether NUMA balancing is disabled when you
set it, then boom - you've broken the faulting mechanism, but even if you
_do_, somebody can just switch it on again...
Sorry, but unless I'm wildly missing something here, we're going to need
a new faulting mechanism (if we even want to allow DAMON to have its own
fault handling, that is!).
> +
> if (is_vm_hugetlb_page(vma))
> pages = hugetlb_change_protection(vma, start, end, newprot,
> cp_flags);
> --
> 2.39.5
Cheers, Lorenzo