Message-ID: <5ed51639-604c-4e15-84ae-4bf3777f83c1@linux.dev>
Date: Mon, 10 Nov 2025 00:26:26 +0800
From: Lance Yang <lance.yang@...ux.dev>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Christian Borntraeger <borntraeger@...ux.ibm.com>,
Janosch Frank <frankja@...ux.ibm.com>,
Claudio Imbrenda <imbrenda@...ux.ibm.com>,
David Hildenbrand <david@...hat.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>, Peter Xu <peterx@...hat.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>, Arnd Bergmann <arnd@...db.de>,
Zi Yan <ziy@...dia.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>, Nico Pache
<npache@...hat.com>, Ryan Roberts <ryan.roberts@....com>,
Dev Jain <dev.jain@....com>, Barry Song <baohua@...nel.org>,
Muchun Song <muchun.song@...ux.dev>, Oscar Salvador <osalvador@...e.de>,
Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Matthew Brost <matthew.brost@...el.com>,
Joshua Hahn <joshua.hahnjy@...il.com>, Rakie Kim <rakie.kim@...com>,
Byungchul Park <byungchul@...com>, Gregory Price <gourry@...rry.net>,
Ying Huang <ying.huang@...ux.alibaba.com>,
Alistair Popple <apopple@...dia.com>,
Axel Rasmussen <axelrasmussen@...gle.com>, Yuanchu Xie <yuanchu@...gle.com>,
Wei Xu <weixugc@...gle.com>, Kemeng Shi <shikemeng@...weicloud.com>,
Kairui Song <kasong@...cent.com>, Nhat Pham <nphamcs@...il.com>,
Baoquan He <bhe@...hat.com>, Chris Li <chrisl@...nel.org>,
SeongJae Park <sj@...nel.org>, Matthew Wilcox <willy@...radead.org>,
Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>,
Xu Xin <xu.xin16@....com.cn>, Chengming Zhou <chengming.zhou@...ux.dev>,
Jann Horn <jannh@...gle.com>, Miaohe Lin <linmiaohe@...wei.com>,
Naoya Horiguchi <nao.horiguchi@...il.com>, Pedro Falcato <pfalcato@...e.de>,
Pasha Tatashin <pasha.tatashin@...een.com>, Rik van Riel <riel@...riel.com>,
Harry Yoo <harry.yoo@...cle.com>, Hugh Dickins <hughd@...gle.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
linux-s390@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-arch@...r.kernel.org, damon@...ts.linux.dev
Subject: Re: [PATCH v2 01/16] mm: correctly handle UFFD PTE markers
On 2025/11/9 01:08, Lorenzo Stoakes wrote:
> PTE markers were previously only concerned with UFFD-specific logic - that
> is, PTE entries with the UFFD WP marker set or those marked via
> UFFDIO_POISON.
>
> However since the introduction of guard markers in commit
> 7c53dfbdb024 ("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer
> been the case.
>
> Issues have been avoided as guard regions are not permitted in conjunction
> with UFFD, but it still leaves very confusing logic in place, most notably
> the misleading and poorly named pte_none_mostly() and
> huge_pte_none_mostly().
>
> This predicate returns true for PTE entries that ought to be treated as
> none, but only in certain circumstances, and on the assumption we are
> dealing with H/W poison markers or UFFD WP markers.
>
> This patch removes these functions and makes each caller instead
> explicitly check for the specific conditions it cares about.
>
> As part of this effort it introduces is_uffd_pte_marker() to explicitly
> determine if a marker in fact is used as part of UFFD or not.
>
> In the HMM logic we note that the only time we would need to check for a
> fault is in the case of a UFFD WP marker; otherwise we simply encounter a
> fault error (VM_FAULT_HWPOISON for an H/W poisoned marker, VM_FAULT_SIGSEGV
> for a guard marker), so we only check for the UFFD WP case.
>
> While we're here we also refactor code to make it easier to understand.
>
> Reviewed-by: Vlastimil Babka <vbabka@...e.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> ---
> fs/userfaultfd.c | 83 +++++++++++++++++++----------------
> include/asm-generic/hugetlb.h | 8 ----
> include/linux/swapops.h | 18 --------
> include/linux/userfaultfd_k.h | 21 +++++++++
> mm/hmm.c | 2 +-
> mm/hugetlb.c | 47 ++++++++++----------
> mm/mincore.c | 17 +++++--
> mm/userfaultfd.c | 27 +++++++-----
> 8 files changed, 123 insertions(+), 100 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 54c6cc7fe9c6..04c66b5001d5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -233,40 +233,46 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> {
> struct vm_area_struct *vma = vmf->vma;
> pte_t *ptep, pte;
> - bool ret = true;
>
> assert_fault_locked(vmf);
>
> ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
> if (!ptep)
> - goto out;
> + return true;
>
> - ret = false;
> pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);
>
> /*
> * Lockless access: we're in a wait_event so it's ok if it
> - * changes under us. PTE markers should be handled the same as none
> - * ptes here.
> + * changes under us.
> */
> - if (huge_pte_none_mostly(pte))
> - ret = true;
> +
> + /* If missing entry, wait for handler. */
> + if (huge_pte_none(pte))
> + return true;
> + /* UFFD PTE markers require handling. */
> + if (is_uffd_pte_marker(pte))
> + return true;
> + /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
> if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> - ret = true;
> -out:
> - return ret;
> + return true;
> +
> + /* Otherwise, if entry isn't present, let fault handler deal with it. */
> + return false;
> }
> #else
> static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> struct vm_fault *vmf,
> unsigned long reason)
> {
> - return false; /* should never get here */
> + /* Should never get here. */
> + VM_WARN_ON_ONCE(1);
> + return false;
> }
> #endif /* CONFIG_HUGETLB_PAGE */
>
> /*
> - * Verify the pagetables are still not ok after having reigstered into
> + * Verify the pagetables are still not ok after having registered into
> * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
> * userfault that has already been resolved, if userfaultfd_read_iter and
> * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
> @@ -284,53 +290,55 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> pmd_t *pmd, _pmd;
> pte_t *pte;
> pte_t ptent;
> - bool ret = true;
> + bool ret;
>
> assert_fault_locked(vmf);
>
> pgd = pgd_offset(mm, address);
> if (!pgd_present(*pgd))
> - goto out;
> + return true;
> p4d = p4d_offset(pgd, address);
> if (!p4d_present(*p4d))
> - goto out;
> + return true;
> pud = pud_offset(p4d, address);
> if (!pud_present(*pud))
> - goto out;
> + return true;
> pmd = pmd_offset(pud, address);
> again:
> _pmd = pmdp_get_lockless(pmd);
> if (pmd_none(_pmd))
> - goto out;
> + return true;
>
> - ret = false;
> if (!pmd_present(_pmd))
> - goto out;
> + return false;
>
> - if (pmd_trans_huge(_pmd)) {
> - if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
> - ret = true;
> - goto out;
> - }
> + if (pmd_trans_huge(_pmd))
> + return !pmd_write(_pmd) && (reason & VM_UFFD_WP);
>
> pte = pte_offset_map(pmd, address);
> - if (!pte) {
> - ret = true;
> + if (!pte)
> goto again;
> - }
> +
> /*
> * Lockless access: we're in a wait_event so it's ok if it
> - * changes under us. PTE markers should be handled the same as none
> - * ptes here.
> + * changes under us.
> */
> ptent = ptep_get(pte);
> - if (pte_none_mostly(ptent))
> - ret = true;
> +
> + ret = true;
> + /* If missing entry, wait for handler. */
> + if (pte_none(ptent))
> + goto out;
> + /* UFFD PTE markers require handling. */
> + if (is_uffd_pte_marker(ptent))
> + goto out;
> + /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
> if (!pte_write(ptent) && (reason & VM_UFFD_WP))
> - ret = true;
> - pte_unmap(pte);
> + goto out;
>
> + ret = false;
> out:
> + pte_unmap(pte);
> return ret;
> }
>
> @@ -490,12 +498,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> set_current_state(blocking_state);
> spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>
> - if (!is_vm_hugetlb_page(vma))
> - must_wait = userfaultfd_must_wait(ctx, vmf, reason);
> - else
> + if (is_vm_hugetlb_page(vma)) {
> must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
> - if (is_vm_hugetlb_page(vma))
> hugetlb_vma_unlock_read(vma);
> + } else {
> + must_wait = userfaultfd_must_wait(ctx, vmf, reason);
> + }
> +
> release_fault_lock(vmf);
>
> if (likely(must_wait && !READ_ONCE(ctx->released))) {
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index dcb8727f2b82..e1a2e1b7c8e7 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -97,14 +97,6 @@ static inline int huge_pte_none(pte_t pte)
> }
> #endif
>
> -/* Please refer to comments above pte_none_mostly() for the usage */
> -#ifndef __HAVE_ARCH_HUGE_PTE_NONE_MOSTLY
> -static inline int huge_pte_none_mostly(pte_t pte)
> -{
> - return huge_pte_none(pte) || is_pte_marker(pte);
> -}
> -#endif
> -
> #ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
> static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 2687928a8146..d1f665935cfc 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -469,24 +469,6 @@ static inline int is_guard_swp_entry(swp_entry_t entry)
> (pte_marker_get(entry) & PTE_MARKER_GUARD);
> }
>
> -/*
> - * This is a special version to check pte_none() just to cover the case when
> - * the pte is a pte marker. It existed because in many cases the pte marker
> - * should be seen as a none pte; it's just that we have stored some information
> - * onto the none pte so it becomes not-none any more.
> - *
> - * It should be used when the pte is file-backed, ram-based and backing
> - * userspace pages, like shmem. It is not needed upon pgtables that do not
> - * support pte markers at all. For example, it's not needed on anonymous
> - * memory, kernel-only memory (including when the system is during-boot),
> - * non-ram based generic file-system. It's fine to be used even there, but the
> - * extra pte marker check will be pure overhead.
> - */
> -static inline int pte_none_mostly(pte_t pte)
> -{
> - return pte_none(pte) || is_pte_marker(pte);
> -}
> -
> static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> {
> struct page *p = pfn_to_page(swp_offset_pfn(entry));
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index c0e716aec26a..da0b4fcc566f 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -479,4 +479,25 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte)
> return false;
> }
>
> +
> +static inline bool is_uffd_pte_marker(pte_t pte)
> +{
> + swp_entry_t entry;
> +
> + if (pte_present(pte))
> + return false;
> +
> + entry = pte_to_swp_entry(pte);
> + if (!is_pte_marker_entry(entry))
> + return false;
> +
> + /* UFFD WP, poisoned swap entries are UFFD handled. */
> + if (pte_marker_entry_uffd_wp(entry))
> + return true;
> + if (is_poisoned_swp_entry(entry))
> + return true;
> +
> + return false;
> +}
> +
> #endif /* _LINUX_USERFAULTFD_K_H */
> diff --git a/mm/hmm.c b/mm/hmm.c
> index a56081d67ad6..43d4a91035ff 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -244,7 +244,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> uint64_t pfn_req_flags = *hmm_pfn;
> uint64_t new_pfn_flags = 0;
>
> - if (pte_none_mostly(pte)) {
> + if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
> required_fault =
> hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
> if (required_fault)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1ea459723cce..01c784547d1e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6743,29 +6743,28 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> }
>
> vmf.orig_pte = huge_ptep_get(mm, vmf.address, vmf.pte);
> - if (huge_pte_none_mostly(vmf.orig_pte)) {
> - if (is_pte_marker(vmf.orig_pte)) {
> - pte_marker marker =
> - pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
> -
> - if (marker & PTE_MARKER_POISONED) {
> - ret = VM_FAULT_HWPOISON_LARGE |
> - VM_FAULT_SET_HINDEX(hstate_index(h));
> - goto out_mutex;
> - } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
> - /* This isn't supported in hugetlb. */
> - ret = VM_FAULT_SIGSEGV;
> - goto out_mutex;
> - }
> - }
> -
> + if (huge_pte_none(vmf.orig_pte))
> /*
> - * Other PTE markers should be handled the same way as none PTE.
> - *
> * hugetlb_no_page will drop vma lock and hugetlb fault
> * mutex internally, which make us return immediately.
> */
> return hugetlb_no_page(mapping, &vmf);
> +
> + if (is_pte_marker(vmf.orig_pte)) {
> + const pte_marker marker =
> + pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
> +
> + if (marker & PTE_MARKER_POISONED) {
> + ret = VM_FAULT_HWPOISON_LARGE |
> + VM_FAULT_SET_HINDEX(hstate_index(h));
> + goto out_mutex;
> + } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
> + /* This isn't supported in hugetlb. */
> + ret = VM_FAULT_SIGSEGV;
> + goto out_mutex;
> + }
> +
> + return hugetlb_no_page(mapping, &vmf);
> }
>
> ret = 0;
> @@ -6934,6 +6933,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> int ret = -ENOMEM;
> struct folio *folio;
> bool folio_in_pagecache = false;
> + pte_t dst_ptep;
>
> if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
> ptl = huge_pte_lock(h, dst_mm, dst_pte);
> @@ -7073,13 +7073,14 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> if (folio_test_hwpoison(folio))
> goto out_release_unlock;
>
> + ret = -EEXIST;
> +
> + dst_ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
> /*
> - * We allow to overwrite a pte marker: consider when both MISSING|WP
> - * registered, we firstly wr-protect a none pte which has no page cache
> - * page backing it, then access the page.
> + * See comment about UFFD marker overwriting in
> + * mfill_atomic_install_pte().
> */
> - ret = -EEXIST;
> - if (!huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte)))
> + if (!huge_pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
> goto out_release_unlock;
>
> if (folio_in_pagecache)
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 8ec4719370e1..151b2dbb783b 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -32,11 +32,22 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
> spinlock_t *ptl;
>
> ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
> +
> /*
> * Hugepages under user process are always in RAM and never
> * swapped out, but theoretically it needs to be checked.
> */
> - present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte));
> + if (!pte) {
> + present = 0;
> + } else {
> + const pte_t ptep = huge_ptep_get(walk->mm, addr, pte);
> +
> + if (huge_pte_none(ptep) || is_pte_marker(ptep))
> + present = 0;
> + else
> + present = 1;
> + }
> +
> for (; addr != end; vec++, addr += PAGE_SIZE)
> *vec = present;
> walk->private = vec;
> @@ -175,8 +186,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte_t pte = ptep_get(ptep);
>
> step = 1;
> - /* We need to do cache lookup too for pte markers */
> - if (pte_none_mostly(pte))
> + /* We need to do cache lookup too for UFFD pte markers */
> + if (pte_none(pte) || is_uffd_pte_marker(pte))
Seems like something has changed here: the new is_uffd_pte_marker()
check will miss non-UFFD markers (like guard markers), which would then
fall through to the swap entry logic and be misreported as resident by
mincore_swap().
```
/* We need to do cache lookup too for UFFD pte markers */
if (pte_none(pte) || is_uffd_pte_marker(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte)) {
unsigned int batch = pte_batch_hint(ptep, pte);
if (batch > 1) {
unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
step = min_t(unsigned int, batch, max_nr);
}
for (i = 0; i < step; i++)
vec[i] = 1;
} else { /* pte is a swap entry */
*vec = mincore_swap(pte_to_swp_entry(pte), false);
}
```
Wouldn't the generic is_pte_marker() be safer here?
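
Just a rough sketch of what I mean (untested):

```
		/* We need to do cache lookup too for pte markers */
		if (pte_none(pte) || is_pte_marker(pte))
			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
						 vma, vec);
```

That way a guard marker would still take the unmapped-range path
instead of falling through to mincore_swap().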
Thanks,
Lance
> __mincore_unmapped_range(addr, addr + PAGE_SIZE,
> vma, vec);
> else if (pte_present(pte)) {
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 00122f42718c..cc4ce205bbec 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -178,6 +178,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> spinlock_t *ptl;
> struct folio *folio = page_folio(page);
> bool page_in_cache = folio_mapping(folio);
> + pte_t dst_ptep;
>
> _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> _dst_pte = pte_mkdirty(_dst_pte);
> @@ -199,12 +200,15 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> }
>
> ret = -EEXIST;
> +
> + dst_ptep = ptep_get(dst_pte);
> +
> /*
> - * We allow to overwrite a pte marker: consider when both MISSING|WP
> - * registered, we firstly wr-protect a none pte which has no page cache
> - * page backing it, then access the page.
> + * We are allowed to overwrite a UFFD pte marker: consider when both
> + * MISSING|WP registered, we firstly wr-protect a none pte which has no
> + * page cache page backing it, then access the page.
> */
> - if (!pte_none_mostly(ptep_get(dst_pte)))
> + if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
> goto out_unlock;
>
> if (page_in_cache) {
> @@ -583,12 +587,15 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> goto out_unlock;
> }
>
> - if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
> - !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) {
> - err = -EEXIST;
> - hugetlb_vma_unlock_read(dst_vma);
> - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - goto out_unlock;
> + if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> + const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
> +
> + if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
> + err = -EEXIST;
> + hugetlb_vma_unlock_read(dst_vma);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + goto out_unlock;
> + }
> }
>
> err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr,