lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <0032ac8b-06ba-4f4b-ad66-f0195eea1c15@kernel.org>
Date: Mon, 9 Feb 2026 16:35:47 +0100
From: "David Hildenbrand (Arm)" <david@...nel.org>
To: Peter Xu <peterx@...hat.com>, Mike Rapoport <rppt@...nel.org>
Cc: linux-mm@...ck.org, Andrea Arcangeli <aarcange@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Axel Rasmussen <axelrasmussen@...gle.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, Hugh Dickins
 <hughd@...gle.com>, James Houghton <jthoughton@...gle.com>,
 "Liam R. Howlett" <Liam.Howlett@...cle.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Michal Hocko
 <mhocko@...e.com>, Muchun Song <muchun.song@...ux.dev>,
 Nikita Kalyazin <kalyazin@...zon.com>, Oscar Salvador <osalvador@...e.de>,
 Paolo Bonzini <pbonzini@...hat.com>, Sean Christopherson
 <seanjc@...gle.com>, Shuah Khan <shuah@...nel.org>,
 Suren Baghdasaryan <surenb@...gle.com>, Vlastimil Babka <vbabka@...e.cz>,
 linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
 linux-kselftest@...r.kernel.org
Subject: Re: [PATCH RFC 00/17] mm, kvm: allow uffd suppot in guest_memfd

On 2/3/26 21:56, Peter Xu wrote:
> On Tue, Jan 27, 2026 at 09:29:19PM +0200, Mike Rapoport wrote:
>> From: "Mike Rapoport (Microsoft)" <rppt@...nel.org>
>>
>> Hi,
>>
>> These patches enable support for userfaultfd in guest_memfd.
>> They are quite different from the latest posting [1] so I'm restarting the
>> versioning. As there was a lot of tension around the topic, this is an RFC
>> to get some feedback and see how we can move forward.
>>
>> As the ground work I refactored userfaultfd handling of PTE-based memory types
>> (anonymous and shmem) and converted them to use vm_uffd_ops for allocating a
>> folio or getting an existing folio from the page cache. shmem also implements
>> callbacks that add a folio to the page cache after the data passed in
>> UFFDIO_COPY was copied and remove the folio from the page cache if page table
>> update fails.
>>
>> In order for guest_memfd to notify userspace about page faults, there are new
>> VM_FAULT_UFFD_MINOR and VM_FAULT_UFFD_MISSING that a ->fault() handler can
>> return to inform the page fault handler that it needs to call
>> handle_userfault() to complete the fault.
>>
>> Nikita helped to plumb these new goodies into guest_memfd and provided basic
>> tests to verify that guest_memfd works with userfaultfd.
>>
>> I deliberately left hugetlb out, at least for the most part.
>> hugetlb handles acquisition of VMA and more importantly establishing of parent
>> page table entry differently than PTE-based memory types. This is a different
>> abstraction level than what vm_uffd_ops provides and people objected to
>> exposing such low level APIs as a part of VMA operations.
>>
>> Also, to enable uffd in guest_memfd refactoring of hugetlb is not needed and I
>> prefer to delay it until the dust settles after the changes in this set.
>>
>> [1] https://lore.kernel.org/all/20251130111812.699259-1-rppt@kernel.org
>>
>> Mike Rapoport (Microsoft) (12):
>>    userfaultfd: introduce mfill_copy_folio_locked() helper
>>    userfaultfd: introduce struct mfill_state
>>    userfaultfd: introduce mfill_get_pmd() helper.
>>    userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
>>    userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy()
>>    userfaultfd: move vma_can_userfault out of line
>>    userfaultfd: introduce vm_uffd_ops
>>    userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
>>    userfaultfd: introduce vm_uffd_ops->alloc_folio()
>>    shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
>>    userfaultfd: mfill_atomic() remove retry logic
>>    mm: introduce VM_FAULT_UFFD_MINOR fault reason
>>
>> Nikita Kalyazin (5):
>>    mm: introduce VM_FAULT_UFFD_MISSING fault reason
>>    KVM: guest_memfd: implement userfaultfd minor mode
>>    KVM: guest_memfd: implement userfaultfd missing mode
>>    KVM: selftests: test userfaultfd minor for guest_memfd
>>    KVM: selftests: test userfaultfd missing for guest_memfd
>>
>>   include/linux/mm.h                            |   5 +
>>   include/linux/mm_types.h                      |  15 +-
>>   include/linux/shmem_fs.h                      |  14 -
>>   include/linux/userfaultfd_k.h                 |  74 +-
>>   mm/hugetlb.c                                  |  21 +
>>   mm/memory.c                                   |   8 +-
>>   mm/shmem.c                                    | 188 +++--
>>   mm/userfaultfd.c                              | 671 ++++++++++--------
>>   .../testing/selftests/kvm/guest_memfd_test.c  | 191 +++++
>>   virt/kvm/guest_memfd.c                        | 134 +++-
>>   10 files changed, 871 insertions(+), 450 deletions(-)
> 
> Mike,
> 
> The idea looks good to me, thanks for this work!  Your process on
> UFFDIO_COPY over anon/shmem is nice to me.
> 
> If you remember, I used to raise a concern on introducing two new fault
> retvals only for userfaultfd:
> 
> https://lore.kernel.org/all/aShb8J18BaRrsA-u@x1.local/
> 
> IMHO they're not only unnecessarily leaking userfaultfd information into
> fault core definitions, but also cause code duplications.  I still think we
> should avoid them.
> 
> This time, I've attached a smoke tested patch removing both of them.
> 
> It's pretty small and it runs all fine with all old/new userfaultfd tests
> (including gmem ones). Feel free to have a look at the end.
> 
> I understand you want to avoid adding mnore complexity to this series, if
> you want I can also prepare such a patch after this series landed to remove
> the two retvals.  I'd still would like to know how you think about it,
> though, let me know if you have any comments.
> 
> Note that it may indeed need some perf tests to make sure there's zero
> overhead after this change.  Currently there's still some trivial overheads
> (e.g. unnecessary folio locks), but IIUC we can even avoid that.
> 
> Thanks,
> 
> ===8<===
>  From 5379d084494b17281f3e5365104a7edbdbe53759 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@...hat.com>
> Date: Tue, 3 Feb 2026 15:07:58 -0500
> Subject: [PATCH] mm/userfaultfd: Remove two userfaultfd fault retvals
> 
> They're not needed when with vm_uffd_ops.  We can remove both of them.
> Actually, another side benefit is drivers do not need to process
> userfaultfd missing / minor faults anymore in the main fault handler.
> 
> This patch will make get_folio_noalloc() required for either MISSING or
> MINOR fault, but that's not a problem, as it should be lightweight and the
> current only outside-mm user (gmem) will support both anyway.
> 
> Signed-off-by: Peter Xu <peterx@...hat.com>
> ---
>   include/linux/mm_types.h      | 15 +-----------
>   include/linux/userfaultfd_k.h |  2 +-
>   mm/memory.c                   | 45 +++++++++++++++++++++++++++++------
>   mm/shmem.c                    | 12 ----------
>   virt/kvm/guest_memfd.c        | 20 ----------------
>   5 files changed, 40 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index a6d32470a78a3..3cc8ae7228860 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1612,10 +1612,6 @@ typedef __bitwise unsigned int vm_fault_t;
>    *				fsync() to complete (for synchronous page faults
>    *				in DAX)
>    * @VM_FAULT_COMPLETED:		->fault completed, meanwhile mmap lock released
> - * @VM_FAULT_UFFD_MINOR:	->fault did not modify page tables and needs
> - *				handle_userfault(VM_UFFD_MINOR) to complete
> - * @VM_FAULT_UFFD_MISSING:	->fault did not modify page tables and needs
> - *				handle_userfault(VM_UFFD_MISSING) to complete
>    * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
>    *
>    */
> @@ -1633,13 +1629,6 @@ enum vm_fault_reason {
>   	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
>   	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
>   	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
> -#ifdef CONFIG_USERFAULTFD
> -	VM_FAULT_UFFD_MINOR	= (__force vm_fault_t)0x008000,
> -	VM_FAULT_UFFD_MISSING	= (__force vm_fault_t)0x010000,
> -#else
> -	VM_FAULT_UFFD_MINOR	= (__force vm_fault_t)0x000000,
> -	VM_FAULT_UFFD_MISSING	= (__force vm_fault_t)0x000000,
> -#endif
>   	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
>   };
>   
> @@ -1664,9 +1653,7 @@ enum vm_fault_reason {
>   	{ VM_FAULT_FALLBACK,            "FALLBACK" },	\
>   	{ VM_FAULT_DONE_COW,            "DONE_COW" },	\
>   	{ VM_FAULT_NEEDDSYNC,           "NEEDDSYNC" },	\
> -	{ VM_FAULT_COMPLETED,           "COMPLETED" },	\
> -	{ VM_FAULT_UFFD_MINOR,		"UFFD_MINOR" }, \
> -	{ VM_FAULT_UFFD_MISSING,	"UFFD_MISSING" }
> +	{ VM_FAULT_COMPLETED,           "COMPLETED" }
>   
>   struct vm_special_mapping {
>   	const char *name;	/* The name, e.g. "[vdso]". */
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 75d5b09f2560c..5923e32de53b5 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -85,7 +85,7 @@ struct vm_uffd_ops {
>   	/* Checks if a VMA can support userfaultfd */
>   	bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags);
>   	/*
> -	 * Called to resolve UFFDIO_CONTINUE request.
> +	 * Required by any uffd driver for either MISSING or MINOR fault.
>   	 * Should return the folio found at pgoff in the VMA's pagecache if it
>   	 * exists or ERR_PTR otherwise.
>   	 * The returned folio is locked and with reference held.
> diff --git a/mm/memory.c b/mm/memory.c
> index 456344938c72b..098febb761acc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5338,6 +5338,33 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>   	return VM_FAULT_OOM;
>   }
>   
> +static vm_fault_t fault_process_userfaultfd(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct inode *inode = file_inode(vma->vm_file);
> +	/*
> +	 * NOTE: we could double check this hook present when
> +	 * UFFDIO_REGISTER on MISSING or MINOR for a file driver.
> +	 */
> +	struct folio *folio =
> +	    vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff);
> +
> +	if (!IS_ERR_OR_NULL(folio)) {
> +		/*
> +		 * TODO: provide a flag for get_folio_noalloc() to avoid
> +		 * locking (or even the extra reference?)
> +		 */
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		if (userfaultfd_minor(vma))
> +			return handle_userfault(vmf, VM_UFFD_MINOR);
> +	} else {
> +		return handle_userfault(vmf, VM_UFFD_MISSING);
> +	}
> +
> +	return 0;
> +}
> +
>   /*
>    * The mmap_lock must have been held on entry, and may have been
>    * released depending on flags and vma->vm_ops->fault() return value.
> @@ -5370,16 +5397,20 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>   			return VM_FAULT_OOM;
>   	}
>   
> +	/*
> +	 * If this is an userfaultfd trap, process it in advance before
> +	 * triggering the genuine fault handler.
> +	 */
> +	if (userfaultfd_missing(vma) || userfaultfd_minor(vma)) {
> +		ret = fault_process_userfaultfd(vmf);
> +		if (ret)
> +			return ret;
> +	}
> +
>   	ret = vma->vm_ops->fault(vmf);
>   	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> -			    VM_FAULT_DONE_COW | VM_FAULT_UFFD_MINOR |
> -			    VM_FAULT_UFFD_MISSING))) {
> -		if (ret & VM_FAULT_UFFD_MINOR)
> -			return handle_userfault(vmf, VM_UFFD_MINOR);
> -		if (ret & VM_FAULT_UFFD_MISSING)
> -			return handle_userfault(vmf, VM_UFFD_MISSING);
> +			    VM_FAULT_DONE_COW)))
>   		return ret;
> -	}
>   
>   	folio = page_folio(vmf->page);
>   	if (unlikely(PageHWPoison(vmf->page))) {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index eafd7986fc2ec..5286f28b3e443 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2484,13 +2484,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>   	fault_mm = vma ? vma->vm_mm : NULL;
>   
>   	folio = filemap_get_entry(inode->i_mapping, index);
> -	if (folio && vma && userfaultfd_minor(vma)) {
> -		if (!xa_is_value(folio))
> -			folio_put(folio);
> -		*fault_type = VM_FAULT_UFFD_MINOR;
> -		return 0;
> -	}
> -
>   	if (xa_is_value(folio)) {
>   		error = shmem_swapin_folio(inode, index, &folio,
>   					   sgp, gfp, vma, fault_type);
> @@ -2535,11 +2528,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>   	 * Fast cache lookup and swap lookup did not find it: allocate.
>   	 */
>   
> -	if (vma && userfaultfd_missing(vma)) {
> -		*fault_type = VM_FAULT_UFFD_MISSING;
> -		return 0;
> -	}
> -
>   	/* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */
>   	orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false);
>   	if (orders > 0) {
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 14cca057fc0ec..bd0de685f42f8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -421,26 +421,6 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>   	folio = __filemap_get_folio(inode->i_mapping, vmf->pgoff,
>   				    FGP_LOCK | FGP_ACCESSED, 0);
>   
> -	if (userfaultfd_armed(vmf->vma)) {
> -		/*
> -		 * If userfaultfd is registered in minor mode and a folio
> -		 * exists, return VM_FAULT_UFFD_MINOR to trigger the
> -		 * userfaultfd handler.
> -		 */
> -		if (userfaultfd_minor(vmf->vma) && !IS_ERR_OR_NULL(folio)) {
> -			ret = VM_FAULT_UFFD_MINOR;
> -			goto out_folio;
> -		}
> -
> -		/*
> -		 * Check if userfaultfd is registered in missing mode. If so,
> -		 * check if a folio exists in the page cache. If not, return
> -		 * VM_FAULT_UFFD_MISSING to trigger the userfaultfd handler.
> -		 */
> -		if (userfaultfd_missing(vmf->vma) && IS_ERR_OR_NULL(folio))
> -			return VM_FAULT_UFFD_MISSING;
> -	}
> -
>   	/* folio not in the pagecache, try to allocate */
>   	if (IS_ERR(folio))
>   		folio = __kvm_gmem_folio_alloc(inode, vmf->pgoff);

That looks better in general. We should likely find a better/more 
consistent name for fault_process_userfaultfd().

-- 
Cheers,

David

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ