Message-ID: <2097f155-c459-40e1-93e8-3d501ae66b42@amazon.com>
Date: Fri, 20 Jun 2025 13:00:24 +0100
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Peter Xu <peterx@...hat.com>
CC: <akpm@...ux-foundation.org>, <pbonzini@...hat.com>, <shuah@...nel.org>,
<viro@...iv.linux.org.uk>, <brauner@...nel.org>, <muchun.song@...ux.dev>,
<hughd@...gle.com>, <kvm@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<linux-fsdevel@...r.kernel.org>, <jack@...e.cz>,
<lorenzo.stoakes@...cle.com>, <Liam.Howlett@...cle.com>, <jannh@...gle.com>,
<ryan.roberts@....com>, <david@...hat.com>, <jthoughton@...gle.com>,
<graf@...zon.de>, <jgowans@...zon.com>, <roypat@...zon.co.uk>,
<derekmn@...zon.com>, <nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On 11/06/2025 13:56, Peter Xu wrote:
> On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
>>
>>
>> On 10/06/2025 23:22, Peter Xu wrote:
>>> On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
>>>> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
>>>> non-huge pages by calling vm_ops->fault(). A new VMF flag,
>>>> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
>>>> handle_userfault().
>>>
>>> It's not clear yet on why this is needed to be generalized out of the blue.
>>>
>>> Some mention of the guest_memfd use case might help other reviewers, or
>>> some mention of the need to introduce userfaultfd support in kernel
>>> modules.
>>
>> Hi Peter,
>>
>> Sounds fair, thank you.
>>
>>>>
>>>> Suggested-by: James Houghton <jthoughton@...gle.com>
>>>> Signed-off-by: Nikita Kalyazin <kalyazin@...zon.com>
>>>> ---
>>>> include/linux/mm_types.h | 4 ++++
>>>> mm/hugetlb.c | 2 +-
>>>> mm/shmem.c | 9 ++++++---
>>>> mm/userfaultfd.c | 37 +++++++++++++++++++++++++++----------
>>>> 4 files changed, 38 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 0234f14f2aa6..2f26ee9742bf 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>>>> * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>>>> * We should only access orig_pte if this flag set.
>>>> * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
>>>> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
>>>> + * minor handler as it is being called by the
>>>> + * userfaultfd code itself.
>>>
>>> We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
>>> as it's not easy to follow without userfault minor context. It might
>>> be better to use a generic term like NO_USERFAULT.
>>
>> Yes, I agree, can name it more generically.
>>
>>> That said, I wonder, if we'll need to add a vm_ops hook in the latter
>>> patch anyway, whether we can also resolve the page faults using that
>>> vm_ops hook instead of reusing fault(). That might be helpful because then
>>> we can avoid this new FAULT_FLAG_* that is of no use to
>>> non-userfault users, and meanwhile we also don't need to hand-cook the vm_fault
>>> struct below just to suit the current fault() interface.
>>
>> I'm not sure I fully understand that. Calling the fault() op helps us reuse the
>> FS specifics when resolving the fault. I get that the new op can imply the
>> userfault flag so the flag doesn't need to be exposed to mm, but doing so
>> will duplicate logic within FSes between this new op and
>> fault(), unless we attempt to factor the common parts out. For example, for
>> shmem_get_folio_gfp(), we would still need a way to suppress the
>> call to handle_userfault() when shmem_get_folio_gfp() is called from the new
>> op. Is that what you're proposing?
>
> Yes, that is what I was proposing. shmem_get_folio_gfp() already has that
> handling: when vmf==NULL, vma==NULL and userfault will be skipped.
>
> So what I was thinking is one vm_ops.userfaultfd_request(req), where req
> can be:
>
> (1) UFFD_REQ_GET_SUPPORTED: this should, for existing RAM-FSes, return
> MISSING/WP/MINOR. Here WP should mean sync-wp tracking; async was
> so far by default supported almost everywhere except
> VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
> for now (even if I think it shouldn't be hard to support the other
> two..).
>
> (2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be well
> defined to suit userfault's needs on fault resolution. It likely
> doesn't need vmf as a parameter, but rather (when anon isn't taken
> into account; after all, anon has vm_ops==NULL..) the inode and
> offsets; perhaps some flag would be needed to distinguish MISSING and
> MINOR faults, for example.
>
> Maybe some more.
>
> I was even thinking whether we could merge hugetlb into the picture too and
> generalize its fault resolution. Hugetlb was always special; maybe this is
> a chance to generalize it too, but it doesn't need to happen in one
> shot even if it could work. We could start with shmem.
>
> So this does sound slightly involved, and I'm not yet 100% sure it
> will work, but likely it will. If you want, I can take a stab at it this week or
> next just to see whether it works in general. I also don't expect this
> to depend on guest-memfd at all - it can stand alone as a refactoring making
> userfault module-ready.
Thanks for explaining that. I played a bit with it myself and it
appears to work for the MISSING mode for both shmem and
guest_memfd. I'm attaching my sketch below; please let me know if this is
how you see it.
I found that the arguments and return values differ significantly
between the two request types, which may be a bit confusing, although we
do not expect many callers of them.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..eb30b23b24d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -603,6 +603,16 @@ struct vm_fault {
*/
};
+#ifdef CONFIG_USERFAULTFD
+/*
+ * Used by userfaultfd_request().
+ */
+enum uffd_req {
+ UFFD_REQ_GET_SUPPORTED, /* query supported userfaultfd modes */
+ UFFD_REQ_FAULT_RESOLVE, /* request fault resolution */
+};
+#endif
+
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -680,6 +690,22 @@ struct vm_operations_struct {
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
unsigned long addr);
+
+#ifdef CONFIG_USERFAULTFD
+ /*
+ * Called by the userfaultfd code to query supported modes or request
+ * fault resolution.
+ * If called with req UFFD_REQ_GET_SUPPORTED, it returns a bitmask
+ * of modes as in struct uffdio_register. No other arguments are
+ * used.
+ * If called with req UFFD_REQ_FAULT_RESOLVE, it resolves the fault
+ * using the mode specified in the mode argument. The inode, pgoff and
+ * foliop arguments must be set accordingly.
+ */
+ int (*userfaultfd_request)(enum uffd_req req, int mode,
+ struct inode *inode, pgoff_t pgoff,
+ struct folio **foliop);
+#endif
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..1cabb925da0e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -222,7 +222,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
return false;
if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ (!is_vm_hugetlb_page(vma) &&
+ (!vma->vm_ops || !vma->vm_ops->userfaultfd_request ||
+ !(vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL) &
+ UFFDIO_REGISTER_MODE_MINOR))))
return false;
/*
@@ -243,8 +247,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
#endif
/* By default, allow any of anon|shmem|hugetlb */
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
+ return vma_is_anonymous(vma) ||
+ is_vm_hugetlb_page(vma) ||
+ (vma->vm_ops->userfaultfd_request &&
+ vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0, NULL,
+ 0, NULL));
}
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..a5b5c4131dcf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5203,6 +5203,40 @@ static int shmem_error_remove_folio(struct address_space *mapping,
return 0;
}
+#ifdef CONFIG_USERFAULTFD
+static int shmem_userfaultfd_request(enum uffd_req req, int mode,
+ struct inode *inode, pgoff_t pgoff,
+ struct folio **foliop)
+{
+ int ret;
+
+ switch (req) {
+ case UFFD_REQ_GET_SUPPORTED:
+ ret =
+ UFFDIO_REGISTER_MODE_MISSING |
+ UFFDIO_REGISTER_MODE_WP |
+ UFFDIO_REGISTER_MODE_MINOR;
+ break;
+ case UFFD_REQ_FAULT_RESOLVE:
+ ret = shmem_get_folio(inode, pgoff, 0, foliop, SGP_NOALLOC);
+ if (ret == -ENOENT)
+ ret = -EFAULT;
+ if (ret)
+ break;
+ if (!*foliop) {
+ ret = -EFAULT;
+ break;
+ }
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+#endif
+
static const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.dirty_folio = noop_dirty_folio,
@@ -5306,6 +5340,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .userfaultfd_request = shmem_userfaultfd_request,
+#endif
};
static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5315,6 +5352,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .userfaultfd_request = shmem_userfaultfd_request,
+#endif
};
int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..efc150bf5691 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -392,16 +392,18 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
struct page *page;
int ret;
- ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
- /* Our caller expects us to return -EFAULT if we failed to find folio */
- if (ret == -ENOENT)
- ret = -EFAULT;
+ if (!dst_vma->vm_ops || !dst_vma->vm_ops->userfaultfd_request ||
+ !(dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL) &
+ UFFDIO_REGISTER_MODE_MINOR)) {
+ return -EFAULT;
+ }
+
+ ret = dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_FAULT_RESOLVE,
+ UFFDIO_REGISTER_MODE_MINOR,
+ inode, pgoff, &folio);
if (ret)
goto out;
- if (!folio) {
- ret = -EFAULT;
- goto out;
- }
page = folio_file_page(folio, pgoff);
if (PageHWPoison(page)) {
@@ -770,10 +772,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
src_start, len, flags);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
- goto out_unlock;
- if (!vma_is_shmem(dst_vma) &&
- uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ if (!vma_is_anonymous(dst_vma) &&
+ (!dst_vma->vm_ops->userfaultfd_request ||
+ (!dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL))))
goto out_unlock;
while (src_addr < src_start + len) {
>
> Thanks,
>
> --
> Peter Xu
>