Message-ID: <2097f155-c459-40e1-93e8-3d501ae66b42@amazon.com>
Date: Fri, 20 Jun 2025 13:00:24 +0100
From: Nikita Kalyazin <kalyazin@...zon.com>
To: Peter Xu <peterx@...hat.com>
CC: <akpm@...ux-foundation.org>, <pbonzini@...hat.com>, <shuah@...nel.org>,
<viro@...iv.linux.org.uk>, <brauner@...nel.org>, <muchun.song@...ux.dev>,
<hughd@...gle.com>, <kvm@...r.kernel.org>, <linux-kselftest@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<linux-fsdevel@...r.kernel.org>, <jack@...e.cz>,
<lorenzo.stoakes@...cle.com>, <Liam.Howlett@...cle.com>, <jannh@...gle.com>,
<ryan.roberts@....com>, <david@...hat.com>, <jthoughton@...gle.com>,
<graf@...zon.de>, <jgowans@...zon.com>, <roypat@...zon.co.uk>,
<derekmn@...zon.com>, <nsaenz@...zon.es>, <xmarcalx@...zon.com>
Subject: Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
On 11/06/2025 13:56, Peter Xu wrote:
> On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
>>
>>
>> On 10/06/2025 23:22, Peter Xu wrote:
>>> On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
>>>> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
>>>> non-huge pages by calling vm_ops->fault(). A new VMF flag,
>>>> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
>>>> handle_userfault().
>>>
>>> It's not clear yet on why this is needed to be generalized out of the blue.
>>>
>>> Some mention of the guest_memfd use case might help other reviewers, or
>>> some mention of the need to introduce userfaultfd support in kernel
>>> modules.
>>
>> Hi Peter,
>>
>> Sounds fair, thank you.
>>
>>>>
>>>> Suggested-by: James Houghton <jthoughton@...gle.com>
>>>> Signed-off-by: Nikita Kalyazin <kalyazin@...zon.com>
>>>> ---
>>>> include/linux/mm_types.h | 4 ++++
>>>> mm/hugetlb.c | 2 +-
>>>> mm/shmem.c | 9 ++++++---
>>>> mm/userfaultfd.c | 37 +++++++++++++++++++++++++++----------
>>>> 4 files changed, 38 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 0234f14f2aa6..2f26ee9742bf 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>>>> * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>>>> * We should only access orig_pte if this flag set.
>>>> * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
>>>> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
>>>> + * minor handler as it is being called by the
>>>> + * userfaultfd code itself.
>>>
>>> We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
>>> as it's not easy to follow without userfault minor context. It might
>>> be better to use a generic term like NO_USERFAULT.
>>
>> Yes, I agree, can name it more generically.
>>
>>> That said, I wonder, if we'll need to add a vm_ops hook in the latter
>>> patch anyway, whether we can also resolve the page faults using that
>>> vm_ops hook instead of reusing fault(). That might be helpful because then
>>> we can avoid this new FAULT_FLAG_* that is of no use to
>>> non-userfault users, and meanwhile we also don't need to hand-cook the vm_fault
>>> struct below just to suit the current fault() interface.
>>
>> I'm not sure I fully understand that. Calling the fault() op helps us reuse the
>> FS specifics when resolving the fault. I get that the new op can imply the
>> userfault flag so the flag doesn't need to be exposed to mm, but doing so
>> will duplicate logic within FSes between this new op and
>> fault(), unless we attempt to factor the common parts out. For example, for
>> shmem_get_folio_gfp(), we would still need a way to suppress the
>> call to handle_userfault() when shmem_get_folio_gfp() is called from the new
>> op. Is that what you're proposing?
>
> Yes, that is what I was proposing. shmem_get_folio_gfp() already has that
> handling: when vmf==NULL, vma==NULL and userfault will be skipped.
>
> So what I was thinking is one vm_ops.userfaultfd_request(req), where req
> can be:
>
> (1) UFFD_REQ_GET_SUPPORTED: this should, for existing RAM-FSes, return
> MISSING/WP/MINOR. Here WP should mean sync-wp tracking; async was
> so far by default supported almost everywhere except
> VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
> for now (even if I think it shouldn't be hard to support the other
> two..).
>
> (2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be well
> defined to suit userfault's needs on fault resolution. It likely
> doesn't need vmf as a parameter, but rather (when anon isn't taken
> into account; after all, anon has vm_ops==NULL..) the inode and
> offsets; perhaps some flag would be needed to distinguish MISSING and
> MINOR faults, for example.
>
> Maybe some more.
>
> I was even thinking whether we could merge hugetlb into the picture too and
> generalize its fault resolution. Hugetlb was always special; maybe this is
> a chance to generalize it too, but it doesn't need to happen in one
> shot even if it could work. We could start with shmem.
>
> So this does sound slightly involved, and I'm not yet 100% sure it
> will work, but likely it will. If you want, I can take a stab at it this week or
> next just to see whether it works in general. I also don't expect this
> to depend on guest-memfd at all - it can stand alone as a refactoring making
> userfault module-ready.
Thanks for explaining that. I played a bit with it myself and it
appears to work for the MISSING mode for both shmem and
guest_memfd. I'm attaching my sketch below; please let me know if this is
how you see it.
I found that the arguments and return values differ significantly
between the two request types, which may be a bit confusing, although we
do not expect many callers of them.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..eb30b23b24d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -603,6 +603,16 @@ struct vm_fault {
*/
};
+#ifdef CONFIG_USERFAULTFD
+/*
+ * Used by userfaultfd_request().
+ */
+enum uffd_req {
+ UFFD_REQ_GET_SUPPORTED, /* query supported userfaultfd modes */
+ UFFD_REQ_FAULT_RESOLVE, /* request fault resolution */
+};
+#endif
+
/*
* These are the virtual MM functions - opening of an area, closing and
* unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -680,6 +690,22 @@ struct vm_operations_struct {
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
unsigned long addr);
+
+#ifdef CONFIG_USERFAULTFD
+ /*
+ * Called by the userfaultfd code to query supported modes or request
+ * fault resolution.
+ * If called with req UFFD_REQ_GET_SUPPORTED, it returns a bitmask
+ * of modes as in struct uffdio_register. No other arguments are
+ * used.
+ * If called with req UFFD_REQ_FAULT_RESOLVE, it resolves the fault
+ * using the mode specified in the mode argument. The inode, pgoff and
+ * foliop arguments must be set accordingly.
+ */
+ int (*userfaultfd_request)(enum uffd_req req, int mode,
+ struct inode *inode, pgoff_t pgoff,
+ struct folio **foliop);
+#endif
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..1cabb925da0e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -222,7 +222,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
return false;
if ((vm_flags & VM_UFFD_MINOR) &&
- (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+ (!is_vm_hugetlb_page(vma) &&
+ (!vma->vm_ops || !vma->vm_ops->userfaultfd_request ||
+ !(vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL) &
+ UFFDIO_REGISTER_MODE_MINOR))))
return false;
/*
@@ -243,8 +247,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
#endif
/* By default, allow any of anon|shmem|hugetlb */
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
+ return vma_is_anonymous(vma) ||
+ is_vm_hugetlb_page(vma) ||
+ (vma->vm_ops->userfaultfd_request &&
+ vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0, NULL,
+ 0, NULL));
}
static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..a5b5c4131dcf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5203,6 +5203,40 @@ static int shmem_error_remove_folio(struct address_space *mapping,
return 0;
}
+#ifdef CONFIG_USERFAULTFD
+static int shmem_userfaultfd_request(enum uffd_req req, int mode,
+ struct inode *inode, pgoff_t pgoff,
+ struct folio **foliop)
+{
+ int ret;
+
+ switch (req) {
+ case UFFD_REQ_GET_SUPPORTED:
+ ret =
+ UFFDIO_REGISTER_MODE_MISSING |
+ UFFDIO_REGISTER_MODE_WP |
+ UFFDIO_REGISTER_MODE_MINOR;
+ break;
+ case UFFD_REQ_FAULT_RESOLVE:
+ ret = shmem_get_folio(inode, pgoff, 0, foliop, SGP_NOALLOC);
+ if (ret == -ENOENT)
+ ret = -EFAULT;
+ if (ret)
+ break;
+ if (!*foliop) {
+ ret = -EFAULT;
+ break;
+ }
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+#endif
+
static const struct address_space_operations shmem_aops = {
.writepage = shmem_writepage,
.dirty_folio = noop_dirty_folio,
@@ -5306,6 +5340,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .userfaultfd_request = shmem_userfaultfd_request,
+#endif
};
static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5315,6 +5352,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_USERFAULTFD
+ .userfaultfd_request = shmem_userfaultfd_request,
+#endif
};
int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..efc150bf5691 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -392,16 +392,18 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
struct page *page;
int ret;
- ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
- /* Our caller expects us to return -EFAULT if we failed to find folio */
- if (ret == -ENOENT)
- ret = -EFAULT;
+ if (!dst_vma->vm_ops || !dst_vma->vm_ops->userfaultfd_request ||
+ !(dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL) &
+ UFFDIO_REGISTER_MODE_MINOR)) {
+ return -EFAULT;
+ }
+
+ ret = dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_FAULT_RESOLVE,
+ UFFDIO_REGISTER_MODE_MINOR,
+ inode, pgoff, &folio);
if (ret)
goto out;
- if (!folio) {
- ret = -EFAULT;
- goto out;
- }
page = folio_file_page(folio, pgoff);
if (PageHWPoison(page)) {
@@ -770,10 +772,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
src_start, len, flags);
- if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
- goto out_unlock;
- if (!vma_is_shmem(dst_vma) &&
- uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ if (!vma_is_anonymous(dst_vma) &&
+ (!dst_vma->vm_ops->userfaultfd_request ||
+ (!dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+ NULL, 0, NULL))))
goto out_unlock;
while (src_addr < src_start + len) {
>
> Thanks,
>
> --
> Peter Xu
>