[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADrL8HW2bFsFqifTytY8+Fy1nH-arFZ2JqAKi44_4nMtkPksDA@mail.gmail.com>
Date: Wed, 2 Apr 2025 12:04:48 -0700
From: James Houghton <jthoughton@...gle.com>
To: Nikita Kalyazin <kalyazin@...zon.com>
Cc: akpm@...ux-foundation.org, pbonzini@...hat.com, shuah@...nel.org,
kvm@...r.kernel.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, lorenzo.stoakes@...cle.com,
david@...hat.com, ryan.roberts@....com, quic_eberman@...cinc.com,
peterx@...hat.com, graf@...zon.de, jgowans@...zon.com, roypat@...zon.co.uk,
derekmn@...zon.com, nsaenz@...zon.es, xmarcalx@...zon.com
Subject: Re: [PATCH v2 1/5] mm: userfaultfd: generic continue for non hugetlbfs
On Wed, Apr 2, 2025 at 9:07 AM Nikita Kalyazin <kalyazin@...zon.com> wrote:
>
> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
> non-huge pages by calling vm_ops->fault(). A new VMF flag,
> FAULT_FLAG_NO_USERFAULT_MINOR, is introduced to avoid recursive call to
> handle_userfault().
>
> Signed-off-by: Nikita Kalyazin <kalyazin@...zon.com>
> ---
> include/linux/mm_types.h | 3 +++
> mm/hugetlb.c | 2 +-
> mm/shmem.c | 3 ++-
> mm/userfaultfd.c | 25 ++++++++++++++++++-------
> 4 files changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6..91a00f2cd565 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1429,6 +1429,8 @@ enum tlb_flush_reason {
> * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> * We should only access orig_pte if this flag set.
> * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
> + * @FAULT_FLAG_NO_USERFAULT_MINOR: The fault handler must not call userfaultfd
> + * minor handler.
Perhaps instead a flag that says to avoid the userfaultfd minor fault
handler, maybe we should have a flag to indicate that vm_ops->fault()
has been called by UFFDIO_CONTINUE. See below.
> *
> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
> * whether we would allow page faults to retry by specifying these two
> @@ -1467,6 +1469,7 @@ enum fault_flag {
> FAULT_FLAG_UNSHARE = 1 << 10,
> FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
> FAULT_FLAG_VMA_LOCK = 1 << 12,
> + FAULT_FLAG_NO_USERFAULT_MINOR = 1 << 13,
> };
>
> typedef unsigned int __bitwise zap_flags_t;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 97930d44d460..ba90d48144fc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6228,7 +6228,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> }
>
> /* Check for page in userfault range. */
> - if (userfaultfd_minor(vma)) {
> + if (userfaultfd_minor(vma) && !(vmf->flags & FAULT_FLAG_NO_USERFAULT_MINOR)) {
> folio_unlock(folio);
> folio_put(folio);
> /* See comment in userfaultfd_missing() block above */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1ede0800e846..5e1911e39dec 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2467,7 +2467,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> fault_mm = vma ? vma->vm_mm : NULL;
>
> folio = filemap_get_entry(inode->i_mapping, index);
> - if (folio && vma && userfaultfd_minor(vma)) {
> + if (folio && vma && userfaultfd_minor(vma) &&
> + !(vmf->flags & FAULT_FLAG_NO_USERFAULT_MINOR)) {
> if (!xa_is_value(folio))
> folio_put(folio);
> *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index d06453fa8aba..68a995216789 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -386,24 +386,35 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
> unsigned long dst_addr,
> uffd_flags_t flags)
> {
> - struct inode *inode = file_inode(dst_vma->vm_file);
> - pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> struct folio *folio;
> struct page *page;
> int ret;
> + struct vm_fault vmf = {
> + .vma = dst_vma,
> + .address = dst_addr,
> + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE |
> + FAULT_FLAG_NO_USERFAULT_MINOR,
> + .pte = NULL,
> + .page = NULL,
> + .pgoff = linear_page_index(dst_vma, dst_addr),
> + };
> +
> + if (!dst_vma->vm_ops || !dst_vma->vm_ops->fault)
> + return -EINVAL;
>
> - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> - /* Our caller expects us to return -EFAULT if we failed to find folio */
> - if (ret == -ENOENT)
> + ret = dst_vma->vm_ops->fault(&vmf);
shmem_get_folio() was being called with SGP_NOALLOC, and now it is
being called with SGP_CACHE (by shmem_fault()). This will result in a
UAPI change: UFFDIO_CONTINUE for a VA without a page in the page cache
should result in EFAULT, but now the page will be allocated.
SGP_NOALLOC was carefully chosen[1], so I think a better way to do
this will be to:
1. Have a FAULT_FLAG_USERFAULT_CONTINUE (or something)
2. In shmem_fault(), if FAULT_FLAG_USERFAULT_CONTINUE, use SGP_NOALLOC
instead of SGP_CACHE (and make sure not to drop into
handle_userfault(), of course)
[1]: https://lore.kernel.org/linux-mm/20220610173812.1768919-1-axelrasmussen@google.com/
> + if (ret & VM_FAULT_ERROR) {
> ret = -EFAULT;
> - if (ret)
> goto out;
> + }
> +
> + page = vmf.page;
> + folio = page_folio(page);
> if (!folio) {
What if ret == VM_FAULT_RETRY? I think we should retry instead instead
of returning -EFAULT.
And I'm not sure how VM_FAULT_NOPAGE should be handled, like if we
need special logic for it or not.
> ret = -EFAULT;
> goto out;
> }
>
> - page = folio_file_page(folio, pgoff);
> if (PageHWPoison(page)) {
> ret = -EIO;
> goto out_release;
> --
> 2.47.1
>
Powered by blists - more mailing lists