Message-ID: <aF6h7rYVnVTMtJ0S@x1.local>
Date: Fri, 27 Jun 2025 09:51:42 -0400
From: Peter Xu <peterx@...hat.com>
To: Nikita Kalyazin <kalyazin@...zon.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	Hugh Dickins <hughd@...gle.com>, Oscar Salvador <osalvador@...e.de>,
	Michal Hocko <mhocko@...e.com>,
	David Hildenbrand <david@...hat.com>,
	Muchun Song <muchun.song@...ux.dev>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Ujwal Kundur <ujwal.kundur@...il.com>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	"Liam R . Howlett" <Liam.Howlett@...cle.com>,
	James Houghton <jthoughton@...gle.com>,
	Mike Rapoport <rppt@...nel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Axel Rasmussen <axelrasmussen@...gle.com>
Subject: Re: [PATCH 0/4] mm/userfaultfd: modulize memory types

On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
> 
> 
> On 25/06/2025 21:17, Peter Xu wrote:
> > On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> > > 
> > > 
> > > On 20/06/2025 20:03, Peter Xu wrote:
> > > > [based on akpm/mm-new]
> > > > 
> > > > This series is an alternative proposal of what Nikita proposed here on the
> > > > initial three patches:
> > > > 
> > > >     https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > > > 
> > > > This is not yet relevant to any guest-memfd support, but paving the way for it.
> > > 
> > > Hi Peter,
> > 
> > Hi, Nikita,
> > 
> > > 
> > > Thanks for posting this.  I confirmed that minor fault handling was working
> > > for guest_memfd based on this series and looked simple (a draft based on
> > > mmap support in guest_memfd v7 [1]):
> > 
> > Thanks for the quick spin, glad to know it works. Some trivial things to
> > mention below..
> 
> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
> works as well:

Appreciated.

While at it, I'll comment quickly below.

> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8c44e4b9f5f8..b5458a22fff4 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file
> *file, pgoff_t index)
> 
>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
> 
>  	filemap_invalidate_lock_shared(inode->i_mapping);
> 
> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
> +	if (!folio && vma && userfaultfd_missing(vma)) {
> +		filemap_invalidate_unlock_shared(inode->i_mapping);
> +		return handle_userfault(vmf, VM_UFFD_MISSING);
> +	}

There's likely a refcount leak when folio != NULL here: the reference
taken by filemap_get_entry() is never dropped before folio gets
reassigned below.
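
For example (untested sketch; it also ignores shadow entries other than
skipping them), dropping that reference before the second lookup could
look like:

	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
	if (!folio && vma && userfaultfd_missing(vma)) {
		filemap_invalidate_unlock_shared(inode->i_mapping);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}
	/* drop the reference taken by filemap_get_entry() before
	 * kvm_gmem_get_folio() does its own lookup below */
	if (folio && !xa_is_value(folio))
		folio_put(folio);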

> +
>  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
>  		int err = PTR_ERR(folio);
> @@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode
> *inode, pgoff_t pgoff,
>  	return 0;
>  }
> 
> +static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
> +			   struct vm_area_struct *dst_vma,
> +			   unsigned long dst_addr,
> +			   unsigned long src_addr,
> +			   uffd_flags_t flags,
> +			   struct folio **foliop)
> +{
> +	struct inode *inode = file_inode(dst_vma->vm_file);
> +	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +	struct folio *folio;
> +	int ret;
> +
> +	folio = kvm_gmem_get_folio(inode, pgoff);
> +	if (IS_ERR(folio)) {
> +		ret = PTR_ERR(folio);
> +		goto out;
> +	}
> +
> +	folio_unlock(folio);
> +
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
> +		void *vaddr = kmap_local_folio(folio, 0);
> +		ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
> +		kunmap_local(vaddr);
> +		if (unlikely(ret)) {
> +			*foliop = folio;
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +	} else {		/* ZEROPAGE */
> +		clear_user_highpage(&folio->page, dst_addr);
> +	}
> +
> +	kvm_gmem_mark_prepared(folio);

Since Fuad's series hasn't landed yet, I'm mostly looking at the current
code base and imagining what might happen.

In general, MISSING trapping for guest-memfd could be slightly trickier.
So far, IIUC, the guest-memfd page cache needs to be populated by a prior
fallocate() syscall, not during a fault.  So I suppose we will need to use
the uptodate bit to mark a folio ready, like what's done here.

If so, we may want to make sure that in the fault path any !uptodate
fault gets trapped as MISSING too, even though that's not strictly a
"cache miss"... slightly confusing, but it sounds necessary.
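
Roughly, as a sketch against the draft above (assuming
kvm_gmem_mark_prepared() is what sets the uptodate bit, and ignoring
refcounting / shadow entries for brevity):

	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
	/* treat "present but not yet prepared" as MISSING too */
	if (vma && userfaultfd_missing(vma) &&
	    (!folio || !folio_test_uptodate(folio))) {
		filemap_invalidate_unlock_shared(inode->i_mapping);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}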

Meanwhile, I'm not 100% sure how this would go when taking CoCo into
account, because CoCo needs to prepare the pages, so marking the folio
uptodate may not be enough?  I don't know the CoCo side well enough to
tell.  Otherwise we'll at least need to restrict MISSING traps to fully
shared guest-memfds.

OTOH, MINOR should be much easier to do for guest-memfd, not only because
the code to support it would be minimal (which is definitely lovely), but
also because monitoring pgtable entries is still a pretty common idea, and
it should logically apply even to CoCo: in fault(), we need to check
whether the guest-memfd folio is "shared" and/or "faultable" first, and
the fault() should already fail if it's a private folio.  Then, if the
folio is visible (aka "faultable") in the HVA namespace, it's legal to
trap a MINOR too.  For !CoCo it will always trap, as it's always
faultable.
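
The shape of the check would be similar to what shmem does today,
something like (only a sketch; for CoCo the shared/faultable check would
need to come first):

	/* folio already in the page cache but not mapped: let userspace
	 * resolve it with UFFDIO_CONTINUE */
	if (vma && userfaultfd_minor(vma)) {
		folio_put(folio);
		filemap_invalidate_unlock_shared(inode->i_mapping);
		return handle_userfault(vmf, VM_UFFD_MINOR);
	}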

MINOR also makes more sense for future 1G postcopy support on top of
gmem, because there it's almost the only way to go.  It looks like we've
made up our minds to reuse hugetlb pages for gmem, which sounds good;
hugetlb pages are allocated at 1G granularity, and we can't easily do 4K
MISSING trapping on a single 1G huge page.  MINOR is simpler but actually
more powerful from that POV.
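
For reference, resolving a MINOR fault from userspace is just a
UFFDIO_CONTINUE over the faulting range once the contents are in place;
hypothetical snippet, with uffd, fault_addr and page_size set up
elsewhere (and the usual <linux/userfaultfd.h>, <sys/ioctl.h>, <err.h>
includes):

	struct uffdio_continue cont = {
		.range = {
			.start	= fault_addr & ~(page_size - 1),
			.len	= page_size,
		},
	};

	/* install the mapping from the folio already in the page cache */
	if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
		err(1, "UFFDIO_CONTINUE");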

To summarize, I think after this series we can do MINOR before MISSING
for guest-memfd, since MINOR already works for you.  We can leave MISSING
until we know how we would use it.

Thanks,

> +
> +	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> +				       &folio->page, true, flags);
> +
> +	if (ret)
> +		folio_put(folio);
> +out:
> +	return ret;
> +}
> +
>  static const vm_uffd_ops kvm_gmem_uffd_ops = {
> -	.uffd_features	= 	VM_UFFD_MINOR,
> -	.uffd_ioctls	= 	BIT(_UFFDIO_CONTINUE),
> +	.uffd_features	= 	VM_UFFD_MISSING | VM_UFFD_MINOR,
> +	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> +				BIT(_UFFDIO_ZEROPAGE) |
> +				BIT(_UFFDIO_CONTINUE),
>  	.uffd_get_folio	=	kvm_gmem_uffd_get_folio,
> +	.uffd_copy	=	kvm_gmem_mfill_atomic_pte,
>  };
>  #endif
> 
> > 
> > > 
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 5abb6d52a375..6ddc73419724 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -5,6 +5,9 @@
> > >   #include <linux/pagemap.h>
> > >   #include <linux/anon_inodes.h>
> > >   #include <linux/set_memory.h>
> > > +#ifdef CONFIG_USERFAULTFD
> > 
> > This ifdef not needed, userfaultfd_k.h has taken care of all cases.
> 
> Good to know, thanks.
> 
> > > +#include <linux/userfaultfd_k.h>
> > > +#endif
> > > 
> > >   #include "kvm_mm.h"
> > > 
> > > @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > >                kvm_gmem_mark_prepared(folio);
> > >        }
> > > 
> > > +#ifdef CONFIG_USERFAULTFD
> > 
> > Same here.  userfaultfd_minor() is always defined.
> 
> Thank you.
> 
> > I'll wait for a few more days for reviewers, and likely send v2 before next
> > week.
> > 
> > Thanks,
> > 
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu

