Message-ID: <aDfT35EsYP/Byf7Z@yzhao56-desk.sh.intel.com>
Date: Thu, 29 May 2025 11:26:23 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: <kvm@...r.kernel.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <x86@...nel.org>,
	<linux-fsdevel@...r.kernel.org>, <aik@....com>, <ajones@...tanamicro.com>,
	<akpm@...ux-foundation.org>, <amoorthy@...gle.com>,
	<anthony.yznaga@...cle.com>, <anup@...infault.org>, <aou@...s.berkeley.edu>,
	<bfoster@...hat.com>, <binbin.wu@...ux.intel.com>, <brauner@...nel.org>,
	<catalin.marinas@....com>, <chao.p.peng@...el.com>, <chenhuacai@...nel.org>,
	<dave.hansen@...el.com>, <david@...hat.com>, <dmatlack@...gle.com>,
	<dwmw@...zon.co.uk>, <erdemaktas@...gle.com>, <fan.du@...el.com>,
	<fvdl@...gle.com>, <graf@...zon.com>, <haibo1.xu@...el.com>,
	<hch@...radead.org>, <hughd@...gle.com>, <ira.weiny@...el.com>,
	<isaku.yamahata@...el.com>, <jack@...e.cz>, <james.morse@....com>,
	<jarkko@...nel.org>, <jgg@...pe.ca>, <jgowans@...zon.com>,
	<jhubbard@...dia.com>, <jroedel@...e.de>, <jthoughton@...gle.com>,
	<jun.miao@...el.com>, <kai.huang@...el.com>, <keirf@...gle.com>,
	<kent.overstreet@...ux.dev>, <kirill.shutemov@...el.com>,
	<liam.merwick@...cle.com>, <maciej.wieczor-retman@...el.com>,
	<mail@...iej.szmigiero.name>, <maz@...nel.org>, <mic@...ikod.net>,
	<michael.roth@....com>, <mpe@...erman.id.au>, <muchun.song@...ux.dev>,
	<nikunj@....com>, <nsaenz@...zon.es>, <oliver.upton@...ux.dev>,
	<palmer@...belt.com>, <pankaj.gupta@....com>, <paul.walmsley@...ive.com>,
	<pbonzini@...hat.com>, <pdurrant@...zon.co.uk>, <peterx@...hat.com>,
	<pgonda@...gle.com>, <pvorel@...e.cz>, <qperret@...gle.com>,
	<quic_cvanscha@...cinc.com>, <quic_eberman@...cinc.com>,
	<quic_mnalajal@...cinc.com>, <quic_pderrin@...cinc.com>,
	<quic_pheragu@...cinc.com>, <quic_svaddagi@...cinc.com>,
	<quic_tsoni@...cinc.com>, <richard.weiyang@...il.com>,
	<rick.p.edgecombe@...el.com>, <rientjes@...gle.com>, <roypat@...zon.co.uk>,
	<rppt@...nel.org>, <seanjc@...gle.com>, <shuah@...nel.org>,
	<steven.price@....com>, <steven.sistare@...cle.com>,
	<suzuki.poulose@....com>, <tabba@...gle.com>, <thomas.lendacky@....com>,
	<usama.arif@...edance.com>, <vannapurve@...gle.com>, <vbabka@...e.cz>,
	<viro@...iv.linux.org.uk>, <vkuznets@...hat.com>, <wei.w.wang@...el.com>,
	<will@...nel.org>, <willy@...radead.org>, <xiaoyao.li@...el.com>,
	<yilun.xu@...el.com>, <yuzenghui@...wei.com>, <zhiquan1.li@...el.com>
Subject: Re: [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on
 fallocate(PUNCH_HOLE)

On Wed, May 28, 2025 at 09:39:35AM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@...el.com> writes:
> 
> > On Wed, May 14, 2025 at 04:42:18PM -0700, Ackerley Tng wrote:
> >> Merge and truncate on fallocate(PUNCH_HOLE), but if the file is being
> >> closed, defer merging to folio_put() callback.
> >> 
> >> Change-Id: Iae26987756e70c83f3b121edbc0ed0bc105eec0d
> >> Signed-off-by: Ackerley Tng <ackerleytng@...gle.com>
> >> ---
> >>  virt/kvm/guest_memfd.c | 76 +++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 68 insertions(+), 8 deletions(-)
> >> 
> >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> index cb426c1dfef8..04b1513c2998 100644
> >> --- a/virt/kvm/guest_memfd.c
> >> +++ b/virt/kvm/guest_memfd.c
> >> @@ -859,6 +859,35 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
> >> +					   size_t nr_pages)
> >> +{
> >> +	struct folio *f;
> >> +	pgoff_t unused;
> >> +	long num_freed;
> >> +
> >> +	unmap_mapping_pages(inode->i_mapping, index, nr_pages, false);
> >> +
> >> +	if (!kvm_gmem_has_safe_refcount(inode->i_mapping, index, nr_pages, &unused))
> 
> Yan, thank you for your reviews!
> 
> > Why is kvm_gmem_has_safe_refcount() checked here, but not in
> > kvm_gmem_zero_range() within kvm_gmem_truncate_inode_range() in patch 33?
> >
> 
> The contract for guest_memfd with HugeTLB pages is that if holes are
> punched in any ranges less than a full huge page, no pages are removed
> from the filemap. Those ranges are only zeroed.
> 
> In kvm_gmem_zero_range(), we never remove any folios, and so there is no
> need to merge. If there's no need to merge, then we don't need to check
> for a safe refcount, and can just proceed to zero.
However, if there are still extra refcounts on a shared page, its content
will still be zeroed out.
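Just to make the kind of check I mean concrete (this is only an illustration
reusing the kvm_gmem_has_safe_refcount() call from the hunk above, with the
start/end parameter names guessed from the call site, not a concrete
proposal), kvm_gmem_zero_range() could do something like:

	pgoff_t unused;

	if (!kvm_gmem_has_safe_refcount(mapping, start, end - start, &unused))
		return;	/* or propagate -EAGAIN like the hole-punch path */

before zeroing the range.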

> 
> kvm_gmem_merge_truncate_indices() is only used during hole punching and
> not when the file is closed. Hole punch vs file closure is checked using
> mapping_exiting(inode->i_mapping).
> 
> During a hole punch, we will only allow truncation if there are no
> unexpected refcounts on any subpages, hence this
> kvm_gmem_has_safe_refcount() check.
Hmm, I couldn't find a similar refcount check in hugetlbfs_punch_hole().
Did I overlook it?

So, why does guest_memfd require this check when punching a hole?

> >> +		return -EAGAIN;
> >> +
> >
> > Rather than merging the folios, could we simply call kvm_gmem_truncate_indices()
> > instead?
> >
> > num_freed = kvm_gmem_truncate_indices(inode->i_mapping, index, nr_pages);
> > return num_freed;
> >
> 
> We could do this too, but then that would be deferring the huge page
> merging to the folio_put() callback and eventually the kernel worker
> thread.
With deferring the huge page merging to folio_put(), one benefit is that
__kvm_gmem_filemap_add_folio() can be skipped for the merged folio. That
function can fail, and it is unnecessary for hole punching anyway, since the
folio is immediately removed from the filemap in truncate_inode_folio().


> My goal here is to try not to defer merging and freeing as much as
> possible so that most of the page/memory operations are
> synchronous, because synchronous operations are more predictable.
> 
> As an example of improving predictability, in one of the selftests, I do
> a hole punch and then try to allocate again. Because the merging and
> freeing of the HugeTLB page sometimes takes too long, the allocation
> sometimes fails: the guest_memfd's subpool hadn't yet received the freed
> page back. With a synchronous truncation, the truncation may take
> longer, but the selftest predictably passes.
Maybe check whether guestmem_hugetlb_handle_folio_put() is invoked in
interrupt context, and, if not, invoke guestmem_hugetlb_cleanup_folio()
synchronously?
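Something like the below is what I had in mind (just a rough sketch;
guestmem_hugetlb_handle_folio_put() and guestmem_hugetlb_cleanup_folio() are
names from this series, while guestmem_hugetlb_defer_cleanup() is only a
placeholder for the existing worker-based deferral path):

	static void guestmem_hugetlb_handle_folio_put(struct folio *folio)
	{
		if (in_interrupt()) {
			/* Can't sleep here, so keep deferring to the worker. */
			guestmem_hugetlb_defer_cleanup(folio);
			return;
		}

		/* Process context: merge and free the huge page synchronously. */
		guestmem_hugetlb_cleanup_folio(folio);
	}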


> >> +	f = filemap_get_folio(inode->i_mapping, index);
> >> +	if (IS_ERR(f))
> >> +		return 0;
> >> +
> >> +	/* Leave just filemap's refcounts on the folio. */
> >> +	folio_put(f);
> >> +
> >> +	WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> >> +
> >> +	num_freed = folio_nr_pages(f);
> >> +	folio_lock(f);
> >> +	truncate_inode_folio(inode->i_mapping, f);
> >> +	folio_unlock(f);
> >> +
> >> +	return num_freed;
> >> +}
> >> +
> >>  #else
> >>  
> >>  static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> >> @@ -874,6 +903,12 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >>  	return 0;
> >>  }
> >>  
> >> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
> >> +					   size_t nr_pages)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >>  #endif
> >>  
> >>  #else
> >> @@ -1182,8 +1217,10 @@ static long kvm_gmem_truncate_indices(struct address_space *mapping,
> >>   *
> >>   * Removes folios beginning @index for @nr_pages from filemap in @inode, updates
> >>   * inode metadata.
> >> + *
> >> + * Return: 0 on success and negative error otherwise.
> >>   */
> >> -static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> >> +static long kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> >>  						  pgoff_t index,
> >>  						  size_t nr_pages)
> >>  {
> >> @@ -1191,19 +1228,34 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> >>  	long num_freed;
> >>  	pgoff_t idx;
> >>  	void *priv;
> >> +	long ret;
> >>  
> >>  	priv = kvm_gmem_allocator_private(inode);
> >>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >>  
> >> +	ret = 0;
> >>  	num_freed = 0;
> >>  	for (idx = index; idx < index + nr_pages; idx += nr_per_huge_page) {
> >> -		num_freed += kvm_gmem_truncate_indices(
> >> -			inode->i_mapping, idx, nr_per_huge_page);
> >> +		if (mapping_exiting(inode->i_mapping) ||
> >> +		    !kvm_gmem_has_some_shared(inode, idx, nr_per_huge_page)) {
> >> +			num_freed += kvm_gmem_truncate_indices(
> >> +				inode->i_mapping, idx, nr_per_huge_page);
> >> +		} else {
> >> +			ret = kvm_gmem_merge_truncate_indices(inode, idx,
> >> +							      nr_per_huge_page);
> >> +			if (ret < 0)
> >> +				break;
> >> +
> >> +			num_freed += ret;
> >> +			ret = 0;
> >> +		}
> >>  	}
> >>  
> >>  	spin_lock(&inode->i_lock);
> >>  	inode->i_blocks -= (num_freed << PAGE_SHIFT) / 512;
> >>  	spin_unlock(&inode->i_lock);
> >> +
> >> +	return ret;
> >>  }
> >>  
> >>  /**
> >> @@ -1252,8 +1304,10 @@ static void kvm_gmem_zero_range(struct address_space *mapping,
> >>   *
> >>   * Removes full (huge)pages from the filemap and zeroing incomplete
> >>   * (huge)pages. The pages in the range may be split.
> >> + *
> >> + * Return: 0 on success and negative error otherwise.
> >>   */
> >> -static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >> +static long kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  					  loff_t lend)
> >>  {
> >>  	pgoff_t full_hpage_start;
> >> @@ -1263,6 +1317,7 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  	pgoff_t start;
> >>  	pgoff_t end;
> >>  	void *priv;
> >> +	long ret;
> >>  
> >>  	priv = kvm_gmem_allocator_private(inode);
> >>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> @@ -1279,10 +1334,11 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
> >>  	}
> >>  
> >> +	ret = 0;
> >>  	if (full_hpage_end > full_hpage_start) {
> >>  		nr_pages = full_hpage_end - full_hpage_start;
> >> -		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
> >> -						      nr_pages);
> >> +		ret = kvm_gmem_truncate_inode_aligned_pages(
> >> +			inode, full_hpage_start, nr_pages);
> >>  	}
> >>  
> >>  	if (end > full_hpage_end && end > full_hpage_start) {
> >> @@ -1290,6 +1346,8 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  
> >>  		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
> >>  	}
> >> +
> >> +	return ret;
> >>  }
> >>  
> >>  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >> @@ -1298,6 +1356,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >>  	pgoff_t start = offset >> PAGE_SHIFT;
> >>  	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> >>  	struct kvm_gmem *gmem;
> >> +	long ret;
> >>  
> >>  	/*
> >>  	 * Bindings must be stable across invalidation to ensure the start+end
> >> @@ -1308,8 +1367,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >>  	list_for_each_entry(gmem, gmem_list, entry)
> >>  		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> >>  
> >> +	ret = 0;
> >>  	if (kvm_gmem_has_custom_allocator(inode)) {
> >> -		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> >> +		ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> >>  	} else {
> >>  		/* Page size is PAGE_SIZE, so use optimized truncation function. */
> >>  		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> >> @@ -1320,7 +1380,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >>  
> >>  	filemap_invalidate_unlock(inode->i_mapping);
> >>  
> >> -	return 0;
> >> +	return ret;
> >>  }
> >>  
> >>  static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> >> -- 
> >> 2.49.0.1045.g170613ef41-goog
> >> 
> 
