Message-ID: <diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com>
Date: Thu, 05 Jun 2025 12:10:08 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
x86@...nel.org, linux-fsdevel@...r.kernel.org, aik@....com,
ajones@...tanamicro.com, akpm@...ux-foundation.org, amoorthy@...gle.com,
anthony.yznaga@...cle.com, anup@...infault.org, aou@...s.berkeley.edu,
bfoster@...hat.com, binbin.wu@...ux.intel.com, brauner@...nel.org,
catalin.marinas@....com, chao.p.peng@...el.com, chenhuacai@...nel.org,
dave.hansen@...el.com, david@...hat.com, dmatlack@...gle.com,
dwmw@...zon.co.uk, erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com,
graf@...zon.com, haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com,
ira.weiny@...el.com, isaku.yamahata@...el.com, jack@...e.cz,
james.morse@....com, jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com,
jhubbard@...dia.com, jroedel@...e.de, jthoughton@...gle.com,
jun.miao@...el.com, kai.huang@...el.com, keirf@...gle.com,
kent.overstreet@...ux.dev, kirill.shutemov@...el.com, liam.merwick@...cle.com,
maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org,
mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au,
muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es,
oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com,
paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk,
peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com,
quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com,
rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk,
rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com,
steven.sistare@...cle.com, suzuki.poulose@....com, tabba@...gle.com,
thomas.lendacky@....com, usama.arif@...edance.com, vannapurve@...gle.com,
vbabka@...e.cz, viro@...iv.linux.org.uk, vkuznets@...hat.com,
wei.w.wang@...el.com, will@...nel.org, willy@...radead.org,
xiaoyao.li@...el.com, yilun.xu@...el.com, yuzenghui@...wei.com,
zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for
guest_memfd use
Yan Zhao <yan.y.zhao@...el.com> writes:
> On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:
>> +static int kvm_gmem_convert_execute_work(struct inode *inode,
>> + struct conversion_work *work,
>> + bool to_shared)
>> +{
>> + enum shareability m;
>> + int ret;
>> +
>> + m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
>> + ret = kvm_gmem_shareability_apply(inode, work, m);
>> + if (ret)
>> + return ret;
>> + /*
>> + * Apply shareability first so split/merge can operate on new
>> + * shareability state.
>> + */
>> + ret = kvm_gmem_restructure_folios_in_range(
>> + inode, work->start, work->nr_pages, to_shared);
>> +
>> + return ret;
>> +}
>> +
Hi Yan,
Thanks for your thorough reviews and your alternative suggestion in the
other discussion at [1]! I'll try to bring the conversion-related parts
of that discussion over here.
>> static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>> size_t nr_pages, bool shared,
>> pgoff_t *error_index)
The guiding principles I was using for the conversion ioctls are:
* Have the shareability updates and any necessary page restructuring
  (aka splitting/merging) either fully complete or not done at all by
  the time the conversion ioctl returns.
* Any unmapping (from host or guest page tables) will not be re-mapped
  on errors.
* Rollback undoes changes if conversion failed, and in those cases any
  errors are turned into WARNings.
The rationale is that we want page sizes to be in sync with shareability
so that any faults after the (successful or failed) conversion will not
wrongly map in a larger page than allowed and cause a host crash.
We considered 3 places where the memory can be mapped for conversions:
1. Host page tables
2. Guest page tables
3. IOMMU page tables
Unmapping from host page tables is the simplest case. We unmap any
shared ranges from the host page tables. Any accesses after the failed
conversion would just fault the memory back in and proceed as usual.
guest_memfd memory is not unmapped from IOMMUs on conversion. This case
is handled because IOMMU mappings hold refcounts: after unmapping from
the host, we check for unexpected refcounts and fail the conversion if
any are found.
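A minimal sketch of that check (the helper name is illustrative and not
from this series), assuming the filemap holds one refcount per page in
the folio, as the quoted filemap_remove_folio_for_restructuring() below
also expects:

    /*
     * Illustrative only: after unmapping from the host, the filemap
     * should hold the only remaining references (one per page in the
     * folio). Anything above that, e.g. an IOMMU pin, means the
     * conversion has to fail.
     */
    static bool gmem_folio_has_unexpected_refcounts(struct folio *folio)
    {
            return folio_ref_count(folio) != folio_nr_pages(folio);
    }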
We also unmap from guest page tables. Considering failed conversions, if
the pages are shared, we're good since the next time the guest accesses
the page, the page will be faulted in as before.
If the pages are private, on the next guest access, the pages will be
faulted in again as well. This is fine for software-protected VMs IIUC.
For TDX (and SNP) IIUC the memory would have been cleared, and the
memory would also need to be re-accepted. I was thinking that this is by
design, since when a TDX guest requests a conversion it knows that the
contents are not to be used again.
The userspace VMM is obligated to keep trying to convert, and if it
gives up, it should inform the guest that the conversion failed. The
guest should handle conversion failures too and not assume that
conversions always succeed.
Putting TDX aside for a moment, so far, there are a few ways this
conversion could fail:
a. Unexpected refcounts. Userspace should clear up the unexpected
refcounts and report failure to the guest if it can't for whatever
reason.
b. ENOMEM because (i) we ran out of memory updating the shareability
   maple_tree or (ii) splitting involves allocating more memory for
   struct pages and we ran out of memory there. In this case the
   userspace VMM gets -ENOMEM and can make more memory available and
   then retry, or if it can't, also report failure to the guest (a
   rough userspace-side sketch follows this list).
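As a rough userspace-side sketch of that policy (everything below is
hypothetical: the ioctl name, the range struct and the helpers; the
real interface is whatever this series exposes):

    /* Hypothetical sketch only; the names do not come from this series. */
    int convert_to_shared_or_report(int gmem_fd, struct gmem_convert_range *range)
    {
            for (;;) {
                    if (!ioctl(gmem_fd, GMEM_CONVERT_TO_SHARED, range))
                            return 0;

                    if (errno == ENOMEM && try_free_host_memory())
                            continue;       /* made memory available, retry */

                    /* Give up: let the guest handle the failed conversion. */
                    report_conversion_failure_to_guest(range);
                    return -1;
            }
    }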
TDX introduces TDX-specific conversion failures (see discussion at
[1]), which this series doesn't handle, but I think we still have a line
of sight to handle new errors.
In the other thread [1], I was proposing to have guest_memfd decide what
to do on errors, but I think that might be baking more TDX-specific
details into guest_memfd/KVM, and perhaps this is better:
We could return the errors to userspace and let userspace determine what
to do. For retryable errors (as determined by userspace), it should do
what it needs to do, and retry. For errors like TDX being unable to
reclaim the memory, it could tell guest_memfd to leak that memory.
If userspace gives up, it should report conversion failure to the guest
if userspace thinks the guest can continue (to a clean shutdown or
otherwise). If something terrible happened during conversion, then
userspace might have to exit itself or shutdown the host.
In [2], for TDX-specific conversion failures, you proposed prepping to
eliminate errors and exiting early on failure, then actually
unmapping. I think that could work too.
I'm a little concerned that prepping could get complicated, since the
nature of the conversion depends on the current shareability state, and
there's a lot to prepare: everything from counting the memory required
for maple_tree allocation (and merging ranges in the maple_tree) to
counting the number of pages required for undoing the vmemmap
optimization in the case of splitting...
And even after doing all the prep to eliminate errors, the unmapping
could fail in TDX-specific cases anyway, which still needs to be
handled.
Hence I'm hoping you'll consider letting TDX-specific failures be
built in and handled alongside other failures by getting help from the
userspace VMM, and in the worst case letting the guest know the
conversion failed.
I also appreciate comments or suggestions from anyone else!
[1] https://lore.kernel.org/all/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com/
>> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>>
>> list_for_each_entry(work, &work_list, list) {
>> rollback_stop_item = work;
>> - ret = kvm_gmem_shareability_apply(inode, work, m);
>> +
>> + ret = kvm_gmem_convert_execute_work(inode, work, shared);
>> if (ret)
>> break;
>> }
>>
>> if (ret) {
>> - m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>> list_for_each_entry(work, &work_list, list) {
>> + int r;
>> +
>> + r = kvm_gmem_convert_execute_work(inode, work, !shared);
>> + WARN_ON(r);
>> +
>> if (work == rollback_stop_item)
>> break;
>> -
>> - WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
> Could kvm_gmem_shareability_apply() fail here?
>
Yes, it could. If shareability cannot be updated, then we probably ran
out of memory. The userspace VMM will probably get -ENOMEM set on some
earlier ret and should handle that accordingly.
On -ENOMEM in a rollback, the host is in a very tough spot anyway, and a
clean guest shutdown may be the only way out, hence this is a WARN and
not returned to userspace.
>> }
>> }
>>
>> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
>> return ret;
>> }
>>
>> +#ifdef CONFIG_KVM_GMEM_HUGETLB
>> +
>> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
>> +{
>> + struct address_space *mapping = folio->mapping;
>> +
>> + spin_lock(&mapping->host->i_lock);
>> + xa_lock_irq(&mapping->i_pages);
>> +
>> + __filemap_remove_folio(folio, NULL);
>> +
>> + xa_unlock_irq(&mapping->i_pages);
>> + spin_unlock(&mapping->host->i_lock);
>> +}
>> +
>> +/**
>> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
>> + * split/merge.
>> + *
>> + * @folio: the folio to be removed.
>> + *
>> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
>> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
>> + *
>> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
>> + * freeze these refcounts away so that no other users will interfere
>> + * with restructuring.
>> + */
>> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
>> +{
>> + int filemap_refcount;
>> +
>> + filemap_refcount = folio_nr_pages(folio);
>> + while (!folio_ref_freeze(folio, filemap_refcount)) {
>> + /*
>> + * At this point only filemap refcounts are expected, hence okay
>> + * to spin until speculative refcounts go away.
>> + */
>> + WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
>> + }
>> +
>> + folio_lock(folio);
>> + __filemap_remove_folio_for_restructuring(folio);
>> + folio_unlock(folio);
>> +}
>> +
>> +/**
>> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
>> + *
>> + * @inode: inode containing the folio.
>> + * @folio: folio to be split.
>> + *
>> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
>> + * and add back the split folios.
>> + *
>> + * Context: Expects that before this call, folio's refcount is just the
>> + * filemap's refcounts. After this function returns, the split folios'
>> + * refcounts will also be filemap's refcounts.
>> + * Return: 0 on success or negative error otherwise.
>> + */
>> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
>> +{
>> + size_t orig_nr_pages;
>> + pgoff_t orig_index;
>> + size_t i, j;
>> + int ret;
>> +
>> + orig_nr_pages = folio_nr_pages(folio);
>> + if (orig_nr_pages == 1)
>> + return 0;
>> +
>> + orig_index = folio->index;
>> +
>> + filemap_remove_folio_for_restructuring(folio);
>> +
>> + ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
>> + if (ret)
>> + goto err;
>> +
>> + for (i = 0; i < orig_nr_pages; ++i) {
>> + struct folio *f = page_folio(folio_page(folio, i));
>> +
>> + ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
>> + orig_index + i);
> Why does the failure of __kvm_gmem_filemap_add_folio() here lead to rollback,
> while the failure of the one under rollback only triggers WARN_ON()?
>
Mostly because I don't really have a choice on rollback. On rollback we
try to restore the merged folio back into the filemap, and if we can't,
the host is probably in rather bad shape in terms of memory
availability and there may not be many options left for the userspace VMM.
Perhaps the different possible errors from
__kvm_gmem_filemap_add_folio() in the two places should be handled
differently. Do you have any suggestions on that?
>> + if (ret)
>> + goto rollback;
>> + }
>> +
>> + return ret;
>> +
>> +rollback:
>> + for (j = 0; j < i; ++j) {
>> + struct folio *f = page_folio(folio_page(folio, j));
>> +
>> + filemap_remove_folio_for_restructuring(f);
>> + }
>> +
>> + kvm_gmem_allocator_ops(inode)->merge_folio(folio);
>> +err:
>> + WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
>> +
>> + return ret;
>> +}
>> +
>> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> + struct folio *folio)
>> +{
>> + size_t to_nr_pages;
>> + void *priv;
>> +
>> + if (!kvm_gmem_has_custom_allocator(inode))
>> + return 0;
>> +
>> + priv = kvm_gmem_allocator_private(inode);
>> + to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
>> +
>> + if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
>> + return kvm_gmem_split_folio_in_filemap(inode, folio);
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
>> + * @inode.
>> + *
>> + * @inode: inode containing the folio.
>> + * @first_folio: first folio among folios to be merged.
>> + *
>> + * Will clean up subfolios from filemap and add back the merged folio.
>> + *
>> + * Context: Expects that before this call, all subfolios only have filemap
>> + * refcounts. After this function returns, the merged folio will only
>> + * have filemap refcounts.
>> + * Return: 0 on success or negative error otherwise.
>> + */
>> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
>> + struct folio *first_folio)
>> +{
>> + size_t to_nr_pages;
>> + pgoff_t index;
>> + void *priv;
>> + size_t i;
>> + int ret;
>> +
>> + index = first_folio->index;
>> +
>> + priv = kvm_gmem_allocator_private(inode);
>> + to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> + if (folio_nr_pages(first_folio) == to_nr_pages)
>> + return 0;
>> +
>> + for (i = 0; i < to_nr_pages; ++i) {
>> + struct folio *f = page_folio(folio_page(first_folio, i));
>> +
>> + filemap_remove_folio_for_restructuring(f);
>> + }
>> +
>> + kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
>> +
>> + ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
>> + if (ret)
>> + goto err_split;
>> +
>> + return ret;
>> +
>> +err_split:
>> + WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
> guestmem_hugetlb_split_folio() is possible to fail. e.g.
> After the stash is freed by guestmem_hugetlb_unstash_free_metadata() in
> guestmem_hugetlb_merge_folio(), it's possible to get -ENOMEM for the stash
> allocation in guestmem_hugetlb_stash_metadata() in
> guestmem_hugetlb_split_folio().
>
>
Yes. This is also on the error path. In line with all the other error
and rollback paths, I don't really have any other options at this
point: on error we probably ran out of memory, so I try my best to
restore the original state but give up with a WARN otherwise.
>> + for (i = 0; i < to_nr_pages; ++i) {
>> + struct folio *f = page_folio(folio_page(first_folio, i));
>> +
>> + WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
>> + struct folio *first_folio)
>> +{
>> + size_t to_nr_pages;
>> + void *priv;
>> +
>> + priv = kvm_gmem_allocator_private(inode);
>> + to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> + if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
>> + return 0;
>> +
>> + return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
>> +}
>> +
>> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>> + pgoff_t start, size_t nr_pages,
>> + bool is_split_operation)
>> +{
>> + size_t to_nr_pages;
>> + pgoff_t index;
>> + pgoff_t end;
>> + void *priv;
>> + int ret;
>> +
>> + if (!kvm_gmem_has_custom_allocator(inode))
>> + return 0;
>> +
>> + end = start + nr_pages;
>> +
>> + /* Round to allocator page size, to check all (huge) pages in range. */
>> + priv = kvm_gmem_allocator_private(inode);
>> + to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> + start = round_down(start, to_nr_pages);
>> + end = round_up(end, to_nr_pages);
>> +
>> + for (index = start; index < end; index += to_nr_pages) {
>> + struct folio *f;
>> +
>> + f = filemap_get_folio(inode->i_mapping, index);
>> + if (IS_ERR(f))
>> + continue;
>> +
>> + /* Leave just filemap's refcounts on the folio. */
>> + folio_put(f);
>> +
>> + if (is_split_operation)
>> + ret = kvm_gmem_split_folio_in_filemap(inode, f);
> kvm_gmem_try_split_folio_in_filemap()?
>
Here we know for sure that this was a private-to-shared
conversion. Hence, we know that there are at least some shared parts in
this huge page and we can skip checking that.
>> + else
>> + ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
>> +
For merge, we don't know whether the entire huge page might still
contain some other shared subpages, hence we "try" to merge by first
checking shareability for any remaining shared subpages.
>> + if (ret)
>> + goto rollback;
>> + }
>> + return ret;
>> +
>> +rollback:
>> + for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
Note to self: the first index -= to_nr_pages was meant to skip the index
that caused the failure, but this could cause an underflow if index = 0
when entering rollback. Need to fix this in the next revision.
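One possible shape of that fix, sketched here only to illustrate the
idea (not the actual next-revision code): do the decrement inside the
loop with an explicit bound check, so a failure on the very first folio
(index == start == 0) terminates cleanly instead of wrapping the
unsigned pgoff_t:

    while (index > start) {
            struct folio *f;

            /* Step back first, so the folio that failed is skipped. */
            index -= to_nr_pages;

            f = filemap_get_folio(inode->i_mapping, index);
            if (IS_ERR(f))
                    continue;

            /* Leave just filemap's refcounts on the folio. */
            folio_put(f);

            if (is_split_operation)
                    WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
            else
                    WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
    }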
>> + struct folio *f;
>> +
>> + f = filemap_get_folio(inode->i_mapping, index);
>> + if (IS_ERR(f))
>> + continue;
>> +
>> + /* Leave just filemap's refcounts on the folio. */
>> + folio_put(f);
>> +
>> + if (is_split_operation)
>> + WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
>> + else
>> + WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
> Is it safe to just leave WARN_ON()s in the rollback case?
>
Same as above. I don't think we have much of a choice.
> Besides, are the kvm_gmem_merge_folio_in_filemap() and
> kvm_gmem_split_folio_in_filemap() here duplicated with the
> kvm_gmem_split_folio_in_filemap() and kvm_gmem_try_merge_folio_in_filemap() in
> the following "r = kvm_gmem_convert_execute_work(inode, work, !shared)"?
>
This handles the case where some pages in the range [start, start +
nr_pages) were split and the failure was halfway through. I could call
kvm_gmem_convert_execute_work() with !shared but that would go over all
the folios again from the start.
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +#else
>> +
>> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> + struct folio *folio)
>> +{
>> + return 0;
>> +}
>> +
>> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>> + pgoff_t start, size_t nr_pages,
>> + bool is_split_operation)
>> +{
>> + return 0;
>> +}
>> +
>> +#endif
>> +
>