Message-ID: <diqzh5wj2lc4.fsf@google.com>
Date: Wed, 01 Oct 2025 06:21:47 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>, pbonzini@...hat.com, seanjc@...gle.com
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
rick.p.edgecombe@...el.com, dave.hansen@...el.com, kas@...nel.org,
tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com,
david@...hat.com, vannapurve@...gle.com, vbabka@...e.cz,
thomas.lendacky@....com, pgonda@...gle.com, zhiquan1.li@...el.com,
fan.du@...el.com, jun.miao@...el.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, xiaoyao.li@...el.com, binbin.wu@...ux.intel.com,
chao.p.peng@...el.com, yan.y.zhao@...el.com
Subject: Re: [RFC PATCH v2 17/23] KVM: guest_memfd: Split for punch hole and
private-to-shared conversion
Yan Zhao <yan.y.zhao@...el.com> writes:
Thanks Yan! Just got around to looking at this, sorry about the delay!
> In TDX, private page tables require precise zapping because faulting back
> the zapped mappings necessitates the guest's re-acceptance. Therefore,
> before performing a zap for hole punching and private-to-shared
> conversions, huge leafs that cross the boundary of the zapping GFN range in
> the mirror page table must be split.
>
> Splitting may result in an error. If this happens, hole punching and
> private-to-shared conversion should bail out early and return an error to
> userspace.
>
> Splitting is not necessary for kvm_gmem_release() since the entire page
> table is being zapped, nor for kvm_gmem_error_folio() as an SPTE must not
> map more than one physical folio.
>
> Therefore, in this patch,
> - break kvm_gmem_invalidate_begin_and_zap() into
> kvm_gmem_invalidate_begin() and kvm_gmem_zap() and have
> kvm_gmem_release() and kvm_gmem_error_folio() to invoke them.
>
I think separating invalidate and zap could perhaps be a separate patch
from adding the split step into the flow; that would make this patch
smaller and easier to review.
No action required from you for now; I have the above part in a
separate patch already (not yet posted).
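For reference, here's roughly the shape I have in mind for that separate
patch (a sketch only, reusing the helper names from this patch; the real
kvm_gmem_release() obviously carries more than what's shown here):

/*
 * Sketch: with invalidate and zap separated, kvm_gmem_release() calls the
 * two helpers back to back. No splitting is needed since the entire range
 * is going away.
 */
static int kvm_gmem_release(struct inode *inode, struct file *file)
{
	struct kvm_gmem *gmem = file->private_data;

	/* Begin invalidation over the whole file. */
	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
	/* Zap both shared and private mappings for the entire file. */
	kvm_gmem_zap(gmem, 0, -1ul, KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);

	/* ... unbind memslots, kvm_gmem_invalidate_end(), etc. as before ... */
	return 0;
}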
> - have kvm_gmem_punch_hole() to invoke kvm_gmem_invalidate_begin(),
> kvm_gmem_split_private(), and kvm_gmem_zap().
> Bail out if kvm_gmem_split_private() returns error.
>
IIUC the current upstream position is that hole punching will not
be permitted for ranges smaller than the page size for the entire
guest_memfd.
Hence no splitting required during hole punch?
+ 4K guest_memfd: no splitting required since the EPT entries will not
be larger than 4K anyway
+ 2M and 1G (x86) guest_memfd: no splitting required since the entire
EPT entry will have to go away for valid ranges (valid ranges are
either 2M or 1G anyway)
Does that sound right?
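For reference, what I mean by the upstream position is something along
these lines in kvm_gmem_punch_hole() (purely illustrative;
kvm_gmem_inode_page_size() is a made-up helper standing in for however the
per-inode page size ends up being tracked):

/*
 * Illustrative guard only: reject hole punches that are not aligned to
 * the guest_memfd's page size, so a punch can never straddle a huge EPT
 * entry and no splitting is needed on this path.
 */
static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	unsigned long page_size = kvm_gmem_inode_page_size(inode);

	if (!IS_ALIGNED(offset, page_size) || !IS_ALIGNED(len, page_size))
		return -EINVAL;

	/* ... invalidate_begin, zap, truncate, invalidate_end as before ... */
	return 0;
}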
> - drop the old kvm_gmem_unmap_private() and have private-to-shared
> conversion to invoke kvm_gmem_split_private() and kvm_gmem_zap() instead.
> Bail out if kvm_gmem_split_private() returns error.
>
> Co-developed-by: Ackerley Tng <ackerleytng@...gle.com>
> Signed-off-by: Ackerley Tng <ackerleytng@...gle.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
>
> [...snip...]
>
> @@ -514,6 +554,8 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> struct conversion_work *work,
> bool to_shared, pgoff_t *error_index)
> {
> + int ret = 0;
> +
> if (to_shared) {
> struct list_head *gmem_list;
> struct kvm_gmem *gmem;
> @@ -522,19 +564,24 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
> work_end = work->start + work->nr_pages;
>
> gmem_list = &inode->i_mapping->i_private_list;
> + list_for_each_entry(gmem, gmem_list, entry) {
> + ret = kvm_gmem_split_private(gmem, work->start, work_end);
> + if (ret)
> + return ret;
> + }
I will be refactoring the conversion steps a little for the next version
of this series, hence I'd like to ask about the requirements around
splitting before doing that refactor.
The requirement is to split before zapping, right? Other than that, we
technically don't need to split before checking for a safe refcount,
right?
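In other words, I'm picturing the hard ordering constraint as something
like the below (a sketch with a hypothetical helper name, not proposed
code): only the zap depends on the split having succeeded, and the split
is independent of the shared-side unmap and refcount check.

/*
 * Hypothetical helper, for illustration only: splitting just has to
 * happen before the private range is zapped.
 */
static int kvm_gmem_split_and_zap_private(struct kvm_gmem *gmem,
					  pgoff_t start, pgoff_t end)
{
	int ret;

	/* Split any huge leafs crossing [start, end) first ... */
	ret = kvm_gmem_split_private(gmem, start, end);
	if (ret)
		return ret;

	/* ... then precise zapping of the private range is safe. */
	kvm_gmem_zap(gmem, start, end, KVM_FILTER_PRIVATE);
	return 0;
}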
> list_for_each_entry(gmem, gmem_list, entry)
> - kvm_gmem_unmap_private(gmem, work->start, work_end);
> + kvm_gmem_zap(gmem, work->start, work_end, KVM_FILTER_PRIVATE);
> } else {
> unmap_mapping_pages(inode->i_mapping, work->start,
> work->nr_pages, false);
>
> if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
> work->nr_pages, error_index)) {
> - return -EAGAIN;
> + ret = -EAGAIN;
> }
> }
>
> - return 0;
> + return ret;
> }
>
>
> [...snip...]
>
> @@ -1906,8 +1926,14 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> start = folio->index;
> end = start + folio_nr_pages(folio);
>
> - list_for_each_entry(gmem, gmem_list, entry)
> - kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> + /* The size of the SEPT will not exceed the size of the folio */
I think splitting might be required here, but that depends on whether we
want to unmap just a part of the huge folio or the entire folio.
There are lots of open questions on memory failure handling, but for now
I think this makes sense.
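If we did decide to unmap only the poisoned subrange instead of the whole
folio, I'd expect the shape to mirror the punch hole / conversion paths,
i.e. split first and bail if that fails. Hypothetical sketch only:

/*
 * Hypothetical: only needed if we chose to unmap just the poisoned
 * subrange rather than the entire huge folio.
 */
static int kvm_gmem_zap_poisoned_range(struct kvm_gmem *gmem,
				       pgoff_t start, pgoff_t end)
{
	int ret;

	kvm_gmem_invalidate_begin(gmem, start, end);

	/* Split huge leafs crossing the poisoned range before zapping. */
	ret = kvm_gmem_split_private(gmem, start, end);
	if (ret)
		return ret;

	kvm_gmem_zap(gmem, start, end,
		     KVM_FILTER_PRIVATE | KVM_FILTER_SHARED);
	return 0;
}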
> + list_for_each_entry(gmem, gmem_list, entry) {
> + enum kvm_gfn_range_filter filter;
> +
> + kvm_gmem_invalidate_begin(gmem, start, end);
> + filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED;
> + kvm_gmem_zap(gmem, start, end, filter);
> + }
>
> /*
> * Do not truncate the range, what action is taken in response to the
> --
> 2.43.2