Message-ID: <ZmeOxAtwfTsDCi1x@google.com>
Date: Mon, 10 Jun 2024 16:40:52 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: Isaku Yamahata <isaku.yamahata@...el.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, michael.roth@....com, isaku.yamahata@...ux.intel.com
Subject: Re: [PATCH 09/11] KVM: guest_memfd: Add interface for populating gmem
pages with user data
On Tue, Jun 11, 2024, Paolo Bonzini wrote:
> On 6/10/24 23:48, Paolo Bonzini wrote:
> > On Sat, Jun 8, 2024 at 1:03 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > > SNP folks and/or Paolo, what's the plan for this? I don't see how what's sitting
> > > in kvm/next can possibly be correct without conditioning population on the folio
> > > being !uptodate.
> >
> > I don't think I have time to look at it closely until Friday; but
> > thanks for reminding me.
>
> Ok, I'm officially confused. I think I understand what you did in your
> suggested code. Limiting it to the bare minimum (keeping the callback
> instead of CONFIG_HAVE_KVM_GMEM_INITIALIZE) it would be something
> like what I include at the end of the message.
>
> But the discussion upthread was about whether to do the check for
> RMP state in sev.c, or do it in common code using folio_mark_uptodate().
> I am not sure what you mean by "cannot possibly be correct", and
> whether it's referring to kvm_gmem_populate() in general or the
> callback in sev_gmem_post_populate().
Doing fallocate() before KVM_SEV_SNP_LAUNCH_UPDATE will cause the latter to fail.
That likely works for QEMU, at least for now, but it's unnecessarily restrictive
and IMO incorrect/wrong.
E.g. a more convoluted fallocate() + PUNCH_HOLE + KVM_SEV_SNP_LAUNCH_UPDATE
sequence will work (I think? AFAICT adding and removing pages directly to/from
the RMP doesn't affect SNP's measurement; only pages that are added via
SNP_LAUNCH_UPDATE affect the measurement).
Punting the sanity check to vendor code is also gross and will make it harder to
provide a consistent, unified ABI for all architectures. E.g. SNP returns -EINVAL
if the page is already assigned, which is quite misleading.
> The change below looks like just an optimization to me, which
> suggests that I'm missing something glaring.
I really dislike @prepare. There are two paths that should actually initialize
the contents of the folio, and they are mutually exclusive and have meaningfully
different behavior. Faulting in memory via kvm_gmem_get_pfn() explicitly zeros
the folio _if necessary_, whereas kvm_gmem_populate() initializes the folio with
user-provided data _and_ requires that the folio be !uptodate.
If we fix the above oddity where fallocate() initializes memory, then there's
no need to try and handle the initialization in a common chokepoint as the two
relevant paths will naturally have unique code.
The below is also still suboptimal for TDX, as KVM will zero the memory and then
the TDX module will also zero the memory on PAGE.AUG.
And I find SNP to be the odd one. IIUC, the ASP (the artist formerly known as
the PSP) doesn't provide any guarantees about the contents of a page that is
assigned to a guest without bouncing through SNP_LAUNCH_UPDATE. It'd be nice to
explicitly document that somewhere in the SNP code. E.g. if guest_memfd invokes
a common kvm_gmem_initialize_folio() or whatever, then SNP's implementation can
clearly capture that KVM zeros the page to protect the _host_ data.
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d4206e53a9c81..a0417ef5b86eb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -52,37 +52,39 @@ static int kvm_gmem_prepare_folio(struct inode *inode, pgoff_t index, struct fol
> static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare)
> {
> struct folio *folio;
> + int r;
> /* TODO: Support huge pages. */
> folio = filemap_grab_folio(inode->i_mapping, index);
> if (IS_ERR(folio))
> return folio;
> - /*
> - * Use the up-to-date flag to track whether or not the memory has been
> - * zeroed before being handed off to the guest. There is no backing
> - * storage for the memory, so the folio will remain up-to-date until
> - * it's removed.
> - *
> - * TODO: Skip clearing pages when trusted firmware will do it when
> - * assigning memory to the guest.
> - */
> - if (!folio_test_uptodate(folio)) {
> - unsigned long nr_pages = folio_nr_pages(folio);
> - unsigned long i;
> + if (prepare) {
> + /*
> + * Use the up-to-date flag to track whether or not the memory has
> + * been handed off to the guest. There is no backing storage for
> + * the memory, so the folio will remain up-to-date until it's
> + * removed.
> + *
> + * Take the occasion of the first prepare operation to clear it.
> + */
> + if (!folio_test_uptodate(folio)) {
> + unsigned long nr_pages = folio_nr_pages(folio);
> + unsigned long i;
> - for (i = 0; i < nr_pages; i++)
> - clear_highpage(folio_page(folio, i));
> + for (i = 0; i < nr_pages; i++)
> + clear_highpage(folio_page(folio, i));
> + }
> +
> + r = kvm_gmem_prepare_folio(inode, index, folio);
> + if (r < 0)
> + goto err_unlock_put;
> folio_mark_uptodate(folio);
> - }
> -
> - if (prepare) {
> - int r = kvm_gmem_prepare_folio(inode, index, folio);
> - if (r < 0) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return ERR_PTR(r);
> + } else {
> + if (folio_test_uptodate(folio)) {
> + r = -EEXIST;
> + goto err_unlock_put;
> }
> }