linux-kernel - Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map

Message-ID: <Zg3jFRZp8F514r8b@google.com>
Date: Wed, 3 Apr 2024 16:15:33 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Isaku Yamahata <isaku.yamahata@...el.com>
Cc: David Matlack <dmatlack@...gle.com>, kvm@...r.kernel.org, isaku.yamahata@...il.com, 
	linux-kernel@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>, 
	Michael Roth <michael.roth@....com>, Federico Parola <federico.parola@...ito.it>
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 05:51:51PM -0800,
> > Yes. For SNP or TDX, we'd like to map the exact GPA range; we don't want to
> > map zeroed pages around the range.  For SNP or TDX, mapping a page to a GPA is
> > a one-time operation, and it updates the measurement.
> > 
> > Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
> > and they are within the same 2M range.  Map GPA1 first.  If GPA2 is also mapped
> > (with zeros) by a 2M page, the subsequent mapping of GPA2 fails.  Even if
> > mapping GPA2 succeeds, the measurement may already have been updated when
> > GPA1 was mapped.
> > 
> > For SNP or TDX, it's the userspace VMM's responsibility to map a GPA range at
> > most once.  Is this too strict a requirement for the default VM use case, where
> > the goal is to mitigate KVM page faults at guest boot?  If so, what about a
> > flag like EXACT_MAPPING or something?
> 
> I'm thinking as follows. What do you think?
> 
> - Allow mapping larger than requested, using the gmem_max_level hook:

I don't see any reason to allow userspace to request a mapping level.  If the
prefetch is defined to have read fault semantics, KVM has all the wiggle room it
needs to do the optimal/sane thing, without having to worry about reconciling
userspace's desired mapping level.

>   Depends on the following patch [1].
>   The gmem_max_level hook allows the vendor backend to determine the max level.
>   By default (for the default VM or sw-protected VMs), it allows mappings up to
>   KVM_MAX_HUGEPAGE_LEVEL.  TDX allows only 4KB mappings.
> 
>   [1] https://lore.kernel.org/kvm/20231230172351.574091-31-michael.roth@amd.com/
>   [PATCH v11 30/35] KVM: x86: Add gmem hook for determining max NPT mapping level
> 
> - Pure mapping without coco operation:
>   As Sean suggested in [2], make KVM_MAP_MEMORY a pure mapping without coco
>   operations.  In the case of TDX, the API doesn't issue TDX-specific operations
>   like TDH.PAGE.ADD() and TDH.EXTEND.MR(); a separate TDX-specific API is needed.
> 
>   [2] https://lore.kernel.org/kvm/Ze-XW-EbT9vXaagC@google.com/
> 
> - KVM_MAP_MEMORY on an already-mapped area, potentially with a large page:
>   It succeeds; it is not an error.  It doesn't care whether the GPA is backed
>   by a large page or not.  Because the use case is pre-population before the
>   guest runs, it doesn't matter whether the given GPA was already mapped, or at
>   what large-page level it is backed.
> 
>   Do you want an error like -EEXIST?

No error.  As above, I think the ioctl() should behave like a read fault, i.e.
be an expensive nop if there's nothing to be done.

For VMA-based memory, userspace can operate on the userspace address.  E.g. if
userspace wants to break CoW, it can do that by writing from userspace.  And if
userspace wants to "request" a certain mapping level, it can do that by MADV_*.

For guest_memfd, there are no protections (everything is RWX, for now), and when
hugepage support comes along, userspace can simply manipulate the guest_memfd
instance as needed.