linux-kernel - Re: [PATCH RFC v9 04/51] KVM: x86: Determine shared/private faults using a configurable mask

Message-ID: <20230622153229.vjkrzi6rgiolstns@amd.com>
Date:   Thu, 22 Jun 2023 10:32:29 -0500
From:   Michael Roth <michael.roth@....com>
To:     "Huang, Kai" <kai.huang@...el.com>
CC:     "isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "tobin@....com" <tobin@....com>,
        "liam.merwick@...cle.com" <liam.merwick@...cle.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "Luck, Tony" <tony.luck@...el.com>,
        "jmattson@...gle.com" <jmattson@...gle.com>,
        "Lutomirski, Andy" <luto@...nel.org>,
        "ak@...ux.intel.com" <ak@...ux.intel.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "pgonda@...gle.com" <pgonda@...gle.com>,
        "srinivas.pandruvada@...ux.intel.com" 
        <srinivas.pandruvada@...ux.intel.com>,
        "slp@...hat.com" <slp@...hat.com>,
        "rientjes@...gle.com" <rientjes@...gle.com>,
        "alpergun@...gle.com" <alpergun@...gle.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "dovmurik@...ux.ibm.com" <dovmurik@...ux.ibm.com>,
        "thomas.lendacky@....com" <thomas.lendacky@....com>,
        "Wang, Zhi A" <zhi.a.wang@...el.com>,
        "x86@...nel.org" <x86@...nel.org>, "bp@...en8.de" <bp@...en8.de>,
        "Annapurve, Vishal" <vannapurve@...gle.com>,
        "dgilbert@...hat.com" <dgilbert@...hat.com>,
        "Christopherson,, Sean" <seanjc@...gle.com>,
        "vkuznets@...hat.com" <vkuznets@...hat.com>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "marcorr@...gle.com" <marcorr@...gle.com>,
        "ashish.kalra@....com" <ashish.kalra@....com>,
        "linux-coco@...ts.linux.dev" <linux-coco@...ts.linux.dev>,
        "nikunj.dadhania@....com" <nikunj.dadhania@....com>,
        "Rodel, Jorg" <jroedel@...e.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "sathyanarayanan.kuppuswamy@...ux.intel.com" 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "kirill@...temov.name" <kirill@...temov.name>,
        "jarkko@...nel.org" <jarkko@...nel.org>,
        "ardb@...nel.org" <ardb@...nel.org>,
        "linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>
Subject: Re: [PATCH RFC v9 04/51] KVM: x86: Determine shared/private faults
 using a configurable mask

On Thu, Jun 22, 2023 at 09:55:22AM +0000, Huang, Kai wrote:
> 
> > 
> > So if we were to straight-forwardly implement that based on how TDX
> > currently handles checking for the shared bit in GPA, paired with how
> > SEV-SNP handles checking for private bit in fault flags, it would look
> > something like:
> > 
> >   bool kvm_fault_is_private(kvm, gpa, err)
> >   {
> >     /* SEV-SNP handling */
> >     if (kvm->arch.mmu_private_fault_mask)
> >       return !!(err & kvm->arch.mmu_private_fault_mask);
> > 
> >     /* TDX handling */
> >     if (kvm->arch.gfn_shared_mask)
> >       return !!(gpa & arch.gfn_shared_mask);
> 
> The logic of the two are identical.  I think they need to be converged.

I think they're just different enough that trying too hard to converge
them might obfuscate things. If the determination didn't come from 2
completely different fields (gpa vs. fault flags) maybe it could be
simplified a bit more, but having a well-defined, open-coded handler
that gets called once to set fault->is_private at initial fault time,
so that the ugliness never needs to be looked at again by the KVM MMU,
seems like a good way to keep things simple through the rest of the
handling.

> 
> Either SEV-SNP should convert the error code private bit to the gfn_shared_mask,
> or TDX's shared bit should be converted to some private error bit.

struct kvm_page_fault seems to be the preferred way to pass additional
params/metadata around, and .is_private field was introduced to track
this private/shared state as part of UPM base series:

  https://lore.kernel.org/lkml/20221202061347.1070246-9-chao.p.peng@linux.intel.com/

So it seems like unnecessary complexity to track/encode that state in
additional places rather than encapsulating it all in fault->is_private
(or some similar field). Synthesizing all the platform-specific
handling into one well-defined value, so that KVM MMU doesn't need to
worry as much about platform-specific stuff, seems like a good thing,
and is in line with what the UPM base series was trying to do by adding
the fault->is_private field.

So all I'm really proposing is that whatever SNP and TDX end up doing
should center around setting that fault->is_private field and keeping
everything contained there. If there are better ways to handle *how*
that's done I don't have any complaints there, but moving/adding bits
to GPA/error_flags after fault time just seems unnecessary to me when
fault->is_private field can serve that purpose just as well.
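
A minimal user-space sketch of that idea, assuming mock stand-ins for
struct kvm and struct kvm_page_fault that carry only the fields
discussed here. The struct layouts, kvm_init_fault(), and the assert
that at most one mask is set are hypothetical and not taken from any
posted patch. The TDX branch inverts the test, since that mask marks
*shared* accesses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock stand-ins for the kernel structures (illustration only). */
struct kvm_arch {
	uint64_t mmu_private_fault_mask; /* SNP-style: bit(s) in the fault error code */
	uint64_t gfn_shared_mask;        /* TDX-style: shared bit, applied to the GPA here */
};

struct kvm {
	struct kvm_arch arch;
};

struct kvm_page_fault {
	uint64_t addr;
	uint64_t error_code;
	bool is_private;
};

/* Single open-coded place where platform-specific bits are inspected. */
static bool kvm_fault_is_private(const struct kvm *kvm, uint64_t gpa, uint64_t err)
{
	/* Assumption of this sketch: a VM sets at most one of the two masks. */
	assert(!(kvm->arch.mmu_private_fault_mask && kvm->arch.gfn_shared_mask));

	/* SNP-style: a set bit in the error code means "private". */
	if (kvm->arch.mmu_private_fault_mask)
		return !!(err & kvm->arch.mmu_private_fault_mask);

	/* TDX-style: a set shared bit means "shared", so private is the inverse. */
	if (kvm->arch.gfn_shared_mask)
		return !(gpa & kvm->arch.gfn_shared_mask);

	return false;
}

/* Hypothetical helper: compute is_private once, at fault-setup time. */
static struct kvm_page_fault kvm_init_fault(const struct kvm *kvm, uint64_t gpa, uint64_t err)
{
	struct kvm_page_fault fault = {
		.addr = gpa,
		.error_code = err,
		.is_private = kvm_fault_is_private(kvm, gpa, err),
	};

	return fault;
}
```

Everything after kvm_init_fault() only needs to look at
fault.is_private; the masks are never consulted again.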

> 
> Perhaps converting SEV-SNP makes more sense because if I recall correctly SEV
> guest also has a C-bit, correct?

That's correct, but the C-bit doesn't show in the GPA that gets passed
up to KVM during an #NPF, and instead gets encoded into the fault's
error_flags.

> 
> Or, ...
> 
> > 
> >     return false;
> >   }
> > 
> >   kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
> >   {
> >     struct kvm_page_fault fault = {
> >       ...
> >       .is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)
> 
> ... should we do something like:
> 
> 	.is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, gpa, 
> 							    err);

We actually had exactly this in v7 of SNP hypervisor patches:

  https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m17841f5bfdfb8350d69d78c6831dd8f3a4748638

but Sean was hoping to avoid a callback, which is why we ended up using
a bitmask in this version, since that's basically all the callback
would need to do. It's unfortunate that we don't have a common scheme
between SNP/TDX, but maybe that's still possible. I just think that
whatever that ends up being, it should live and be contained inside
whatever helper ends up setting fault->is_private.

There's some other awkwardness with a callback approach. It sort of ties
into your question about selftests so I'll address it below...


> 
> ?
> 
> >     };
> > 
> >     ...
> >   }
> > 
> > And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
> > set per-KVM-instance, just like they are now with current SNP and TDX
> > patchsets, since stuff like KVM self-test wouldn't be setting those
> > masks, so it makes sense to do it per-instance in that regard.
> > 
> > But that still gets a little awkward for the KVM self-test use-case where
> > .is_private should sort of be ignored in favor of whatever the xarray
> > reports via kvm_mem_is_private(). 
> > 
> 
> I must have missed something.  Why does KVM self-test have impact to how does
> KVM handles private fault? 

The self-tests I'm referring to here are the ones from Vishal that shipped with
v10 of Chao's UPM/fd-based private memory series, and also as part of Sean's
gmem tree:

  https://github.com/sean-jc/linux/commit/a0f5f8c911804f55935094ad3a277301704330a6

These exercise gmem/UPM handling without the need for any SNP/TDX
hardware support. They do so by "trusting" the shared/private state
that the VMM sets via KVM_SET_MEMORY_ATTRIBUTES. So if VMM says it
should be private, KVM MMU will treat it as private, so we'd never
get a mismatch, so KVM_EXIT_MEMORY_FAULT will never be generated.

> 
> > In your Misc. series I believe you
> > handled this by introducing a PFERR_HASATTR_MASK bit so we can determine
> > whether existing value of fault->is_private should be
> > ignored/overwritten or not.
> > 
> > So maybe kvm_fault_is_private() needs to return an integer value
> > instead, like:
> > 
> >   enum {
> >     KVM_FAULT_VMM_DEFINED,
> >     KVM_FAULT_SHARED,
> >     KVM_FAULT_PRIVATE,
> >   };
> > 
> >   int kvm_fault_is_private(kvm, gpa, err)
> >   {
> >     /* SEV-SNP handling */
> >     if (kvm->arch.mmu_private_fault_mask)
> >       return (err & kvm->arch.mmu_private_fault_mask) ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;
> > 
> >     /* TDX handling */
> >     if (kvm->arch.gfn_shared_mask)
> >       return (gpa & kvm->arch.gfn_shared_mask) ? KVM_FAULT_SHARED : KVM_FAULT_PRIVATE;
> > 
> >     return KVM_FAULT_VMM_DEFINED;
> >   }
> > 
> > And then down in __kvm_faultin_pfn() we do:
> > 
> >   if (fault->is_private == KVM_FAULT_VMM_DEFINED)
> >     fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
> >   else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> >     return kvm_do_memory_fault_exit(vcpu, fault);
> > 
> >   if (fault->is_private)
> >     return kvm_faultin_pfn_private(vcpu, fault);
> 
> 
> What does KVM_FAULT_VMM_DEFINED mean, exactly?
> 
> Shouldn't the fault type come from _hardware_?

In the above self-test use-case, there is no reliance on hardware support,
and fault->is_private should always be treated as being whatever was
set by the VMM via KVM_SET_MEMORY_ATTRIBUTES, so that's why I proposed
the KVM_FAULT_VMM_DEFINED value to encode that case into
fault->is_private so KVM MMU can handle protected self-test VMs of this
sort.
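
To make the three cases concrete, here is a small user-space sketch of
how such a tri-state value could be resolved against the VMM-set
attribute. The attr_private parameter stands in for the result of
kvm_mem_is_private(); enum faultin_result, kvm_classify_fault() and
kvm_resolve_fault() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum kvm_fault_type {
	KVM_FAULT_VMM_DEFINED, /* no hardware indication; trust the VMM attribute */
	KVM_FAULT_SHARED,
	KVM_FAULT_PRIVATE,
};

/* Hypothetical outcome of the fault-in step. */
enum faultin_result {
	FAULTIN_SHARED,
	FAULTIN_PRIVATE,
	FAULTIN_MEMORY_FAULT_EXIT, /* mismatch: would exit with KVM_EXIT_MEMORY_FAULT */
};

/* Mock of the two per-VM masks (illustration only). */
struct kvm_arch {
	uint64_t mmu_private_fault_mask;
	uint64_t gfn_shared_mask;
};

static enum kvm_fault_type kvm_classify_fault(const struct kvm_arch *arch,
					      uint64_t gpa, uint64_t err)
{
	if (arch->mmu_private_fault_mask)
		return (err & arch->mmu_private_fault_mask) ? KVM_FAULT_PRIVATE
							    : KVM_FAULT_SHARED;
	if (arch->gfn_shared_mask)
		return (gpa & arch->gfn_shared_mask) ? KVM_FAULT_SHARED
						     : KVM_FAULT_PRIVATE;
	return KVM_FAULT_VMM_DEFINED;
}

/* attr_private: stand-in for what kvm_mem_is_private() would report. */
static enum faultin_result kvm_resolve_fault(enum kvm_fault_type type, bool attr_private)
{
	bool hw_private;

	/* Self-test / non-hardware case: the VMM attribute is authoritative. */
	if (type == KVM_FAULT_VMM_DEFINED)
		return attr_private ? FAULTIN_PRIVATE : FAULTIN_SHARED;

	assert(type == KVM_FAULT_SHARED || type == KVM_FAULT_PRIVATE);
	hw_private = (type == KVM_FAULT_PRIVATE);

	/* Hardware and VMM disagree: report it rather than guess. */
	if (hw_private != attr_private)
		return FAULTIN_MEMORY_FAULT_EXIT;

	return hw_private ? FAULTIN_PRIVATE : FAULTIN_SHARED;
}
```

The sketch converts the enum to a bool before comparing it with the
attribute, so KVM_FAULT_SHARED is never mistaken for "true".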

In the future, these protected self-test VMs might become the basis of
a new protected VM type where some sort of guest-issued hypercall can
be used to set whether a fault should be treated as shared/private,
rather than relying on a VMM-defined value. There's some current discussion
about that here:

  https://lore.kernel.org/lkml/20230620190443.GU2244082@ls.amr.corp.intel.com/T/#me627bed3d9acf73ea882e8baa76dfcb27759c440

Going back to your callback question above, that makes things a little
awkward, since kvm_x86_ops is statically defined for both
kvm_amd/kvm_intel modules, and either can run these self-test guests as
well as these proposed "non-CC VMs" which rely on enlightened guest
kernels instead of TDX/SNP hardware support for managing private/shared
access.

So you either need to duplicate the handling for how to determine
private/shared for these other types into the kvm_intel/kvm_amd callbacks,
or have some way for the callback to say "fall back to the common
handling for self-tests and non-CC VMs". The latter is what we implemented
in v8 of this series, but Isaku suggested it was a bit too heavyweight
and proposed dropping the fall-back logic in favor of updating the
kvm_x86_ops at run-time once we know whether or not it's a TDX/SNP guest:

  https://lkml.iu.edu/hypermail/linux/kernel/2303.2/03009.html

which could work, but it still doesn't address Sean's desire to avoid
callbacks completely, and still amounts to a somewhat convoluted way
to hide away TDX/SNP-specific bit checks for shared/private. Rather
than hide them away in callbacks that are already frowned upon by the
maintainer, I think it makes sense to "open-code" all these checks in a
common handler like kvm_fault_is_private() so we can make some progress
toward a consensus, and then iterate on it from there rather than
refining what may already be a dead-end path.
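
For comparison, the extra plumbing a callback with a fall-back signal
needs might look roughly like the user-space sketch below. This is not
the v8 code: struct vendor_ops, FAULT_USE_COMMON, the is_snp flag and
the attr_private field (a stand-in for kvm_mem_is_private()) are all
hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: "callback has no opinion, use common handling". */
#define FAULT_USE_COMMON (-1)

/* Mock per-VM state (illustration only). */
struct kvm {
	bool is_snp;               /* whether this VM uses hardware-backed private memory */
	uint64_t private_err_mask; /* error-code bit(s) that mark a private fault */
	bool attr_private;         /* stand-in for kvm_mem_is_private() */
};

/* Mock of a statically defined, per-module ops table. */
struct vendor_ops {
	int (*fault_is_private)(const struct kvm *kvm, uint64_t gpa, uint64_t err);
};

/* Vendor callback: must punt for self-test and non-CC VMs on the same module. */
static int svm_like_fault_is_private(const struct kvm *kvm, uint64_t gpa, uint64_t err)
{
	(void)gpa; /* unused here: the indication comes from the error code */

	if (!kvm->is_snp)
		return FAULT_USE_COMMON;

	return !!(err & kvm->private_err_mask);
}

/* Common code: call the vendor hook, then fall back if it declined. */
static bool fault_is_private(const struct vendor_ops *ops, const struct kvm *kvm,
			     uint64_t gpa, uint64_t err)
{
	int ret = FAULT_USE_COMMON;

	assert(ops != NULL && kvm != NULL);

	if (ops->fault_is_private)
		ret = ops->fault_is_private(kvm, gpa, err);

	if (ret == FAULT_USE_COMMON)
		return kvm->attr_private;

	return ret != 0;
}
```

The bit test itself is one line; the sentinel, the ops table and the
fall-back path are all overhead that an open-coded common helper avoids.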

-Mike