linux-kernel - Re: [PATCH Part2 v5 39/45] KVM: SVM: Introduce ops for the post gfn map and unmap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220908212114.sqne7awimfwfztq7@amd.com>
Date:   Thu, 8 Sep 2022 16:21:14 -0500
From:   Michael Roth <michael.roth@....com>
To:     Sean Christopherson <seanjc@...gle.com>
CC:     Brijesh Singh <brijesh.singh@....com>, <x86@...nel.org>,
        <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
        <linux-crypto@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        "Ingo Molnar" <mingo@...hat.com>, Joerg Roedel <jroedel@...e.de>,
        Tom Lendacky <thomas.lendacky@....com>,
        "H. Peter Anvin" <hpa@...or.com>, Ard Biesheuvel <ardb@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Sergio Lopez <slp@...hat.com>, Peter Gonda <pgonda@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Srinivas Pandruvada" <srinivas.pandruvada@...ux.intel.com>,
        David Rientjes <rientjes@...gle.com>,
        Dov Murik <dovmurik@...ux.ibm.com>,
        "Tobin Feldman-Fitzthum" <tobin@....com>,
        Borislav Petkov <bp@...en8.de>,
        "Vlastimil Babka" <vbabka@...e.cz>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        "Andi Kleen" <ak@...ux.intel.com>, <tony.luck@...el.com>,
        <marcorr@...gle.com>, <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        <jarkko@...fian.com>
Subject: Re: [PATCH Part2 v5 39/45] KVM: SVM: Introduce ops for the post gfn
 map and unmap

On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> On Fri, Oct 15, 2021, Brijesh Singh wrote:
> > 
> > On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> > >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > >>> either be a private or shared. A write from the hypervisor goes through
> > >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> > >>> to a guest private page, then it triggers an RMP violation #PF.
> > >>>
> > >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > >>> used to verify that its safe to map a given guest page. Use the SRCU to
> > >>> protect against the page state change for existing mapped pages.
> > >> SRCU isn't protecting anything.  The synchronize_srcu_expedited() in the PSC code
> > >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > >> from being created while the actual RMP updates are in-flight.  Most telling is
> > >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > > Argh, another goof on my part.  Rereading prior feedback, I see that I loosely
> > > suggested SRCU as a possible solution.  That was a bad, bad suggestion.  I think
> > > (hope) I made it offhand without really thinking it through.  SRCU can't work in
> > > this case, because the whole premise of Read-Copy-Update is that there can be
> > > multiple copies of the data.  That simply can't be true for the RMP as hardware
> > > operates on a single table.
> > >
> > > In the future, please don't hesitate to push back on and/or question suggestions,
> > > especially those that are made without concrete examples, i.e. are likely off the
> > > cuff.  My goal isn't to set you up for failure :-/
> > 
> > What do you think about going back to my initial proposal of per-gfn
> > tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> > and let the copy_to_user() take a fault and return an error (if it
> > attempt to write to guest private). If PSC happen while lock is held
> > then simplify return and let the guest retry PSC.
> 
> That approach is also broken as it doesn't hold a lock when updating host_write_track,
> e.g. the count can be corrupted if writers collide, and nothing blocks writers on
> in-progress readers.
> 
> I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
> to spend time iterating on the KVM case until consensus has been reached on how
> exactly RMP updates will be handled, and in general how the kernel will manage
> guest private memory.

Hi Sean,

(Sorry in advance for the long read, but it touches on a couple
inter-tangled topics that I think are important for this series.)

While we do still remain committed to working with community toward
implementing a UPM solution, we would still like to have a 'complete'
solution in the meantime, if at the very least to use as a basis for
building out other parts of the stack, or for enabling early adopters
downstream doing the same.

Toward that end, there are a couple areas that need to be addressed,
(and remain unaddressed with v6, since we were planning to leverage UPM
to handle them and so left them as open TODOs at the time):

 1) this issue of guarding against shared->private conversions for pages
    that are in-use by kernel
 2) how to deal with unmapping/splitting the direct map

(These 2 things end up being fairly tightly coupled, which is why I'm
trying to cover them both in this email. Will explain more in a bit)

You mentioned elsewhere how 1) would be nicely addressed with UPM, since
the conversion would only point the guest to an entirely new page, while
leaving the shared page intact (at least until the normal notifiers do
their thing in response to any subsequent discard operations from
userspace). If the guest breaks because it doesn't see the write, that's
okay, because it's not supposed to be switching in-use shared pages to
private while they are being used. (though the pKVM folks did note in v7
of private memslot patches that they actually use the same physical page
for both shared/private, so I'm not sure if that approach will still
stack up in that context, if it ends up being needed there)

So in the context of this interim solution, we're trying to look for a
solution that's simple enough that it can be used reliably, without
introducing too much additional complexity into KVM. There is one
approach that seems to fit that bill, that Brijesh attempted in an
earlier version of this series (I'm not sure what exactly was the
catalyst to changing the approach, as I wasn't really in the loop at
the time, but AIUI there weren't any showstoppers there, but please
correct me if I'm missing anything):

 - if the host is writing to a page that it thinks is supposed to be
   shared, and the guest switches it to private, we get an RMP fault
   (actually, we will get a !PRESENT fault, since as of v5 we now
   remove the mapping from the directmap as part of conversion)
 - in the host #PF handler, if we see that the page is marked private
   in the RMP table, simply switch it back to shared
 - if this was a bug on the part of the host, then the guest will see
   the validated bit was cleared, and get a #VC where it can
   terminate accordingly
 - if this was a bug (or maliciousness) on the part of the guest,
   then the behavior is unexpected anyway, and it will be made
   aware this by the same mechanism

We've done some testing with this approach and it seems to hold up,
but since we also need the handler to deal with the !PRESENT case
(because of the current approach of nuking directmap entry for
private pages), there's a situation that can't be easily resolved
with this approach:

  # CASE1: recoverable
  
    THREAD 1              THREAD 2
    - #PF(!PRESENT)       - #PF(!PRESENT)
    - restore directmap
    - RMPUPDATE(shared)
                          - not private? okay to retry since maybe it was
                            fixed up already
  
  # CASE2: unrecoverable
  
    THREAD 1
    - broken kernel breaks directmap/init-mm, not related to SEV
    - #PF(!PRESENT)
    - not private? we should oops, since retrying might cause a loop
  
The reason we can't distinguish between recoverable/unrecoverable is
because the kernel #PF handling here needs to happen for both !PRESENT
and for RMP violations. This is due to how we unmap from directmap as
part of shared->private conversion.

So we have to take the pessimistic approach due to CASE2, which
means that if CASE1 pops up, we'll still have a case where guest can
cause a crash/loop.

That where 2) comes into play. If we go back to the original approach
(as of v4) of simply splitting the directmap when we do shared->private
conversions, then we can limit the #PF handling to only have to deal
with cases where there's an RMP violation and X86_PF_RMP is set, which
makes it safe to retry to handle spurious cases like case #1.

AIUI, this is still sort of an open question, but you noted how nuking
the directmap without any formalized interface for letting the kernel
know about it could be problematic down the road, which also sounds
like the sort of thing more suited for having UPM address at a more
general level, since there are similar requirements for TDX as well.

AIUI there are 2 main arguments against splitting the directmap:
 a) we can't easily rebuild it atm
 b) things like KSM might still tries to access private pages

But unmapping also suffers from a), since we still end up splitting the
directmap unless we remove pages in blocks of 2M. But nothing prevents
a guest from switching a single 4K page to private, in which case we
are forced to split. That would be normal behavior on the part of the
guest for setting up GHCB pages/etc, so we still end up splitting the
directmap over time.

And for b), as you noted, this is already something that SEV/SEV-ES
need to deal with, and at least in the case of SNP guest things are
a bit better since:

  -  if some kernel thread tried to write to private memory, they
     would notice if some errant kernel thread tried to write to guest
     memory, since #PF handler with flip the page and unset the
     validated bit on that memory.

  - for reads, they will get ciphertext, which isn't any worse than
    reading unencrypted guest memory for anything outside of KVM that
    specifically knows what it is expecting to read there (KSM being
    one exception, since it'll waste cycles cmp'ing ciphertext, but
    there's no way to make it work anyway, and we would be fine with
    requiring it to be disabled if that is a concern)

So that's sort of the context for why we're considering these 2 changes.
Any input on these is welcome, but at the very least wanted to provide
our rationale for past reviewers.

Thanks!

-Mike