linux-kernel - Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL715WL9TJzDxZE8_gfhUQFGtOAydG0kyuSbzkqWTs3pc57j7A@mail.gmail.com>
Date:   Fri, 18 Aug 2023 19:08:30 -0700
From:   Mingwei Zhang <mizhang@...gle.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        Jacky Li <jackyli@...gle.com>
Cc:     Ashish Kalra <ashish.kalra@....com>, isaku.yamahata@...el.com,
        kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        isaku.yamahata@...il.com, Michael Roth <michael.roth@....com>,
        Paolo Bonzini <pbonzini@...hat.com>, erdemaktas@...gle.com,
        Sagi Shahar <sagis@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Kai Huang <kai.huang@...el.com>,
        Zhi Wang <zhi.wang.linux@...il.com>, chen.bo@...el.com,
        linux-coco@...ts.linux.dev,
        Chao Peng <chao.p.peng@...ux.intel.com>,
        Ackerley Tng <ackerleytng@...gle.com>,
        Vishal Annapurve <vannapurve@...gle.com>,
        Yuan Yao <yuan.yao@...ux.intel.com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Xu Yilun <yilun.xu@...el.com>,
        Quentin Perret <qperret@...gle.com>, wei.w.wang@...el.com,
        Fuad Tabba <tabba@...gle.com>
Subject: Re: [PATCH 4/8] KVM: gmem: protect kvm_mmu_invalidate_end()

+Jacky Li

On Fri, Aug 18, 2023 at 3:45 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> +Mingwei to correct me if I'm wrong
>
> On Fri, Aug 18, 2023, Ashish Kalra wrote:
> >
> > On 8/18/2023 12:55 PM, Sean Christopherson wrote:
> > > On Tue, Aug 15, 2023, isaku.yamahata@...el.com wrote:
> > > > From: Isaku Yamahata <isaku.yamahata@...el.com>
> > > >
> > > > kvm_mmu_invalidate_end() updates struct kvm::mmu_invalidate_in_progress
> > > > and it's protected by kvm::mmu_lock.  call kvm_mmu_invalidate_end() before
> > > > unlocking it. Not after the unlock.
> > > >
> > > > Fixes: 8e9009ca6d14 ("KVM: Introduce per-page memory attributes")
> > >
> > > This fixes is wrong.  It won't matter in the long run, but it makes my life that
> > > much harder.
> > >
> > > > Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
> > > > ---
> > > >   virt/kvm/kvm_main.c | 15 ++++++++++++++-
> > > >   1 file changed, 14 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 8bfeb615fc4d..49380cd62367 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -535,6 +535,7 @@ struct kvm_mmu_notifier_range {
> > > >           } arg;
> > > >           gfn_handler_t handler;
> > > >           on_lock_fn_t on_lock;
> > > > + on_unlock_fn_t before_unlock;
> > > >           on_unlock_fn_t on_unlock;
> > >
> > > Ugh, shame on my past me.  Having on_lock and on_unlock be asymmetrical with respect
> > > to the lock is nasty.
> > >
> > > I would much rather we either (a) be explicit, e.g. before_(un)lock and after_(un)lock,
> > > or (b) have just on_(un)lock, make them symetrical, and handle the SEV mess a
> > > different way.
> > >
> > > The SEV hook doesn't actually care about running immediately after unlock, it just
> > > wants to know if there was an overlapping memslot.  It can run after SRCU is dropped,
> > > because even if we make the behavior more precise (right now it blasts WBINVD),
> > > just having a reference to memslots isn't sufficient, the code needs to guarantee
> > > memslots are *stable*.  And that is already guaranteed by the notifier code, i.e.
> > > the SEV code could just reacquire SRCU.
> >
> > On a separate note here, the SEV hook blasting WBINVD is still causing
> > serious performance degradation issues with SNP triggered via
> > AutoNUMA/numad/KSM, etc. With reference to previous discussions related to
> > it, we have plans to replace WBINVD with CLFLUSHOPT.
>
> Isn't the flush unnecessary when freeing shared memory?  My recollection is that
> the problematic scenario is when encrypted memory is freed back to the host,
> because KVM already flushes when potentially encrypted mapping memory into the
> guest.
>
> With SNP+guest_memfd, private/encrypted memory should be unreachabled via the
> hva-based mmu_notifiers.  gmem should have full control of the page lifecycles,
> i.e. can get the kernel virtual address as appropriated, and so it SNP shouldn't
> need the nuclear option.
>
> E.g. something like this?
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 07756b7348ae..1c6828ae391d 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2328,7 +2328,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
>
>  void sev_guest_memory_reclaimed(struct kvm *kvm)
>  {
> -       if (!sev_guest(kvm))
> +       if (!sev_guest(kvm) || sev_snp_guest(kvm))
>                 return;
>
>         wbinvd_on_all_cpus();

I hope this is the final solution :)

So, short answer: no.

SNP+guest_memfd prevent untrusted host user space from directly
modifying the data, this is good enough for CVE-2022-0171, but there
is no such guarantee that the host kernel in some scenarios could
access the data and generate dirty caches. In fact, AFAIC, SNP VM does
not track whether each page is previously shared, isn't it? If a page
was previously shared and was written by the host kernel or devices
before it was changed to private. No one tracks it and dirty caches
are there!

So, to avoid any corner case situations like the above, it seems
currently we have to retain the property: flushing the cache when the
guest memory mapping leaves KVM NPT.

Of course, this is fundamentally because SME_COHERENT only applies to
CPU cores, but not DMA. If SME_COHERENT is complete, flushing is no
longer needed. Alternatively, we need extra bookkeeping for KVM to
know whether each page has dirty cache lines. Another alternative is
to filter mmu_notifier reasons, which is the part that I am planning
to take. thoughts?

Thanks.
-Mingwei