Message-ID: <CAJD7tkZLr7NUcO3JFRQQaB09XH8M+pQN2ALta2dTf7+xEZZFGQ@mail.gmail.com>
Date: Wed, 22 Jan 2025 11:15:32 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Jim Mattson <jmattson@...gle.com>, Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: nVMX: Always use TLB_FLUSH_GUEST for nested VM-Enter/VM-Exit
On Fri, Jan 17, 2025 at 4:03 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Fri, Jan 17, 2025, Yosry Ahmed wrote:
> > On Fri, Jan 17, 2025 at 10:01 AM Sean Christopherson <seanjc@...gle.com> wrote:
> > > Yep. I suspect the issue is lack of documentation for TLB_FLUSH_GUEST and
> > > TLB_FLUSH_CURRENT. I'm not entirely sure where it would be best to document
> > > them. I guess maybe where they are #defined?
> >
> > I guess at the #define we can just mention that they result in calling
> > kvm_vcpu_flush_tlb_{guest/current}() before entering the guest, if
> > anything.
>
> Yeah, a "See xx for details" redirect is probably the best option.
>
> > The specific documentation about what they do could be above the
> > functions themselves, and describing the potential MMU sync is
> > naturally part of documenting kvm_vcpu_flush_tlb_guest() (kinda
> > already there).
> >
> > The flush_tlb_guest() callback is documented in kvm_host.h, but not
> > flush_tlb_current(). I was going to suggest just documenting that. But
> > kvm_vcpu_flush_tlb_guest() does not only call flush_tlb_guest(), but
> > it also potentially synchronizes the MMU. So only documenting the
> > callbacks does not paint a full picture.
> >
> > FTR, I initially confused myself because all kvm_vcpu_flush_tlb_*()
> > functions are more-or-less thin wrappers around the per-vendor
> > callbacks -- except kvm_vcpu_flush_tlb_guest().
> >
> > >
> > > TLB_FLUSH_GUEST is used when a flush of the guest's TLB, from the guest's
> > > perspective, is architecturally required. The one oddity with TLB_FLUSH_GUEST
> > > is that it does NOT include guest-physical mappings, i.e. TLB entries that are
> > > associated with an EPT root.
> >
> > The way I think about this is how it's documented above the per-vendor
> > callback. It flushes translations created by the guest. The guest does
> > not (directly) create guest-physical translations, only linear and
> > combined translations.
>
> That's not accurate either. When L1 is using nested TDP, it does create guest-
> physical translations.
Ah yes, I missed that.
> The lack of any form of handling in TLB_FLUSH_GUEST is
> a reflection of two things: EPT is weird, and nested SVM doesn't yet support
> precise flushing on transitions, i.e. nested NPT handling is missing because KVM
> unconditionally flushes and synchronizes.
Even for nested NPT, we shouldn't flush/synchronize the guest-physical
mappings unless L1 specifies that in VMCB12. So I suspect
TLB_FLUSH_GUEST will remain ignorant of nested guest-physical mappings
anyway.
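
Roughly what I have in mind for the nSVM side (pure sketch, nothing here
is against a real tree; the helper name and how vmcb12's TLB_CONTROL
value gets plumbed/cached are made up):

static void nested_svm_handle_vmcb12_tlb_ctl(struct kvm_vcpu *vcpu,
					     u32 tlb_ctl)
{
	/* L1 didn't ask for a flush, leave everything alone. */
	if (tlb_ctl == TLB_CONTROL_DO_NOTHING)
		return;

	/* Flush the translations created by L2 (linear/combined)... */
	kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);

	/*
	 * ...and drop the shadowed nested NPT roots so the guest-physical
	 * mappings get rebuilt, but only because L1 explicitly requested a
	 * flush in vmcb12.
	 */
	kvm_mmu_free_guest_mode_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
}

How the individual TLB_CONTROL encodings should map onto this is exactly
what the PoC needs to figure out, so take the above with a grain of salt.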
>
> EPT is "weird" because the _only_ time guest-physical translations are flushed
> is when the "wrong" KVM MMU is loaded. The only way to flush guest-physical
> translations (short of RESET :-D) is via INVEPT, and INVEPT is a root-only (VMX
> terminology) instruction, i.e. can only be executed by L1. And because L1 can't
> itself be using EPT[*], INVEPT can never target/flush the current context.
I think you meant L0 here.
>
> Furthermore, INVEPT isn't strictly tied to a VMCS, e.g. deferring the emulated
> flush until the next time KVM runs a vmcs12 isn't viable. Rather than add
> dedicated tracking, KVM simply unloads the roots and lets the normal root
> "allocation" handle the flush+sync the next time the vCPU uses the associated MMU.
I think you are referring to what you mentioned to me offline: INVEPT can
affect more than one cached root in KVM's MMU, not just the root loaded
while running vmcs12, which is why we cannot defer synchronizing or
unloading the roots.
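
For reference, this is roughly what handle_invept() does today
(paraphrased from memory, so the exact code may differ): every cached
root that matches the INVEPT operand is freed on the spot, precisely
because the flush is not limited to whatever root happens to be loaded:

	struct kvm_mmu *mmu = &vcpu->arch.guest_mmu;
	unsigned long roots_to_free = 0;
	int i;

	switch (type) {
	case VMX_EPT_EXTENT_CONTEXT:
		/* Single-context: free every cached root for this EPTP. */
		if (nested_ept_root_matches(mmu->root.hpa, mmu->root.pgd,
					    operand.eptp))
			roots_to_free |= KVM_MMU_ROOT_CURRENT;

		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
			if (nested_ept_root_matches(mmu->prev_roots[i].hpa,
						    mmu->prev_roots[i].pgd,
						    operand.eptp))
				roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
		break;
	case VMX_EPT_EXTENT_GLOBAL:
		/* Global: free all cached roots. */
		roots_to_free = KVM_MMU_ROOTS_ALL;
		break;
	}

	if (roots_to_free)
		kvm_mmu_free_roots(vcpu->kvm, mmu, roots_to_free);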
>
> Nested NPT is different, as there is no INVNPT. Instead, there's the ASID itself
> and a flushing control, both of which are properties of the VMCB. As a result,
> NPT TLB flushes that are initiated by a hypervisor always take effect at VMRUN,
> e.g. by bumping the ASID, or via the dedicated flushing control.
>
> So when proper handling of TLB flushing on nested SVM transitions comes along, I
> do expect that either kvm_vcpu_flush_tlb_guest() will grow, or maybe we'll add
> yet another TLB_FLUSH_XXX flavor :-)
I think kvm_vcpu_flush_tlb_guest() should remain ignorant of nested
guest-physical translations, because it's used by many operations that
flush the TLB on behalf of the guest, none of which should flush those
translations. Anyway, I am trying to come up with a PoC for the "proper
handling", so we'll see how that turns out :)
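
For context, the function today is roughly (paraphrased from memory, not
exact):

static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.tlb_flush;

	if (!tdp_enabled) {
		/*
		 * With shadow paging, the guest's page tables are shadowed
		 * by KVM, so an architectural guest TLB flush also has to
		 * resync the shadow roots.
		 */
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_sync_prev_roots(vcpu);
	}

	/* Vendor callback (exact call spelling depends on the tree). */
	static_call(kvm_x86_flush_tlb_guest)(vcpu);
}

i.e. the only MMU it ever touches is the shadow MMU for legacy paging;
nested guest-physical mappings are (and IMO should stay) out of its scope.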
>
> One thing that could be helpful would be to document that KVM doesn't use
> TLB_FLUSH_GUEST to handle INVEPT, and so there's no need to sync nested TDP MMUs.
I think what to document may become clearer when the nSVM
implementation comes along.
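FWIW, a strawman for the comment at the #defines (wording very much up
for debate, and based purely on what's in this thread):

/*
 * KVM_REQ_TLB_FLUSH_GUEST: an architectural flush of the guest's TLB is
 * required, i.e. flush translations created by the guest (linear and
 * combined mappings).  Does NOT cover guest-physical (EPT) mappings;
 * INVEPT is instead handled by unloading/syncing the affected roots.
 * See kvm_vcpu_flush_tlb_guest().
 *
 * KVM_REQ_TLB_FLUSH_CURRENT: flush translations associated with the
 * currently loaded context (e.g. EPTP/VPID), typically because KVM
 * itself changed something about that context.  See
 * kvm_vcpu_flush_tlb_current().
 */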
>
> [*] Even in a deprivileged scenario like pKVM, the guest kernel would become L2
> from KVM's perspective.