linux-kernel - Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZdRMwdkz1enYIgBM@yzhao56-desk.sh.intel.com>
Date: Tue, 20 Feb 2024 14:54:57 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, Friedrich Weber <f.weber@...xmox.com>, "Kai
 Huang" <kai.huang@...el.com>, Yuan Yao <yuan.yao@...ux.intel.com>, Xu Yilun
	<yilun.xu@...ux.intel.com>, Yu Zhang <yu.c.zhang@...ux.intel.com>, Chao Peng
	<chao.p.peng@...ux.intel.com>, Fuad Tabba <tabba@...gle.com>, Michael Roth
	<michael.roth@....com>, Isaku Yamahata <isaku.yamahata@...el.com>, "David
 Matlack" <dmatlack@...gle.com>
Subject: Re: [PATCH v4 3/4] KVM: x86/mmu: Move slot checks from
 __kvm_faultin_pfn() to kvm_faultin_pfn()

On Mon, Feb 19, 2024 at 11:44:54AM -0800, Sean Christopherson wrote:
> +Jim
> 
> On Mon, Feb 19, 2024, Yan Zhao wrote:
> > On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> > > @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > >  	smp_rmb();
> > >  
> > > +	if (!slot)
> > > +		goto faultin_pfn;
> > > +
> > > +	/*
> > > +	 * Retry the page fault if the gfn hit a memslot that is being deleted
> > > +	 * or moved.  This ensures any existing SPTEs for the old memslot will
> > > +	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> > > +	 */
> > > +	if (slot->flags & KVM_MEMSLOT_INVALID)
> > > +		return RET_PF_RETRY;
> > > +
> > > +	if (!kvm_is_visible_memslot(slot)) {
> > > +		/* Don't expose KVM's internal memslots to L2. */
> > > +		if (is_guest_mode(vcpu)) {
> > > +			fault->slot = NULL;
> > > +			fault->pfn = KVM_PFN_NOSLOT;
> > > +			fault->map_writable = false;
> > > +			return RET_PF_CONTINUE;
> > Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?
> 
> Oof.  Yes.  But there is a pre-existing bug here too, though it's very theoretical
> and unlikely to ever cause problems.
> 
> If KVM is using TDP, but L1 is using shadow paging for L2, then routing through
> kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO, and create an
> MMIO SPTE.  Creating an MMIO SPTE is ok, but only because kvm_mmu_page_role.guest_mode
> ensure KVM uses different roots for L1 vs. L2.  But mmio_gfn will remain valid,
> and could (quite theoretically) cause KVM to incorrectly treat an L1 access to
> the private TSS or identity mapped page tables as MMIO.
Why would KVM treat L1 access to the private TSS and identity mapped page
tables as MMIO even though mmio_gfn is valid?
It looks that (for Intel platform) EPT for L1 will only install normal SPTEs
(non-MMIO SPTEs) for the two private slots, so there would not have EPT
misconfiguration and would not go to emulation path incorrectly.
Am I missing something?

> Furthermore, this check doesn't actually prevent exposing KVM's internal memslots
> to L2, it simply forces KVM to emulate the access.  In most cases, that will trigger
> MMIO, amusingly due to filling arch.mmio_gfn.  And vcpu_is_mmio_gpa() always
> treats APIC accesses as MMIO, so those are fine.  But the private TSS and identity
> mapped page tables could go either way (MMIO or access the private memslot's backing
> memory).
Yes, this is also my question when reading that part of code.
mmio_gen mismatching may lead to accessing the backing memory directly.

> We could "fix" the issue by forcing MMIO emulation for L2 access to all internal
> memslots, not just to the APIC.  But I think that's actually less correct than
> letting L2 access the private TSS and indentity mapped page tables (not to mention
> that I can't imagine anyone cares what KVM does in this case).  From L1's perspective,
> there is R/W memory at those memslot, the memory just happens to be initialized
> with non-zero data, and I don't see a good argument for hiding that memory from L2.
> Making the memory disappear is far more magical than the memory existing in the
> first place.
> 
> The APIC access page is special because KVM _must_ emulate the access to do the
> right thing.  And despite what commit 3a2936dedd20 ("kvm: mmu: Don't expose private
> memslots to L2") said, it's not just when L1 is accelerating L2's virtual APIC,
> it's just as important (likely *more* imporant* for correctness when L1 is passing
> through its own APIC to L2.
>
> Unless I'm missing someting, I think it makes sense to throw in the below before
> moving code around.
> 
> Ouch, and looking at this further, patch 1 introduced a bug (technically) by caching
> fault->slot; in this case KVM will unnecessarily check mmu_notifiers.  That's
> obviously a very benign bug, as a false positive just means an unnecessary retry,
> but yikes.
>
Patch 3 & 4 removed the bug immediately :)

> --
> Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
>  non-APIC internal slots
> 
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 488f522f09c6..4ce824cec5b9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
>  		return RET_PF_RETRY;
>  
> -	if (!kvm_is_visible_memslot(slot)) {
> -		/* Don't expose private memslots to L2. */
> +	if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> +		/*
> +		 * Don't map L1's APIC access page into L2, KVM doesn't support
> +		 * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
> +		 * i.e. the access needs to be emulated.  Emulating access to
> +		 * L1's APIC is also correct if L1 is accelerating L2's own
> +		 * virtual APIC, but for some reason L1 also maps _L1's_ APIC
> +		 * into L2.  Note, vcpu_is_mmio_gpa() always treats access to
> +		 * the APIC as MMIO.  Allow an MMIO SPTE to be created, as KVM
> +		 * uses different roots for L1 vs. L2, i.e. there is no danger
> +		 * of breaking APICv/AVIC for L1.
> +		 */
>  		if (is_guest_mode(vcpu)) {
>  			fault->slot = NULL;
>  			fault->pfn = KVM_PFN_NOSLOT;
Checking fault->is_private before calling kvm_handle_noslot_fault()?
And do we need a centralized check of fault->is_private in kvm_mmu_do_page_fault()
before returning RET_PF_EMULATE?

> @@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  		 * MMIO SPTE.  That way the cache doesn't need to be purged
>  		 * when the AVIC is re-enabled.
>  		 */
> -		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> -		    !kvm_apicv_activated(vcpu->kvm))
> +		if (!kvm_apicv_activated(vcpu->kvm))
>  			return RET_PF_EMULATE;
Otherwise, here also needs a checking of fault->is_private?
Maybe also for where RET_PF_EMULATE is returned when page_fault_handle_page_track()
is true (though I know it's always false for TDX).

>  	}
>  
> 
> base-commit: ec98c2c1a07fb341ba2230eab9a31065d12d9de6
> --