Message-ID: <aFF38Pq71JdLBlqO@google.com>
Date: Tue, 17 Jun 2025 07:13:04 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Amit Shah <Amit.Shah@....com>
Cc: "x86@...nel.org" <x86@...nel.org>, 
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "hpa@...or.com" <hpa@...or.com>, 
	"mingo@...hat.com" <mingo@...hat.com>, "tglx@...utronix.de" <tglx@...utronix.de>, "bp@...en8.de" <bp@...en8.de>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>, 
	"jon@...anix.com" <jon@...anix.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 06/18] KVM: VMX: Wire up Intel MBEC enable/disable logic

On Mon, Jun 16, 2025, Amit Shah wrote:
> On Wed, 2025-05-14 at 05:55 -0700, Sean Christopherson wrote:
> > On Wed, May 14, 2025, Amit Shah wrote:
> > > (btw KVM MMU API question -- from the #NPF, I have the GPA of the L2
> > > guest.  How to go from that guest GPA to look up the NX bit for that
> > > page?  I skimmed and there doesn't seem to be an existing API for it - so
> > > is walking the tables the only solution?)
> > 
> > As above, KVM doesn't manually look up individual bits while handling
> > faults.  The walk of the guest page tables (L1's NPT/EPT for this scenario)
> > performed by FNAME(walk_addr_generic) will gather the effective permissions
> > in walker->pte_access, and check for a permission_fault() after the walk is
> > completed.
> 
> Hm, despite the discussions in the PUCK calls since this email, I have
> this doubt, which may be fairly basic.  To determine whether the exit
> was due to GMET, we have to check the effective U/S and NX bit for the
> address that faulted.  That means we have to walk the L2's page tables
> to get those bits from the L2's PTEs, and then from the error code in
> exitinfo1, confirm why the #NPF happened.  (And even with Paolo's neat
> SMEP hack, the exit reason due to GMET can only be confirmed by looking
> at the guest's U/S and NX bits.)
> 
> And from what I see, currently page table walks only happen on L1's
> page tables, and not on L2's page tables, is that right?

Nit, they aren't _L2's_ page tables, in that (barring crazy paravirt behavior)
L2 does not control the page tables.  In most conversations, that distinction
wouldn't matter, but when talking about which pages KVM walks when running an L2
while L1 is using NPT (or EPT), it's worth being very precise, because KVM may
also need to walk L2's non-nested page tables, i.e. the page tables that map L2
GVAs to L2 GPAs.

The least awful terminology we've come up with when referring to nested TDP is
to follow KVM's VMCS/VMCB terminology when doing nested virtualization:

  npt12: The NPT page tables controlled by L1 to manage L2 GPAs.  These are
         never referenced by hardware.
  npt02: KVM controlled page tables that shadow npt12, and are consumed by hardware.
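To make the naming concrete, here's a toy model with plain arrays standing in
for page tables and indices for GPAs/PAs.  This is purely illustrative, not KVM
code; the point is only that npt02 must encode the composition of npt12 with
KVM's own npt01 (L1 GPA => host PA) tables:

```c
#include <stddef.h>

/*
 * Toy model, not KVM code:
 *   npt12: L1-controlled, maps L2 GPA -> L1 GPA, never walked by hardware.
 *   npt01: KVM-controlled, maps L1 GPA -> host PA.
 *   npt02: KVM-controlled shadow consumed by hardware while running L2;
 *          it must encode the composition npt01(npt12(gpa)).
 */
static void build_npt02(int *npt02, const int *npt12, const int *npt01,
			size_t npages)
{
	for (size_t gpa = 0; gpa < npages; gpa++)
		npt02[gpa] = npt01[npt12[gpa]];
}
```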

> I'm sure I'm missing something here, though..

Heh, yep.  Part of that's my fault for using ambiguous terminology.  When I said
"L1's NPT/EPT" above, what I really meant was npt12.  I.e. this code

  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
  {
	struct guest_walker walker;
	int r;

	WARN_ON_ONCE(fault->is_tdp);

	/*
	 * Look up the guest pte for the faulting address.
	 * If PFEC.RSVD is set, this is a shadow page fault.
	 * The bit needs to be cleared before walking guest page tables.
	 */
	r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
			     fault->error_code & ~PFERR_RSVD_MASK);

	/*
	 * The page is not mapped by the guest.  Let the guest handle it.
	 */
	if (!r) {
		if (!fault->prefetch)
			kvm_inject_emulated_page_fault(vcpu, &walker.fault);  <===== GMET #NPF

		return RET_PF_RETRY;
	}

which leads to the aforementioned FNAME(walk_addr_generic) and walker->pte_access
behavior, is walking npt12.  Because the #NPF will have occurred while running
L2, and by virtue of it being an #NPF (as opposed to a "legacy" #PF), KVM knows
the fault is in the context of npt02.

Before doing anything with respect to npt12, KVM needs to do walk_addr() on _npt12_
to determine whether the access is allowed by npt12.  E.g. the simplest scenario
to grok is if L2 accesses an (L2) GPA that isn't mapped by npt12, in which case
KVM needs to inject a #NPF into L1.

Same thing here.  On a PRESENT+FETCH+USER fault, if the effective protections
in npt12 have U/S=1 and GMET is enabled, then KVM needs to inject a #NPF into
L1.  
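Roughly, the ordering looks like this (hypothetical names and bit definitions,
not KVM's actual helpers or PFERR_* masks; walker_mapped/walker_user stand in
for the FNAME(walk_addr) results on npt12):

```c
#include <stdbool.h>

/* Illustrative error-code bits; KVM's real definitions are PFERR_*_MASK. */
#define EC_PRESENT	(1u << 0)
#define EC_USER		(1u << 2)
#define EC_FETCH	(1u << 4)

enum npf_action { NPF_FIX_NPT02, NPF_INJECT_TO_L1 };

/*
 * Sketch only: on an #NPF taken while running L2, consult npt12 (via the
 * guest walker) before touching npt02.
 */
static enum npf_action handle_l2_npf(unsigned int error_code,
				     bool walker_mapped, bool walker_user,
				     bool gmet_enabled)
{
	/* L2 GPA not mapped by npt12 => L1's fault to handle, inject #NPF. */
	if (!walker_mapped)
		return NPF_INJECT_TO_L1;

	/*
	 * PRESENT+FETCH+USER fault, effective U/S=1 in npt12, and GMET
	 * enabled => GMET violation from L1's perspective, inject #NPF.
	 */
	if (gmet_enabled && walker_user &&
	    (error_code & (EC_PRESENT | EC_FETCH | EC_USER)) ==
	    (EC_PRESENT | EC_FETCH | EC_USER))
		return NPF_INJECT_TO_L1;

	/* Otherwise npt12 allows the access; the fault is npt02's to fix. */
	return NPF_FIX_NPT02;
}
```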

Side topic, someone should check with the AMD architects as to whether or not
GMET depends on EFER.NXE=1.  The APM says that all NPT mappings are executable
if EFER.NXE=0 in the host (where the "host" is L1 when dealing with nested NPT).
To me, that implies GMET is effectively ignored if EFER.NXE=0.

  Similarly, if the EFER.NXE bit is cleared for the host, all nested page table
  mappings are executable at the underlying nested level.
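I.e. my reading of that sentence, written out as a predicate (again, purely
hypothetical helper, not a kernel API):

```c
#include <stdbool.h>

/*
 * Encodes the reading above: if L1's EFER.NXE=0, all nested mappings are
 * executable per the APM, so a GMET fetch restriction can never fire.
 * Illustrative only; whether GMET actually depends on EFER.NXE is the open
 * question for the AMD architects.
 */
static bool npt12_fetch_allowed(bool l1_efer_nxe, bool gmet_enabled,
				bool pte_user, bool is_fetch)
{
	if (!l1_efer_nxe)
		return true;	/* APM: all nested mappings executable */

	if (gmet_enabled && is_fetch && pte_user)
		return false;	/* GMET: fetch from a U/S=1 page faults */

	return true;
}
```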
