linux-kernel - Re: [PATCH v3 05/16] KVM: selftests: Stop setting AD bits on nested EPTs on creation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aUsXDKeYorxt7VMM@google.com>
Date: Tue, 23 Dec 2025 14:26:20 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 05/16] KVM: selftests: Stop setting AD bits on nested
 EPTs on creation

On Thu, Nov 27, 2025, Yosry Ahmed wrote:
> When new nested EPTs are created, the AD bits are set. This was
> introduced by commit 094444204570 ("selftests: kvm: add test for dirty
> logging inside nested guests"), which introduced vmx_dirty_log_test.
> 
> It's unclear why that was needed at the time, but regardless, the test
> seems to pass without them so probably no longer needed.
> dirty_log_perf_test (with -n to run in L2) also passes, and these are
> the only tests currently using nested EPT mappings.

Please, please don't take the approach of "beat on it until it passes".  Yes,
Paolo's changelog and comment from 094444204570 are awful, and yes, figuring out
what Paolo _likely_ intended requires a lot of guesswork and esoteric shadow MMU
knowledge, but _at best_, modifying code without having at least a plausible
explanation only adds to the confusion, and at worst is actively dangerous.

As you've discovered a few times now, there is plenty of code in KVM and its
tests that elicit a WTF?!!? response from pretty much everyone, i.e. odds are
good you'll run into something along these lines again, sooner or later.  If/when
that happens, and you can't unravel the mystery, please send a mail with a question
instead of sending a patch that papers over the issue.  E.g. casting "raise dead"
on the original thread is totally acceptable (and probably even encouraged).
That _might_ be a slower and/or more painful approach, but it's generally safer,
e.g. it's all too easy for a maintainer to speed read and apply a seemingly
uninteresting patch like this.

In this case, I strongly suspect that what Paolo was _trying_ to do was coerce
KVM into creating a writable SPTE in response to the initial READ.  I.e. in the
vmx_dirty_log_test's L2 code, setting the Dirty bit in the guest stage-2 PTE is
necessary to get KVM to install a writable SPTE on the READ_ONCE():

  static void l2_guest_code(u64 *a, u64 *b)
  {
	READ_ONCE(*a);
	WRITE_ONCE(*a, 1);
	GUEST_SYNC(true);
	GUEST_SYNC(false);
 
	...
  }

When handling a read fault in FNAME(walk_addr_generic)(), KVM adjusts the guest
PTE access protections via FNAME(protect_clean_gpte):

	if (!write_fault)
		FNAME(protect_clean_gpte)(mmu, &walker->pte_access, pte);
	else
		/*
		 * On a write fault, fold the dirty bit into accessed_dirty.
		 * For modes without A/D bits support accessed_dirty will be
		 * always clear.
		 */
		accessed_dirty &= pte >>
			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);


where FNAME(protect_clean_gpte) is:

	unsigned mask;

	/* dirty bit is not supported, so no need to track it */
	if (!PT_HAVE_ACCESSED_DIRTY(mmu))
		return;

	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);

	mask = (unsigned)~ACC_WRITE_MASK;
	/* Allow write access to dirty gptes */
	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
		PT_WRITABLE_MASK;
	*access &= mask;

The idea is that KVM can elide a VM-Exit on a READ=>WRITE sequence by making the
SPTE writable on the READ fault if the guest PTEs allow the page to be written.
But KVM can only do that if the guest Dirty bit is '1'; if the Dirty bit is '0',
then KVM needs to intercept the next write in order to emulate the Dirty assist.

I emphasized "trying" above because, in the context of the dirty logging test,
simply establishing a writable SPTE on the READ_ONCE() doesn't meaningfully improve
coverage without checking KVM's behavior after the READ_ONCE().  E.g. KVM marks
GFNs as dirty (in the dirty bitmap) when creating writable SPTEs, and so doing
READ+WRITE and _then_ checking the dirty bitmap would hide the KVM PML bug that
the test was written to find.

The second part of the L2 guest code:

	WRITE_ONCE(*b, 1);
	GUEST_SYNC(true);
	WRITE_ONCE(*b, 1);
	GUEST_SYNC(true);
	GUEST_SYNC(false);

_does_ trigger and detect the KVM bug, but with a WRITE=>CHECK=>WRITE=>CHECK
sequence instead of READ=>CHECK=>WRITE=>CHECK.  I.e. even if there was a false
pass in the first phase, as written the second phase will detect the bug,
assuming the bug affects WRITE=>WRITE and READ=>WRITE equally.  Which isn't a
great assumption, but it was the case for the PML bug.

All that said, for this patch, I think it makes sense to drop the A/D code from
the common APIs, because (a) it will be trivially easy to have the test set them
as needed once the common APIs are used for TDP mappings, and (b) temporarily
dropping the code doesn't affect the test coverage in practice.

--
Stop setting Accessed/Dirty bits when creating EPT entries for L2 so that
the stage-1 and stage-2 (a.k.a. TDP) page table APIs can use common code
without bleeding the EPT hack into the common APIs.

While commit 094444204570 ("selftests: kvm: add test for dirty logging
inside nested guests") is _very_ light on details, the most likely
explanation is that vmx_dirty_log_test was attempting to avoid taking an
EPT Violation on the first _write_ from L2.

  static void l2_guest_code(u64 *a, u64 *b)
  {
	READ_ONCE(*a);
	WRITE_ONCE(*a, 1);   <===
	GUEST_SYNC(true);

	...
  }

When handling read faults in the shadow MMU, KVM opportunistically creates
a writable SPTE if the mapping can be writable *and* the gPTE is dirty (or
doesn't support the Dirty bit), i.e. if KVM doesn't need to intercept
writes in order to emulate Dirty-bit updates.  By setting A/D bits in the
test's EPT entries, the above READ+WRITE will fault only on the read, and
in theory expose the bug fixed by KVM commit 1f4e5fc83a42 ("KVM: x86: fix
nested guest live migration with PML").  If the Dirty bit is NOT set, the
test will get a false pass due; though again, in theory.

However, the test is flawed (and always was, at least in the versions
posted publicly), as KVM (correctly) marks the corresponding L1 GFN as
dirty (in the dirty bitmap) when creating the writable SPTE.  I.e. without
a check on the dirty bitmap after the READ_ONCE(), the check after the
first WRITE_ONCE() will get a false pass due to the dirty bitmap/log having
been updated by the read fault, not by PML.

Furthermore, the subsequent behavior in the test's l2_guest_code()
effectively hides the flawed test behavior, as the straight writes to a
new L2 GPA fault also trigger the KVM bug, and so the test will still
detect the failure due to lack of isolation between the two testcases
(Read=>Write vs. Write=>Write).

	WRITE_ONCE(*b, 1);
	GUEST_SYNC(true);
	WRITE_ONCE(*b, 1);
	GUEST_SYNC(true);
	GUEST_SYNC(false);

Punt on fixing vmx_dirty_log_test for the moment as it will be easier to
properly fix the test once the TDP code uses the common MMU APIs, at which
point it will be trivially easy for the test to retrieve the EPT PTE and
set the Dirty bit as needed.
--