linux-kernel - Re: [PATCH v3 05/16] KVM: selftests: Stop setting AD bits on nested EPTs on creation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xjyrcutcnv2vrctv2zl5unn5bxn266kab72rfog4o43ox5vrax@xk7tip2az3ru>
Date: Tue, 23 Dec 2025 23:35:22 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 05/16] KVM: selftests: Stop setting AD bits on nested
 EPTs on creation

On Tue, Dec 23, 2025 at 02:26:20PM -0800, Sean Christopherson wrote:
> On Thu, Nov 27, 2025, Yosry Ahmed wrote:
> > When new nested EPTs are created, the AD bits are set. This was
> > introduced by commit 094444204570 ("selftests: kvm: add test for dirty
> > logging inside nested guests"), which introduced vmx_dirty_log_test.
> > 
> > It's unclear why that was needed at the time, but regardless, the test
> > seems to pass without them so probably no longer needed.
> > dirty_log_perf_test (with -n to run in L2) also passes, and these are
> > the only tests currently using nested EPT mappings.
> 
> Please, please don't take the approach of "beat on it until it passes".  Yes,
> Paolo's changelog and comment from 094444204570 are awful, and yes, figuring out
> what Paolo _likely_ intended requires a lot of guesswork and esoteric shadow MMU
> knowledge, but _at best_, modifying code without having at least a plausible
> explanation only adds to the confusion, and at worst is actively dangerous.
> 
> As you've discovered a few times now, there is plenty of code in KVM and its
> tests that elicit a WTF?!!? response from pretty much everyone, i.e. odds are
> good you'll run into something along these lines again, sooner or later.  If/when
> that happens, and you can't unravel the mystery, please send a mail with a question
> instead of sending a patch that papers over the issue.  E.g. casting "raise dead"
> on the original thread is totally acceptable (and probably even encouraged).
> That _might_ be a slower and/or more painful approach, but it's generally safer,
> e.g. it's all too easy for a maintainer to speed read and apply a seemingly
> uninteresting patch like this.

That's fair, I am used to sending a patch that potentially does the
wrong thing than a question, because people tend to be more responsive
to wrong patches :) 

That being said, I should have made this much more obvious, like add RFC
or DO NOT MERGE to the patch title, rather than subtly burying it in the
commit log.

> 
> In this case, I strongly suspect that what Paolo was _trying_ to do was coerce
> KVM into creating a writable SPTE in response to the initial READ.  I.e. in the
> vmx_dirty_log_test's L2 code, setting the Dirty bit in the guest stage-2 PTE is
> necessary to get KVM to install a writable SPTE on the READ_ONCE():
> 
>   static void l2_guest_code(u64 *a, u64 *b)
>   {
> 	READ_ONCE(*a);
> 	WRITE_ONCE(*a, 1);
> 	GUEST_SYNC(true);
> 	GUEST_SYNC(false);
>  
> 	...
>   }
> 
> When handling a read fault in FNAME(walk_addr_generic)(), KVM adjusts the guest
> PTE access protections via FNAME(protect_clean_gpte):
> 
> 	if (!write_fault)
> 		FNAME(protect_clean_gpte)(mmu, &walker->pte_access, pte);
> 	else
> 		/*
> 		 * On a write fault, fold the dirty bit into accessed_dirty.
> 		 * For modes without A/D bits support accessed_dirty will be
> 		 * always clear.
> 		 */
> 		accessed_dirty &= pte >>
> 			(PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
> 
> 
> where FNAME(protect_clean_gpte) is:
> 
> 	unsigned mask;
> 
> 	/* dirty bit is not supported, so no need to track it */
> 	if (!PT_HAVE_ACCESSED_DIRTY(mmu))
> 		return;
> 
> 	BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
> 
> 	mask = (unsigned)~ACC_WRITE_MASK;
> 	/* Allow write access to dirty gptes */
> 	mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
> 		PT_WRITABLE_MASK;
> 	*access &= mask;
> 
> The idea is that KVM can elide a VM-Exit on a READ=>WRITE sequence by making the
> SPTE writable on the READ fault if the guest PTEs allow the page to be written.
> But KVM can only do that if the guest Dirty bit is '1'; if the Dirty bit is '0',
> then KVM needs to intercept the next write in order to emulate the Dirty assist.
> 
> I emphasized "trying" above because, in the context of the dirty logging test,
> simply establishing a writable SPTE on the READ_ONCE() doesn't meaningfully improve
> coverage without checking KVM's behavior after the READ_ONCE().  E.g. KVM marks
> GFNs as dirty (in the dirty bitmap) when creating writable SPTEs, and so doing
> READ+WRITE and _then_ checking the dirty bitmap would hide the KVM PML bug that
> the test was written to find.
> 
> The second part of the L2 guest code:
> 
> 	WRITE_ONCE(*b, 1);
> 	GUEST_SYNC(true);
> 	WRITE_ONCE(*b, 1);
> 	GUEST_SYNC(true);
> 	GUEST_SYNC(false);
> 
> _does_ trigger and detect the KVM bug, but with a WRITE=>CHECK=>WRITE=>CHECK
> sequence instead of READ=>CHECK=>WRITE=>CHECK.  I.e. even if there was a false
> pass in the first phase, as written the second phase will detect the bug,
> assuming the bug affects WRITE=>WRITE and READ=>WRITE equally.  Which isn't a
> great assumption, but it was the case for the PML bug.
> 
> All that said, for this patch, I think it makes sense to drop the A/D code from
> the common APIs, because (a) it will be trivially easy to have the test set them
> as needed once the common APIs are used for TDP mappings, and (b) temporarily
> dropping the code doesn't affect the test coverage in practice.

Thanks a lot for digging and finding all of this. I agree that it's
better to add the coverage properly on top of the series, and have
a separate test case for initializing the PTEs with the dirty bit for
KVM to create a writeable PTE on READ.

> 
> --
> Stop setting Accessed/Dirty bits when creating EPT entries for L2 so that
> the stage-1 and stage-2 (a.k.a. TDP) page table APIs can use common code
> without bleeding the EPT hack into the common APIs.
> 
> While commit 094444204570 ("selftests: kvm: add test for dirty logging
> inside nested guests") is _very_ light on details, the most likely
> explanation is that vmx_dirty_log_test was attempting to avoid taking an
> EPT Violation on the first _write_ from L2.
> 
>   static void l2_guest_code(u64 *a, u64 *b)
>   {
> 	READ_ONCE(*a);
> 	WRITE_ONCE(*a, 1);   <===
> 	GUEST_SYNC(true);
> 
> 	...
>   }
> 
> When handling read faults in the shadow MMU, KVM opportunistically creates
> a writable SPTE if the mapping can be writable *and* the gPTE is dirty (or
> doesn't support the Dirty bit), i.e. if KVM doesn't need to intercept
> writes in order to emulate Dirty-bit updates.  By setting A/D bits in the
> test's EPT entries, the above READ+WRITE will fault only on the read, and
> in theory expose the bug fixed by KVM commit 1f4e5fc83a42 ("KVM: x86: fix
> nested guest live migration with PML").  If the Dirty bit is NOT set, the
> test will get a false pass due; though again, in theory.
> 
> However, the test is flawed (and always was, at least in the versions
> posted publicly), as KVM (correctly) marks the corresponding L1 GFN as
> dirty (in the dirty bitmap) when creating the writable SPTE.  I.e. without
> a check on the dirty bitmap after the READ_ONCE(), the check after the
> first WRITE_ONCE() will get a false pass due to the dirty bitmap/log having
> been updated by the read fault, not by PML.
> 
> Furthermore, the subsequent behavior in the test's l2_guest_code()
> effectively hides the flawed test behavior, as the straight writes to a
> new L2 GPA fault also trigger the KVM bug, and so the test will still
> detect the failure due to lack of isolation between the two testcases
> (Read=>Write vs. Write=>Write).
> 
> 	WRITE_ONCE(*b, 1);
> 	GUEST_SYNC(true);
> 	WRITE_ONCE(*b, 1);
> 	GUEST_SYNC(true);
> 	GUEST_SYNC(false);
> 
> Punt on fixing vmx_dirty_log_test for the moment as it will be easier to
> properly fix the test once the TDP code uses the common MMU APIs, at which
> point it will be trivially easy for the test to retrieve the EPT PTE and
> set the Dirty bit as needed.
> --

Looks good to me, a significant improvement over what I had :')