linux-kernel - Re: [PATCH] KVM: x86/mmu: Allow per VM kvm_mmu_max

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Zpb9Vwcmp4T-0ufJ@google.com>
Date: Tue, 16 Jul 2024 16:08:07 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: Isaku Yamahata <isaku.yamahata@...el.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>
Subject: Re: [PATCH] KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn()

On Tue, Jul 16, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-07-16 at 12:31 -0700, Sean Christopherson wrote:
> > 
> > No, it most definitely is not more correct.  There is absolutely no issue
> > zapping
> > SPTEs that should never exist.  In fact, restricting the zapping path is far
> > more
> > likely to *cause* correctness issues, e.g. see 
> > 
> >   524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all
> > SPTEs")
> >   86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed
> > host.MAXPHYADDR")
> 
> The type of correctness this was going for was around the new treatment of GFNs
> not having the shared/alias bit. As you know it can get confusing which
> variables have these bits and which have them stripped. Part of the recent MMU
> work involved making sure at least GFN's didn't contain the shared bit.

That's fine, and doesn't conflict with what I'm asserting, which is that it's
a-ok to process SPTEs a range of GFNs that, barring KVM bugs, should never have
"valid" entries.

> Then in TDP MMU where it iterates from start to end, for example:
> static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> 			      gfn_t start, gfn_t end, bool can_yield, bool
> flush)
> {
> 	struct tdp_iter iter;
> 
> 	end = min(end, tdp_mmu_max_gfn_exclusive());
> 
> 	lockdep_assert_held_write(&kvm->mmu_lock);
> 
> 	rcu_read_lock();
> 
> 	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
> 		if (can_yield &&
> 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> 			flush = false;
> 			continue;
> 		}
> ...
> 
> The math gets a bit confused. For the private/mirror root, start will begin at
> 0, and iterate to a range that includes the shared bit. No functional problem
> because we are zapping things that shouldn't be set. But it means the 'gfn' has
> the bit position of the shared bit set. Although it is not acting as the shared
> bit in this case, just an out of range bit.
>
> For the shared/direct root, it will iterate from (shared_bit | 0) to (shared_bit
> | max_gfn). So where the mirror root iterates through the whole range, the
> shared case skips it in the current code anyway.
> 
> And then the fact that the code already takes care here to avoid zapping over
> ranges that exceed the max gfn.
> 
> So it's a bit asymmetric, and just overall weird. We are weighing functional
> correctness risk with known code weirdness.

IMO, you're looking at it with too much of a TDX lens and not thinking about all
the people that don't care about TDX, which is the majority of KVM developers.

The unaliased GFN is definitely not the max GFN of all the VM's MMUs, since the
shared EPT must be able to process GPAs with bits set above the "max" GFN.  And
to me, _that's_ far more weird than saying that "S-EPT MMUs never set the shared
bit, and shared EPT MMUs never clear the shared bit".  I'm guessing the S-EPT
support ORs in the shared bit, but it's still a GFN.

If you were adding a per-MMU max GFN, then I'd buy that it legitimately is the max
GFN, but why not have a full GFN range for the MMU?  E.g.

  static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
			         bool shared, int zap_level)
  {
	struct tdp_iter iter;

	gfn_t end = tdp_mmu_max_gfn_exclusive(root);
	gfn_t start = tdp_mmu_min_gfn_inclusive(root);

and then have the helpers incorporated the S-EPT vs. EPT information.  That gets
us optimized, precise zapping without needing to muddy the waters by tracking a
per-VM "max" GFN that is only kinda sorta the max if you close your eyes and don't
think too hard about the shared MMU usage.

> My inclination was to try to reduce the places where TDX MMU needs paths
> happen to work for subtle reasons for the cost of the VM field. 

But it doesn't happen to work for subtle reasons.  It works because it has to
work.  Processing !PRESENT SPTEs should always work, regardless of why KVM can
guarantee there are no SPTEs in a given GFN range.