Message-ID: <aLDQ09FP0uX3eJvC@google.com>
Date: Thu, 28 Aug 2025 14:57:39 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>, 
	Vishal Annapurve <vannapurve@...gle.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Yan Y Zhao <yan.y.zhao@...el.com>, 
	"michael.roth@....com" <michael.roth@....com>, Ira Weiny <ira.weiny@...el.com>
Subject: Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt()
 into its sole caller

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > operations is not a concern.
> 
> Just was my recollection of the discussion. I found it:
> https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/

Ugh, another case where an honest question gets interpreted as "do it this way". :-(

> > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > code is also broken in the sense that there are no cond_resched() calls.  The
> > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > explicit cond_resched(), there's no practical difference between extending the
> > measurement under mmu_lock versus outside of mmu_lock.
> > 
> > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > still do tdh_mem_page_add() under mmu_lock.
> 
> I just did a quick test and we should be on the order of <1 ms per page for the
> full loop. I can try to get some more formal test data if it matters. But that
> doesn't sound too horrible?

1ms is totally reasonable.  I wouldn't bother with any more testing.

> tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> inside it.

Agreed, and it would eliminate the need for a "flags" argument.  But keeping it
in the mmu_lock critical section means KVM can WARN on failures.  If it's moved
out, then zapping S-EPT entries could induce failure, and I don't think it's
worth going through the effort to ensure it's impossible to trigger S-EPT removal.

Note, removing S-EPT entries during initialization of the image isn't something
I want to officially support; rather, it's an endless stream of whack-a-mole due
to obscure edge cases.

Hmm, actually, maybe I take that back.  slots_lock prevents memslot updates,
filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events
shouldn't ever hit S-EPT.  I was worried about kvm_zap_gfn_range(), but the call
from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
so same goes for kvm_noncoherent_dma_assignment_start_or_stop(), and while I'm 99%
certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page
has its own non-guest_memfd memslot and so can't be used for the initial image,
which means that too is mutually exclusive.

So yeah, let's give it a shot.  Worst case scenario we're wrong and TDH_MR_EXTEND
errors can be triggered by userspace.
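
For illustration only, the split being discussed can be modeled in plain
userspace C.  This is a rough sketch, not kernel code and not a proposed patch:
every name below (model_page_add(), model_mr_extend(), the bool standing in for
mmu_lock, the fail_chunk knob) is a hypothetical stand-in, and it assumes the
measurement is extended in 256-byte chunks, i.e. 16 calls per 4KiB page.  The
point it shows is that the page add stays inside the lock, while the extend
loop runs outside it and returns an error to its caller instead of WARNing.

```c
/* Userspace model only.  All names are hypothetical stand-ins. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_CHUNKS_PER_PAGE 16	/* assumed: 4096 / 256-byte chunks */

static bool model_mmu_lock_held;	/* stand-in for mmu_lock */
static int pages_added;
static int chunks_extended;
static int extends_under_lock;		/* counts extends done with lock held */

/* Stand-in for tdh_mem_page_add(); must be called with the lock held. */
static int model_page_add(unsigned long gfn)
{
	(void)gfn;
	assert(model_mmu_lock_held);	/* models lockdep_assert_held() */
	pages_added++;
	return 0;
}

/* Stand-in for tdh_mr_extend() on one chunk; fail_chunk injects an error. */
static int model_mr_extend(unsigned long gfn, int chunk, int fail_chunk)
{
	(void)gfn;
	if (model_mmu_lock_held)
		extends_under_lock++;
	if (chunk == fail_chunk)
		return -1;
	chunks_extended++;
	return 0;
}

/*
 * Add the page under the lock, then extend the measurement outside of it.
 * An extend failure is propagated to the caller rather than WARNed on.
 */
static int model_add_and_measure(unsigned long gfn, int fail_chunk)
{
	int ret, i;

	model_mmu_lock_held = true;
	ret = model_page_add(gfn);
	model_mmu_lock_held = false;
	if (ret)
		return ret;

	for (i = 0; i < MODEL_CHUNKS_PER_PAGE; i++) {
		ret = model_mr_extend(gfn, i, fail_chunk);
		if (ret)
			return ret;
	}
	return 0;
}
```

Calling model_add_and_measure(gfn, -1) models the happy path; passing a chunk
index as fail_chunk models the "worst case" above, where the extend fails and
the error goes back to the caller.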

> But maybe a better reason is that we could better handle errors
> outside the fault. (i.e. no 5 line comment about why not to return an error in
> tdx_mem_page_add() due to code in another file).
> 
> I wonder if Yan can give an analysis of any zapping races if we do that.

As above, I think we're good?