linux-kernel - Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <68b0f5c4e6716_293b32946@iweiny-mobl.notmuch>
Date: Thu, 28 Aug 2025 19:35:16 -0500
From: Ira Weiny <ira.weiny@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Ira Weiny <ira.weiny@...el.com>
CC: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>, "Vishal
 Annapurve" <vannapurve@...gle.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, Yan Y Zhao <yan.y.zhao@...el.com>,
	"michael.roth@....com" <michael.roth@....com>
Subject: Re: [RFC PATCH 09/12] KVM: TDX: Fold
 tdx_mem_page_record_premap_cnt() into its sole caller

Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Ira Weiny wrote:
> > Edgecombe, Rick P wrote:
> > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > > > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > > > operations is not a concern.
> > > 
> > > Just was my recollection of the discussion. I found it:
> > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> > > 
> > > > 
> > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > > > code is also broken in the sense that there are no cond_resched() calls.  The
> > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > > > explicit cond_resched(), there's no practical difference between extending the
> > > > measurement under mmu_lock versus outside of mmu_lock.
> > > > 
> > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > > > still do tdh_mem_page_add() under mmu_lock.
> > > 
> > > I just did a quick test and we should be on the order of <1 ms per page for the
> > > full loop. I can try to get some more formal test data if it matters. But that
> > > doesn't sound too horrible?
> > > 
> > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> > > inside it.
> > 
> > I'm probably not following this conversation, so stupid question:  It
> > doesn't need to be in the lock because user space should not be setting up
> > memory and extending the measurement in an asynchronous way.  Is that
> > correct?
> 
> No, from userspace's perspective ADD+MEASURE is fully serialized.  ADD "needs"
> to be under mmu_lock to guarantee consistency between the mirror EPT and the
> "real" S-EPT entries.  E.g. if ADD is done after the fact, then KVM can end up
> with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT.
> That causes a pile of problems because it breaks KVM's fundamental assumption
> that M-EPT and S-EPT entries updated in lock-step.

Ok yes, I think I worded my query incorrectly but this makes things clear.

Thanks!

> 
> TDH_MR_EXTEND doesn't have the same same consistency issue.  If it fails, the
> only thing that's left in a bad state is the measurement.  That's obviously not
> ideal either, but we can handle that by forcefully terminating the VM, without
> opening up KVM to edge cases that would otherwise be impossible.
> 
> > > But maybe a better reason is that we could better handle errors
> > > outside the fault. (i.e. no 5 line comment about why not to return an error in
> > > tdx_mem_page_add() due to code in another file).
> > > 
> > > I wonder if Yan can give an analysis of any zapping races if we do that.
> > 
> > When you say analysis, you mean detecting user space did something wrong
> > and failing gracefully?  Is that correct?
> 
> More specifically, whether or not KVM can WARN without the WARN being user
> triggerable.  Kernel policy is that WARNs must not be triggerable absent kernel,
> hardware, or firmware bugs.  What we're trying to figure out is if there's a
> flow that can be triggered by userspace (misbehving or not) that would trip a
> WARN even if KVM is operating as expected.  I'm pretty sure the answer is "no".
> 
> Oh, and WARNing here is desirable, because it improves the chances of detecting
> a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module.

OK...  In other areas of the kernel if the user misbehaves it is
reasonable to fail an operation.  I would think that being fatal to the VM
would be fine if QEMU did not properly synchronize ADD, measurement, and
finalize, for example.  Am I wrong in that assumption?

Ira