Message-ID: <68b0f5c4e6716_293b32946@iweiny-mobl.notmuch>
Date: Thu, 28 Aug 2025 19:35:16 -0500
From: Ira Weiny <ira.weiny@...el.com>
To: Sean Christopherson <seanjc@...gle.com>, Ira Weiny <ira.weiny@...el.com>
CC: Rick P Edgecombe <rick.p.edgecombe@...el.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "pbonzini@...hat.com" <pbonzini@...hat.com>, "Vishal
Annapurve" <vannapurve@...gle.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, Yan Y Zhao <yan.y.zhao@...el.com>,
"michael.roth@....com" <michael.roth@....com>
Subject: Re: [RFC PATCH 09/12] KVM: TDX: Fold
 tdx_mem_page_record_premap_cnt() into its sole caller

Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Ira Weiny wrote:
> > Edgecombe, Rick P wrote:
> > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > > > Me confused. This is pre-boot, not the normal fault path, i.e. blocking other
> > > > operations is not a concern.
> > >
> > > Just was my recollection of the discussion. I found it:
> > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> > >
> > > >
> > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > > > code is also broken in the sense that there are no cond_resched() calls. The
> > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > > > explicit cond_resched(), there's no practical difference between extending the
> > > > measurement under mmu_lock versus outside of mmu_lock.
> > > >
> > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > > > still do tdh_mem_page_add() under mmu_lock.
> > >
> > > I just did a quick test and we should be on the order of <1 ms per page for the
> > > full loop. I can try to get some more formal test data if it matters. But that
> > > doesn't sound too horrible?
> > >
> > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> > > inside it.
> >
> > I'm probably not following this conversation, so stupid question: It
> > doesn't need to be in the lock because user space should not be setting up
> > memory and extending the measurement in an asynchronous way. Is that
> > correct?
>
> No, from userspace's perspective ADD+MEASURE is fully serialized. ADD "needs"
> to be under mmu_lock to guarantee consistency between the mirror EPT and the
> "real" S-EPT entries. E.g. if ADD is done after the fact, then KVM can end up
> with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT.
> That causes a pile of problems because it breaks KVM's fundamental assumption
> that M-EPT and S-EPT entries are updated in lock-step.

Ok yes, I think I worded my query incorrectly but this makes things clear.
Thanks!

>
> TDH_MR_EXTEND doesn't have the same consistency issue. If it fails, the
> only thing that's left in a bad state is the measurement. That's obviously not
> ideal either, but we can handle that by forcefully terminating the VM, without
> opening up KVM to edge cases that would otherwise be impossible.
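
To make the ordering concrete, here is a minimal userspace sketch in C of the
split described above: ADD and the mirror update happen together under a lock,
EXTEND runs after the lock is dropped, and an EXTEND failure only marks the VM
as dead. This is a toy model under stated assumptions, not KVM code. The
pthread mutex stands in for mmu_lock, the two bool arrays stand in for M-EPT
and S-EPT entries, and init_one_page(), fake_page_add(), fake_mr_extend(),
tables_consistent() and vm_dead are all hypothetical names.

```c
/* Toy userspace model of the ordering only; none of this is kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NPAGES 4

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER; /* models mmu_lock */
static bool mirror_present[NPAGES]; /* hypothetical stand-in for M-EPT entries */
static bool sept_present[NPAGES];   /* hypothetical stand-in for S-EPT entries */
static bool vm_dead;                /* set when the VM must be terminated */

/* Hypothetical stand-in for the ADD operation; returns 0 on success. */
static int fake_page_add(int gfn)
{
	sept_present[gfn] = true;
	return 0;
}

/* Hypothetical stand-in for the EXTEND operation; 'fail' injects an error. */
static int fake_mr_extend(int gfn, bool fail)
{
	(void)gfn;
	return fail ? -1 : 0;
}

/* Add one initial page, then extend the measurement outside the lock. */
static int init_one_page(int gfn, bool extend_fails)
{
	int ret;

	assert(gfn >= 0 && gfn < NPAGES);

	/* ADD and the mirror update stay together under the lock. */
	pthread_mutex_lock(&mmu_lock);
	ret = fake_page_add(gfn);
	if (!ret)
		mirror_present[gfn] = true;
	pthread_mutex_unlock(&mmu_lock);
	if (ret)
		return ret;

	/* A failed EXTEND corrupts only the measurement, so kill the VM. */
	if (fake_mr_extend(gfn, extend_fails)) {
		vm_dead = true;
		return -1;
	}
	return 0;
}

/* True if every mirror entry agrees with its S-EPT counterpart. */
static bool tables_consistent(void)
{
	for (int i = 0; i < NPAGES; i++)
		if (mirror_present[i] != sept_present[i])
			return false;
	return true;
}
```

In this model, calling init_one_page() with extend_fails set leaves both
arrays in agreement and only sets vm_dead, which is the property being argued
for: the failure is fatal to the VM but does not desynchronize the two tables.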
>
> > > But maybe a better reason is that we could better handle errors
> > > outside the fault. (i.e. no 5 line comment about why not to return an error in
> > > tdx_mem_page_add() due to code in another file).
> > >
> > > I wonder if Yan can give an analysis of any zapping races if we do that.
> >
> > When you say analysis, you mean detecting user space did something wrong
> > and failing gracefully? Is that correct?
>
> More specifically, whether or not KVM can WARN without the WARN being user
> triggerable. Kernel policy is that WARNs must not be triggerable absent kernel,
> hardware, or firmware bugs. What we're trying to figure out is if there's a
> flow that can be triggered by userspace (misbehaving or not) that would trip a
> WARN even if KVM is operating as expected. I'm pretty sure the answer is "no".
>
> Oh, and WARNing here is desirable, because it improves the chances of detecting
> a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module.

OK... In other areas of the kernel, if the user misbehaves it is
reasonable to fail an operation. I would think that being fatal to the VM
would be fine if QEMU did not properly synchronize ADD, measurement, and
finalize, for example. Am I wrong in that assumption?

Ira