Message-ID: <6vdj4mfxlyvypn743klxq5twda66tkugwzljdt275rug2gmwwl@zdziylxpre6y>
Date: Thu, 24 Apr 2025 12:05:15 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com,
dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com,
ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, david@...hat.com,
vannapurve@...gle.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com,
pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, jun.miao@...el.com,
ira.weiny@...el.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory
On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
> On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
> > > Basic huge page mapping/unmapping
> > > ---------------------------------
> > > - TD build time
> > > This series enforces that all private mappings be 4KB during the TD build
> > > phase, because tdh_mem_page_add(), the SEAMCALL for adding private pages
> > > at TD build time, only supports 4KB mappings. Enforcing 4KB mappings also
> > > simplifies the build-time code by eliminating the need to consider merging
> > > or splitting in the mirror page table during the TD build phase.
> > >
> > > The underlying pages allocated from guest_memfd during the TD build phase
> > > can still be large, allowing for potential merging into 2MB mappings once
> > > the TD is running.
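
A minimal sketch of that build-time rule (the helper name and the
"finalized" field are illustrative assumptions, not necessarily what the
series does):

	/*
	 * Sketch only: clamp the private mapping level to 4KB until the TD is
	 * finalized, because tdh_mem_page_add() can only install 4KB pages at
	 * build time. Larger guest_memfd folios can still back the range and
	 * be mapped (or merged) at 2MB once the TD is running.
	 */
	static int tdx_private_max_mapping_level(struct kvm *kvm)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

		if (!kvm_tdx->finalized)	/* still in the TD build phase */
			return PG_LEVEL_4K;

		return PG_LEVEL_2M;		/* huge mappings allowed at runtime */
	}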
> >
> > It can be done before the TD is running. Merging is allowed at the TD
> > build stage.
> >
> > But, yes, for simplicity we can skip it for initial enabling.
> Yes, to avoid complicating the kvm_tdx->nr_premapped calculation.
> I also don't see any benefit to allowing merging during the TD build stage.
>
> >
> > > Page splitting (page demotion)
> > > ------------------------------
> > > Page splitting occurs in two paths:
> > > (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
> > >
> > > For normal VMs, if zapping a narrow region would require splitting a
> > > huge page, KVM can simply zap the surrounding GFNs rather than split
> > > the huge page. The pages can then be faulted back in, at which point
> > > KVM can map them at 4KB level.
> > >
> > > The reason TDX can't use the normal VM solution is that accepted private
> > > memory, once zapped, cannot easily be faulted back in, since it can only
> > > be re-faulted as unaccepted. So KVM will sometimes have to do the page
> > > splitting as part of the zapping operations.
> > >
> > > These zapping operations can occur for a few reasons:
> > > 1. VM teardown.
> > > 2. Memslot removal.
> > > 3. Conversion of private pages to shared.
> > > 4. Userspace does a hole punch to guest_memfd for some reason.
> > >
> > > For cases 1 and 2, splitting before zapping is unnecessary because
> > > either the entire range will be zapped or huge pages do not span
> > > memslots.
> > >
> > > Cases 3 and 4 require splitting, which is also followed by a backend
> > > page split in guest_memfd.
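
Roughly, that split decision could be expressed like this (the enum and
helper are purely illustrative, not taken from the series):

	/*
	 * Sketch only: decide whether huge private mappings must be demoted
	 * to 4KB before the zap, based on why the range is being zapped.
	 */
	enum tdx_zap_reason {
		TDX_ZAP_VM_TEARDOWN,	 /* 1. the whole TD goes away		  */
		TDX_ZAP_MEMSLOT_REMOVAL, /* 2. huge pages don't span memslots	  */
		TDX_ZAP_CONVERT_SHARED,	 /* 3. private -> shared conversion	  */
		TDX_ZAP_HOLE_PUNCH,	 /* 4. hole punch on guest_memfd	  */
	};

	static bool tdx_zap_needs_split(enum tdx_zap_reason reason)
	{
		/*
		 * Cases 1 and 2 always cover the whole huge mapping, so it can
		 * simply be zapped. Cases 3 and 4 may hit a sub-range of a 2MB
		 * mapping and must demote it first, followed by a backend page
		 * split in guest_memfd.
		 */
		return reason == TDX_ZAP_CONVERT_SHARED ||
		       reason == TDX_ZAP_HOLE_PUNCH;
	}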
> > >
> > > (b) with shared kvm->mmu_lock, triggered by fault.
> > >
> > > Splitting in this path is not accompanied by a backend page split
> > > (since a backend page split necessitates a split-and-zap operation in
> > > the former path). It is triggered when KVM finds that a non-leaf entry
> > > is replacing a huge entry in the fault path, which is usually caused by
> > > vCPUs' concurrent ACCEPT operations at different levels.
> >
> > Hm. This sounds like funky behaviour on the guest side.
> >
> > You only saw it in a synthetic test, right? No real guest OS should do
> > this.
> Right. In selftest only.
> Also in case of any guest bugs.
>
> > It is only possible if the guest is reckless enough to be exposed to
> > double-accept attacks.
> >
> > We should consider emitting a warning if we detect such a case on the KVM side.
> Is it acceptable to put warnings in the host kernel in case of guest bugs
> or attacks?

pr_warn_once() shouldn't be a big deal.
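
E.g. something like this in the split-request path (purely illustrative):

	/*
	 * Illustrative: a huge-page split requested from the fault path
	 * normally means the guest issued ACCEPTs at mixed levels for the
	 * same region, i.e. a guest bug or a double-accept attempt.
	 */
	pr_warn_once("KVM: TDX: page split requested in fault path, possible double ACCEPT by the guest\n");
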
> > > This series simply ignores the splitting request in the fault path to
> > > avoid unnecessary bounces between levels. The vCPU that performs ACCEPT
> > > at a lower level will eventually figure out that the page has been
> > > accepted at a higher level by another vCPU.
> > >
> > > A rare case that could lead to splitting in the fault path is when a TD
> > > is configured to receive #VE and accesses memory before the ACCEPT
> > > operation. When a vCPU accesses a private GFN, since there is no guest
> > > preferred level, KVM could create the mapping at 2MB level. If the TD
> > > then performs the ACCEPT operation only at 4KB level, splitting in the
> > > fault path will be triggered. However, this is not regarded as a typical
> > > use case, as a TD usually accepts pages in the order 1GB->2MB->4KB. The
> > > worst outcome of ignoring the resulting splitting request is an endless
> > > EPT violation. This would not happen for a Linux guest, which does not
> > > expect any #VE.
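
For illustration, the "ignore" could look roughly like this in the fault
path (existing_level is a made-up name; goal_level and RET_PF_RETRY are
from KVM's generic MMU fault handling, and this is not the actual series):

	/*
	 * Sketch only: if the fault wants a 4KB mapping but a 2MB leaf
	 * already covers the GFN, just resume the vCPU instead of demoting.
	 * The lower-level ACCEPT will eventually observe that the range has
	 * already been accepted at the higher level.
	 */
	if (fault->goal_level < existing_level)
		return RET_PF_RETRY;
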
> >
> > Even if the guest accepts memory in response to #VE, it still has to
> > serialize ACCEPT requests to the same memory block, and track what has
> > been accepted.
> >
> > Double accept is a guest bug.
> In the rare case, there's no double accept.
> 1. Guest accesses a private GPA.
> 2. KVM creates a 2MB mapping in PENDING state and returns to the guest.
> 3. Guest re-accesses, causing the TDX module to inject a #VE.
> 4. Guest accepts at 4KB level only.
> 5. EPT violation to KVM for page splitting.
>
> Here, we expect a normal guest to accept in the order 1GB->2MB->4KB in
> step 4.

Okay, I think I misunderstood this case. I thought there were competing 4k
vs 2M ACCEPT requests to the same memory block.

Accepting everything at 4k level is stupid, but valid behaviour on the
guest's behalf. This splitting case has to be supported before the patchset
hits the mainline.

BTW, there's no 1G ACCEPT. I know the guest is written as if it were a
thing, but the TDX module only supports 4k and 2M. 1G is only reachable
via promotion.
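
For reference, the guest-side accept order has roughly this shape
(simplified; guest_accept_one() is a made-up helper standing in for the
real guest ACCEPT call and returns the number of bytes accepted):

	static void accept_range(phys_addr_t start, phys_addr_t end)
	{
		while (start < end) {
			unsigned long len = end - start;
			unsigned long accepted = 0;

			/*
			 * The 1GB attempt exists in the guest code, but current
			 * TDX modules reject it, so it effectively never succeeds
			 * and memory is really accepted at 2MB or 4KB. 1GB
			 * secure-EPT mappings can only appear later via promotion
			 * on the host side.
			 */
			if (IS_ALIGNED(start, PUD_SIZE) && len >= PUD_SIZE)
				accepted = guest_accept_one(start, PG_LEVEL_1G);

			if (!accepted && IS_ALIGNED(start, PMD_SIZE) && len >= PMD_SIZE)
				accepted = guest_accept_one(start, PG_LEVEL_2M);

			if (!accepted)
				accepted = guest_accept_one(start, PG_LEVEL_4K);

			if (!accepted)
				panic("Failed to accept memory\n");

			start += accepted;
		}
	}
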
--
Kiryl Shutsemau / Kirill A. Shutemov