linux-kernel - Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

Message-ID: <f7fe0d58-c6a0-4d25-8c5f-73f7b747970f@suse.com>
Date: Thu, 24 Apr 2025 11:08:55 +0200
From: Juergen Gross <jgross@...e.com>
To: "Kirill A. Shutemov" <kirill@...temov.name>,
 Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org,
 kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com,
 dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com,
 ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com,
 david@...hat.com, vannapurve@...gle.com, vbabka@...e.cz, jroedel@...e.de,
 thomas.lendacky@....com, pgonda@...gle.com, zhiquan1.li@...el.com,
 fan.du@...el.com, jun.miao@...el.com, ira.weiny@...el.com,
 chao.p.peng@...el.com
Subject: Re: [RFC PATCH 00/21] KVM: TDX huge page support for private memory

On 24.04.25 11:05, Kirill A. Shutemov wrote:
> On Thu, Apr 24, 2025 at 04:33:13PM +0800, Yan Zhao wrote:
>> On Thu, Apr 24, 2025 at 10:35:47AM +0300, Kirill A. Shutemov wrote:
>>> On Thu, Apr 24, 2025 at 11:00:32AM +0800, Yan Zhao wrote:
>>>> Basic huge page mapping/unmapping
>>>> ---------------------------------
>>>> - TD build time
>>>>    This series enforces that all private mappings be 4KB during the TD build
>>>>    phase, due to the TDX module's requirement that tdh_mem_page_add(), the
>>>>    SEAMCALL for adding private pages during TD build time, only supports 4KB
>>>>    mappings. Enforcing 4KB mappings also simplifies the implementation of
>>>>    code for TD build time, by eliminating the need to consider merging or
>>>>    splitting in the mirror page table during TD build time.
>>>>    
>>>>    The underlying pages allocated from guest_memfd during TD build time
>>>>    phase can still be large, allowing for potential merging into 2MB
>>>>    mappings once the TD is running.
>>>
>>> It can be done before TD is running. The merging is allowed on TD build
>>> stage.
>>>
>>> But, yes, for simplicity we can skip it for initial enabling.
>> Yes, to avoid complicating kvm_tdx->nr_premapped calculation.
>> I also don't see any benefit to allow merging during TD build stage.
>>
>>>
>>>> Page splitting (page demotion)
>>>> ------------------------------
>>>> Page splitting occurs in two paths:
>>>> (a) with exclusive kvm->mmu_lock, triggered by zapping operations,
>>>>
>>>>      For normal VMs, if zapping a narrow region that would need to split a
>>>>      huge page, KVM can simply zap the surrounding GFNs rather than
>>>>      splitting a huge page. The pages can then be faulted back in, where KVM
>>>>      can handle mapping them at a 4KB level.
>>>>
>>>>      The reason why TDX can't use the normal VM solution is that zapping
>>>>      private memory that is accepted cannot easily be re-faulted, since it
>>>>      can only be re-faulted as unaccepted. So KVM will have to sometimes do
>>>>      the page splitting as part of the zapping operations.
>>>>
>>>>      These zapping operations can occur for a few reasons:
>>>>      1. VM teardown.
>>>>      2. Memslot removal.
>>>>      3. Conversion of private pages to shared.
>>>>      4. Userspace does a hole punch to guest_memfd for some reason.
>>>>
>>>>      For cases 1 and 2, splitting before zapping is unnecessary because
>>>>      either the entire range will be zapped or huge pages do not span
>>>>      memslots.
>>>>      
>>>>      Case 3 or case 4 requires splitting, which is also followed by a
>>>>      backend page splitting in guest_memfd.
>>>>
>>>> (b) with shared kvm->mmu_lock, triggered by fault.
>>>>
>>>>      Splitting in this path is not accompanied by a backend page splitting
>>>>      (since backend page splitting necessitates a splitting and zapping
>>>>       operation in the former path).  It is triggered when KVM finds that a
>>>>      non-leaf entry is replacing a huge entry in the fault path, which is
>>>>      usually caused by vCPUs' concurrent ACCEPT operations at different
>>>>      levels.
>>>
>>> Hm. This sounds like funky behaviour on the guest side.
>>>
>>> You only saw it in a synthetic test, right? No real guest OS should do
>>> this.
>> Right. In selftest only.
>> Also in case of any guest bugs.
>>
>>> It can only be possible if guest is reckless enough to be exposed to
>>> double accept attacks.
>>>
>>> We should consider putting a warning if we detect such case on KVM side.
>> Is it acceptable to put warnings in host kernel in case of guest bugs or
>> attacks?
> 
> pr_warn_once() shouldn't be a big deal.

Shouldn't such a warning be once per guest?

So either we need a per-guest flag, or we could use pr_warn_ratelimited().
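
For background: pr_warn_once() keeps one static flag per call site, so it
fires once for the whole host, not once per guest. A rough userspace sketch
of the per-guest flag variant is below. It is only a model of the idea, not
KVM code: struct guest, guest_init() and guest_warn_once() are hypothetical
names, and fprintf() stands in for pr_warn().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for per-guest state; not a real KVM structure. */
struct guest {
	int id;
	atomic_flag split_warned;	/* set once the warning was printed */
};

static void guest_init(struct guest *g, int id)
{
	g->id = id;
	atomic_flag_clear(&g->split_warned);
}

/*
 * Print the warning at most once per guest, even if several vCPU threads
 * race here. Returns true only for the call that actually printed.
 */
static bool guest_warn_once(struct guest *g, const char *msg)
{
	/* test_and_set returns the previous value: true means already warned */
	if (atomic_flag_test_and_set(&g->split_warned))
		return false;

	fprintf(stderr, "guest %d: %s\n", g->id, msg);
	return true;
}
```

With this shape a second guest still gets its own warning, while a guest
that keeps triggering the condition cannot flood the host log. The
pr_warn_ratelimited() alternative needs no extra state but may still log
repeatedly for the same guest over time.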


Juergen
