linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

Message-ID: <diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com>
Date: Fri, 27 Jun 2025 10:59:47 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, 
	"Shutemov, Kirill" <kirill.shutemov@...el.com>
Cc: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, 
	"Du, Fan" <fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>, 
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>, 
	"Zhao, Yan Y" <yan.y.zhao@...el.com>, "tabba@...gle.com" <tabba@...gle.com>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Peng, Chao P" <chao.p.peng@...el.com>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "michael.roth@....com" <michael.roth@....com>, 
	"vbabka@...e.cz" <vbabka@...e.cz>, "Yamahata, Isaku" <isaku.yamahata@...el.com>, 
	"Li, Zhiquan1" <zhiquan1.li@...el.com>, "Annapurve, Vishal" <vannapurve@...gle.com>, 
	"Weiny, Ira" <ira.weiny@...el.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, 
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:

> On Thu, 2025-06-26 at 18:16 +0300, Shutemov, Kirill wrote:
>> > > Please see my reply to Yan, I'm hoping y'all will agree to something
>> > > between option f/g instead.
>> > 
>> > I'm not sure about the HWPoison approach, but I'm not totally against it. My
>> > bias is that all the MM concepts are tightly interlinked. It may fit
>> > perfectly, but every new use needs to be checked for how it fits in with the
>> > other MM users of it. Every time I've decided a page flag was the perfect
>> > solution to my problem, I got informed otherwise. Let me try to flag Kirill
>> > to this discussion. He might have some insights.
>> 
>> We chatted with Rick about this.
>> 
>> If I understand correctly, we are discussing the situation where the TDX
>> module failed to return a page to the kernel.
>> 
>> I think it is reasonable to use HWPoison for this case. We cannot
>> guarantee that we will read back whatever we write to the page. TDX module
>> has creative ways to corrupt it. 
>> 
>> The memory is no longer functioning as memory. It matches the definition
>> of HWPoison quite closely.
>
> ok! Lets go f/g. Unless Yan objects.

Follow up as I think about this more: Perhaps we don't need to check for
HWpoison (or TDX unmap errors) on conversion.

On a high level, we don't need to check for HWpoison because conversion
is about changing memory metadata, as in memory privacy status and
struct folio sizes, and not touching memory contents at all. HWpoison
means the memory and its contents shouldn't be used.

Specifically for private-to-shared conversions, where the TDX unmap
error can happen, on such an error we will:

1. HWpoison the page
2. Bug the TD

The conversion then reports success even though the unmap failed, so the
host (guest_memfd) will think the memory is shared while it may still be
mapped in Secure-EPTs.

I think that is okay because the only existing user (TD) stops using
that memory, and no future users can use the memory:

1. The TD will be bugged by then. A non-running TD cannot touch memory
   that had the error on unmapping.

2. The page was not mapped into host page tables (since it was
   private). Even if it were mapped, it will be unmapped from host page
   tables (host page table unmaps don't fail). If the host tries to
   touch the memory, on the next fault, core-mm would notice that the
   page is poisoned and not fault it in.
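
To make the flow above concrete, here is a minimal user-space sketch in
C. It is a toy model under my own assumptions, not kernel code: every
name in it (toy_page, toy_td, toy_secure_ept_unmap(),
toy_convert_to_shared(), toy_host_fault(), TOY_EHWPOISON) is made up for
illustration and does not correspond to an actual KVM, guest_memfd or
TDX interface. It only shows that the conversion can update metadata
without propagating the unmap error, and that a later host fault is the
place where the poison gets noticed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all types, names and values here are hypothetical. */
#define TOY_EHWPOISON 133 /* placeholder error value for this sketch */

struct toy_page {
	bool shared;        /* guest_memfd's view: shared (true) or private */
	bool hwpoisoned;    /* page must no longer be used as memory */
	bool in_secure_ept; /* still mapped in the TD's Secure-EPT */
};

struct toy_td {
	bool bugged;        /* a bugged TD is not expected to run again */
};

/* Stand-in for the TDX unmap; 'fail' injects the error case. */
static int toy_secure_ept_unmap(struct toy_page *page, bool fail)
{
	if (fail)
		return -1;  /* the mapping stays in the Secure-EPT */
	page->in_secure_ept = false;
	return 0;
}

/*
 * Private-to-shared conversion that does not propagate the unmap error.
 * On failure it poisons the page and bugs the TD, then still flips the
 * metadata and returns success. Memory contents are never touched.
 */
static int toy_convert_to_shared(struct toy_td *td, struct toy_page *page,
				 bool inject_unmap_failure)
{
	assert(td && page);

	if (toy_secure_ept_unmap(page, inject_unmap_failure)) {
		page->hwpoisoned = true; /* 1. HWpoison the page */
		td->bugged = true;       /* 2. Bug the TD */
	}

	page->shared = true;
	return 0;
}

/* Host fault path: the poison check happens here, not at conversion. */
static int toy_host_fault(const struct toy_page *page)
{
	if (page->hwpoisoned)
		return -TOY_EHWPOISON; /* refuse to fault the page in */
	return 0;
}
```

With an injected unmap failure, toy_convert_to_shared() still returns 0
and marks the page shared, while the page stays in the (toy) Secure-EPT,
the TD is bugged, and toy_host_fault() refuses the page.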

By the way, when we "bug the TD", can we assume that ALL vCPUs, not just
the one that did the failed unmap, will stop running?

I guess even if the other vCPUs don't stop running, the TD's vCPUs will
access the page as shared, thinking the conversion succeeded, and keep
hitting #VEs. If the TD accesses the page as private, that's fine too:
the page was not unmapped from Secure-EPTs due to the unmap failure, and
the host cannot write to it (the host will see HWpoison on the next
fault). So there's no host crash, and this doesn't defeat the purpose of
guest_memfd.

If the guest_memfd with a HWpoisoned page is linked to a new, runnable
TD, the new TD would need to fault in the page as private. When the host
tries to fault the page into the new TD, it will hit the HWpoison, and
userspace will get to know about the HWpoison.
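
A small sketch in C of the reuse case in the paragraph above, again as a
toy model and not kernel code. All names (toy_gmem, toy_gmem_page,
toy_gmem_poison(), toy_gmem_fault_private(), TOY_EHWPOISON) are
hypothetical, and how exactly userspace gets told is an assumption here,
modeled simply as a negative return value. The point is only that the
poison state stays with the guest_memfd page, so it outlives the TD that
hit the unmap error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name and value here is hypothetical. */
#define TOY_GMEM_NR_PAGES 4
#define TOY_EHWPOISON     133 /* placeholder error value for this sketch */

struct toy_gmem_page {
	bool hwpoisoned;
};

/* The guest_memfd stand-in owns the pages, independent of any one TD. */
struct toy_gmem {
	struct toy_gmem_page pages[TOY_GMEM_NR_PAGES];
};

/* Record the poison on the page itself (e.g. after a failed unmap). */
static void toy_gmem_poison(struct toy_gmem *gmem, size_t index)
{
	assert(index < TOY_GMEM_NR_PAGES);
	gmem->pages[index].hwpoisoned = true;
}

/*
 * Private fault on behalf of whichever TD is currently linked to the
 * guest_memfd. Returns 0 if the page may be mapped, or -TOY_EHWPOISON,
 * which this sketch assumes is passed back to userspace.
 */
static int toy_gmem_fault_private(const struct toy_gmem *gmem, size_t index)
{
	assert(index < TOY_GMEM_NR_PAGES);
	if (gmem->pages[index].hwpoisoned)
		return -TOY_EHWPOISON;
	return 0;
}
```

A new TD linked to the same toy_gmem can still fault in the healthy
pages; only the fault on the poisoned index fails.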

Yan, Rick, let me know what you think of this!