linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqz7c0q48g7.fsf@ackerleytng-ctop.c.googlers.com>
Date: Wed, 02 Jul 2025 13:57:28 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, 
	"Li, Zhiquan1" <zhiquan1.li@...el.com>, "Du, Fan" <fan.du@...el.com>, 
	"Hansen, Dave" <dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>, 
	"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>, 
	"vbabka@...e.cz" <vbabka@...e.cz>, "Shutemov, Kirill" <kirill.shutemov@...el.com>, 
	"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com" <seanjc@...gle.com>, 
	"Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>, 
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>, 
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Miao, Jun" <jun.miao@...el.com>, 
	"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:

> On Tue, 2025-07-01 at 14:48 -0700, Ackerley Tng wrote:
>> Perhaps we had different understandings of f/g :P
>
> Ah yes, I thought you were saying that guestmemfd would use poison internally
> via some gmem_buggy_page() or similar. I guess I thought it is more of
> guestmemfd's job. But as Yan pointed out, we need to handle non gmem page errors
> too. Currently we leak, but it would be nice to keep the handling symmetrical.
> Which would be easier if we did it all in TDX code.
>

I meant to set HWpoison externally from guest_memfd because I feel that
it is a separate thing. Unmap failures are similar to discovering a
memory error. If setting HWpoison on memory error is external to
guest_memfd, setting HWpoison on unmap failure should also be
conceptually external to guest_memfd.

After Yan pointed out that non-guest_memfd page errors need to be
handled, it aligns with the idea that setting HWpoison is external to
guest_memfd.

I agree keeping the handling symmetrical would be best, so in both cases
the part of KVM TDX code that sees the unmap failure should directly set
HWpoison and not go through guest_memfd.

>> 
>> I meant that TDX module should directly set the HWpoison flag on the
>> folio (HugeTLB or 4K, guest_memfd or not), not call into guest_memfd.
>> 
>> guest_memfd will then check this flag when necessary, specifically:
>> 
>> * On faults, either into guest or host page tables 
>> * When freeing the page
>>     * guest_memfd will not return HugeTLB pages that are poisoned to
>>       HugeTLB and just leak it
>>     * 4K pages will be freed normally, because free_pages_prepare() will
>>       check for HWpoison and skip freeing, from __folio_put() ->
>>       free_frozen_pages() -> __free_frozen_pages() ->
>>       free_pages_prepare()
>> * I believe guest_memfd doesn't need to check HWpoison on conversions [1]
>> 
>> [1] https://lore.kernel.org/all/diqz5xghjca4.fsf@ackerleytng-ctop.c.googlers.com/
>
> If a poisoned page continued to be used, it's a bit weird, no? 

Do you mean "continued to be used" in the sense that it is present in a
filemap and belongs to a (guest_memfd) inode?

A poisoned page is not faulted in anywhere, and in that sense the page
is not "used". In the case of regular poisoning as in a call to
memory_failure(), the page is unmapped from the page tables. If that
page belongs to guest_memfd, in today's code [2], guest_memfd
intentionally does not truncate it from the filemap. For guest_memfd,
handling the HWpoison at fault time is by design; keeping it present in
the filemap is by design.

In the case of TDX unmap failures leading to HWpoison, the only place it
may remain mapped is in the Secure-EPTs. I use "may" because I'm not
sure about how badly the unmap failed. But either way, the TD gets
bugged, all vCPUs of the TD are stopped, so the HWpoison-ed page is no
longer "used".

[2] https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/virt/kvm/guest_memfd.c#L334

> It could take an
> #MC for another reason from userspace and the handling code would see the page
> flag is already set. If it doesn't already trip up some MM code somewhere, it
> might put undue burden on the memory failure code to have to expect repeated
> poisoning of the same memory.
>

If it does take another #MC and go to memory_failure(), memory_failure()
already checks for the HWpoison flag being set [3]. This is handled by
killing the process. There is similar handling for a HugeTLB
folio. We're not introducing anything new by using HWpoison; we're
buying into the HWpoison framework, which already handles seeing a
HWpoison when handling a poison.

[3] https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/mm/memory-failure.c#L2270

>> 
>> > What about a kvm_gmem_buggy_cleanup() instead of the system wide one. KVM calls
>> > it and then proceeds to bug the TD only from the KVM side. It's not as safe for
>> > the system, because who knows what a buggy TDX module could do. But TDX module
>> > could also be buggy without the kernel catching wind of it.
>> > 
>> > Having a single callback to basically bug the fd would solve the atomic context
>> > issue. Then guestmemfd could dump the entire fd into memory_failure() instead of
>> > returning the pages. And developers could respond by fixing the bug.
>> > 
>> 
>> This could work too.
>> 
>> I'm in favor of buying into the HWpoison system though, since we're
>> quite sure this is fair use of HWpoison.
>
> Do you mean manually setting the poison flag, or calling into memory_failure(),
> and friends?

I mean manually setting the poison flag.

* If regular 4K page, set the flag.
* If THP page (not (yet) supported by guest_memfd), set the poison flag
  on the specific subpage causing the error, and in addition set THP'S has_hwpoison
  flag
* If HugeTLB page, call folio_set_hugetlb_hwpoison() on the subpage.

This is already the process in memory_failure() and perhaps some
refactoring could be done.

I think calling memory_failure() would do too much, since in addition to
setting the flag, memory_failure() also sometimes does freeing and may
kill processes, and triggers the users of the page to further handle the
HWpoison.

> If we set them manually, we need to make sure that it does not have
> side effects on the machine check handler. It seems risky/messy to me. But
> Kirill didn't seem worried.
>

I believe the memory_failure() is called from the machine check handler:

DEFINE_IDTENTRY_MCE(exc_machine_check)
  -> exc_machine_check_kernel()
     -> do_machine_check()
        -> kill_me_now() or kill_me_maybe()
           -> memory_failure()

(I might have quoted just one of the paths and I'll have to look into it
more.)

For now, IIUC setting the poison flag is a subset of memory_failure(), which is a
subset of what the machine check handler does.

memory_failure() handles an already poisoned page, so I don't see any
side effects.

I'm happy that Kirill didn't seem worried :) Rick, let me know if you
see any specific risks.

> Maybe we could bring the poison page flag up to DavidH and see if there is any
> concern before going down this path too far?
>

I can do that. David's cc-ed on this email, and I hope to get a chance
to talk about handling HWpoison (generally, not TDX specifically) at the
guest_memfd bi-weekly upstream call on 2025-07-10 so I can bring this up
too.

>> 
>> Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
>> on the parts of the folios in trouble?
>
> I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
> VM_BUG_ON() setting vm_dead.

Setting a bool on the fd is a possible option too. Comparing an
inode-level boolean and HWpoison, I still prefer HWpoison because

1. HWpoison gives us more information about which (sub)folio was
   poisoned. We can think of the bool on the fd as an fd-wide
   poisoning. If we don't know which subpage has an error, we're forced
   to leak the entire fd when the inode is released, which could be a
   huge amount of memory leaked.
2. HWpoison is already checked on faults, so there is no need to add an
   extra check on a bool
3. For HugeTLB, HWpoison will have to be summarized/itemized on merge/split to handle
   regular non-TDX related HWpoisons, so no additional code there.

> After an invalidate, if gmem see this, it needs to
> assume everything failed, and invalidate everything and poison all guest memory.
> The point was to have the simplest possible handling for a rare error.

I agree a bool will probably result in fewer lines of code being changed
and could be a fair first cut, but I feel like we would very quickly
need another patch series to get more granular information and not have
to leak an entire fd worth of memory.

Along these lines, Yan seems to prefer setting HWpoison on the entire
folio without going into the details of the exact subfolios being
poisoned. I think this is a possible in-between solution that doesn't
require leaking the entire fd worth of memory, but it still leaks more
than just where the actual error happened.

I'm willing to go with just setting HWpoison on the entire large folio
as a first cut and leak more memory than necessary (because if we don't
know which subpage it is, we are forced to leak everything to be safe).

However, this patch series needs a large page provider in guest_memfd, and
will only land either after THP or HugeTLB support lands in
guest_memfd.

For now if you're testing on guest_memfd+HugeTLB,
folio_set_hugetlb_hwpoison() already exists, why not use it?

> Although
> it's only a proposal. The TDX emergency shutdown option may be simpler still.
> But killing all TDs is not ideal. So thought we could at least consider other
> options.
>
> If we have a solution where TDX needs to do something complicated because
> something of its specialness, it may get NAKed.

Using HWpoison is generic, since guest_memfd needs to handle HWpoison
for regular memory errors anyway. Even if it is not a final solution, it
should be good enough, if not for this patch series to merge, at least
for the next RFC of this patch series. :)

> This is my main concern with the
> direction of this problem/solution. AFAICT, we are not even sure of a concrete
> problem, and it appears to be special to TDX. So the complexity budget should be
> small. It's in sharp contrast to the length of the discussion.