Message-ID: <diqz1pqq5qio.fsf@ackerleytng-ctop.c.googlers.com>
Date: Tue, 08 Jul 2025 14:19:59 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"Shutemov, Kirill" <kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
"vbabka@...e.cz" <vbabka@...e.cz>, "tabba@...gle.com" <tabba@...gle.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "michael.roth@....com" <michael.roth@....com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"Peng, Chao P" <chao.p.peng@...el.com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Miao, Jun" <jun.miao@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:
> On Wed, 2025-07-02 at 13:57 -0700, Ackerley Tng wrote:
>> >
>> > If a poisoned page continued to be used, it's a bit weird, no?
>>
>> Do you mean "continued to be used" in the sense that it is present in a
>> filemap and belongs to a (guest_memfd) inode?
>
> I mean any way where it might get read or written to again.
>
Today, when handling memory failures, guest_memfd unmaps the page from
the guest; on the next fault, guest_memfd discovers the HWpoison flag
and returns -EHWPOISON for KVM to handle.

If we go with my proposal, on TDX unmap failure, KVM kills the TD and
sets the HWpoison flag on the page.

Hence, the TD will not read/write the poisoned page.

TDX unmap failure implies that the page is private, so it will not be
mapped into the host page tables. If it is not in the host page tables,
the next access from the host will cause a page fault, and that's when
the HWpoison will be discovered.

Hence, the host will also not read/write the poisoned page. Let me know
if you see any other way the poisoned page could be read or written
again.
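
For reference, today's fault-time check is roughly the following (a
simplified sketch of the check in virt/kvm/guest_memfd.c's
kvm_gmem_get_pfn() path; locking and error paths trimmed):

  /*
   * Simplified sketch: a folio already marked HWpoison is never
   * handed out again; the caller gets -EHWPOISON instead.
   */
  if (folio_test_hwpoison(folio)) {
          r = -EHWPOISON;
          goto out;
  }
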
>>
>> A poisoned page is not faulted in anywhere, and in that sense the page
>> is not "used". In the case of regular poisoning as in a call to
>> memory_failure(), the page is unmapped from the page tables. If that
>> page belongs to guest_memfd, in today's code [2], guest_memfd
>> intentionally does not truncate it from the filemap. For guest_memfd,
>> handling the HWpoison at fault time is by design; keeping it present in
>> the filemap is by design.
>
> I thought I read that you would allow it to be re-used. I see that the code
> already checks for poison in the kvm_gmem_get_pfn() path and the mmap() path. So
> it will just sit in the fd and not be handed out again. I think it's ok. Well,
> as long as conversion to shared doesn't involve zeroing...?
>
IIUC it is zeroed on unmapping from the guest page tables? Is that done
by the TDX module, or by TDX code in KVM? Either way, I think both of
those should be stopped once the unmap failure is discovered, as part
of "killing the TD".
>>
>> In the case of TDX unmap failures leading to HWpoison, the only place it
>> may remain mapped is in the Secure-EPTs. I use "may" because I'm not
>> sure about how badly the unmap failed. But either way, the TD gets
>> bugged, all vCPUs of the TD are stopped, so the HWpoison-ed page is no
>> longer "used".
>>
>> [2]
>> https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/virt/kvm/guest_memfd.c#L334
>
> Yes, I saw that. It looks like special error case treatment for the state we are
> setting up.
>
>>
>> > It could take an #MC for another reason from userspace and the
>> > handling code would see the page flag is already set. If it doesn't
>> > already trip up some MM code somewhere, it might put undue burden on
>> > the memory failure code to have to expect repeated poisoning of the
>> > same memory.
>> >
>>
>> If it does take another #MC and go to memory_failure(), memory_failure()
>> already checks for the HWpoison flag being set [3]. This is handled by
>> killing the process. There is similar handling for a HugeTLB
>> folio. We're not introducing anything new by using HWpoison; we're
>> buying into the HWpoison framework, which already handles finding an
>> already-set HWpoison flag while handling a new poison event.
>
> Do you see another user that is setting the poison flag manually like proposed?
> (i.e. not through memory failure handlers)
>
As far as I know, this might be the first case of setting the poison
flag not through memory failure handlers.
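
Concretely, what I have in mind is roughly the following (a
hypothetical sketch: kvm_gmem_mark_poisoned() is a made-up name, and
I'm assuming folio_set_hugetlb_hwpoison(), today internal to
mm/memory-failure.c, could be made available):

  /*
   * Hypothetical sketch: mark a page poisoned after a TDX unmap
   * failure, mirroring only the flag-setting steps of
   * memory_failure().
   */
  static void kvm_gmem_mark_poisoned(struct folio *folio,
                                     struct page *page)
  {
          if (folio_test_hugetlb(folio)) {
                  /* HugeTLB: record the exact poisoned subpage. */
                  folio_set_hugetlb_hwpoison(folio, page);
          } else if (folio_test_large(folio)) {
                  /*
                   * THP: flag the subpage, and note on the folio
                   * that one of its subpages is poisoned.
                   */
                  SetPageHWPoison(page);
                  folio_set_has_hwpoisoned(folio);
          } else {
                  /* Regular 4K page. */
                  SetPageHWPoison(page);
          }
  }
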
>>
>> [3]
>> https://github.com/torvalds/linux/blob/b4911fb0b060899e4eebca0151eb56deb86921ec/mm/memory-failure.c#L2270
>>
>> > >
>> > > > What about a kvm_gmem_buggy_cleanup() instead of the system-wide
>> > > > one. KVM calls it and then proceeds to bug the TD only from the
>> > > > KVM side. It's not as safe for the system, because who knows what
>> > > > a buggy TDX module could do. But the TDX module could also be
>> > > > buggy without the kernel catching wind of it.
>> > > >
>> > > > Having a single callback to basically bug the fd would solve the
>> > > > atomic context issue. Then guest_memfd could dump the entire fd
>> > > > into memory_failure() instead of returning the pages. And
>> > > > developers could respond by fixing the bug.
>> > > >
>> > >
>> > > This could work too.
>> > >
>> > > I'm in favor of buying into the HWpoison system though, since we're
>> > > quite sure this is fair use of HWpoison.
>> >
>> > Do you mean manually setting the poison flag, or calling into
>> > memory_failure() and friends?
>>
>> I mean manually setting the poison flag.
>>
>> * If regular 4K page, set the poison flag.
>> * If THP page (not (yet) supported by guest_memfd), set the poison
>>   flag on the specific subpage causing the error, and in addition set
>>   THP's has_hwpoisoned flag.
>> * If HugeTLB page, call folio_set_hugetlb_hwpoison() on the subpage.
>>
>> This is already the process in memory_failure() and perhaps some
>> refactoring could be done.
>>
>> I think calling memory_failure() would do too much, since in addition to
>> setting the flag, memory_failure() also sometimes does freeing and may
>> kill processes, and triggers the users of the page to further handle the
>> HWpoison.
>
> It definitely seems like there is more involved than setting the flag. Which
> means for our case we should try to understand what we are skipping and how it
> fits with the rest of the kernel. Does any code that checks for poison assume
> that the memory_failure() stuff has been done? Stuff like that.
>
Yup! But I do still think setting HWpoison is good enough to pursue, at
least for the next RFC of this patch series, and in the process of
testing that series we could learn more. Do you mean that we shouldn't
proceed until all of this is verified?
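
(For reference, the already-poisoned path in memory_failure() that [3]
above points at is roughly the following, simplified:)

  /*
   * Simplified from memory_failure() in mm/memory-failure.c: finding
   * the flag already set does not re-run the whole machinery; the
   * accessing process is killed instead.
   */
  if (TestSetPageHWPoison(p)) {
          pr_err("%#lx: already hardware poisoned\n", pfn);
          res = -EHWPOISON;
          if (flags & MF_ACTION_REQUIRED)
                  res = kill_accessing_process(current, pfn, flags);
          goto unlock_mutex;
  }
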
>>
>> > If we set them manually, we need to make sure that it does not have
>> > side effects on the machine check handler. It seems risky/messy to me. But
>> > Kirill didn't seem worried.
>> >
>>
>> I believe memory_failure() is called from the machine check handler:
>>
>> DEFINE_IDTENTRY_MCE(exc_machine_check)
>> -> exc_machine_check_kernel()
>> -> do_machine_check()
>> -> kill_me_now() or kill_me_maybe()
>> -> memory_failure()
>>
>> (I might have quoted just one of the paths and I'll have to look into it
>> more.)
>
> It looked that way to me too. But it works from other contexts. See
> MADV_HWPOISON (which is for testing).
>
>>
>> For now, IIUC setting the poison flag is a subset of memory_failure(),
>> which is a subset of what the machine check handler does.
>>
>> memory_failure() handles an already poisoned page, so I don't see any
>> side effects.
>>
>> I'm happy that Kirill didn't seem worried :) Rick, let me know if you
>> see any specific risks.
>>
>> > Maybe we could bring the poison page flag up to DavidH and see if
>> > there is any concern before going down this path too far?
>> >
>>
>> I can do that. David's cc-ed on this email, and I hope to get a chance
>> to talk about handling HWpoison (generally, not TDX specifically) at the
>> guest_memfd bi-weekly upstream call on 2025-07-10 so I can bring this up
>> too.
>
> Ok sounds good. Should we just continue the discussion there?
I think we're at a point where further discussion isn't really
useful. Kirill didn't seem worried about using HWpoison, so that's a
good sign. I think we can go ahead to use HWpoison for the next RFC of
this series and we might learn more through the process of testing it.
Do you prefer to just wait till the next guest_memfd call (now
rescheduled to 2025-07-17) before proceeding?
> I can try to attend.
>
Sure, thanks! It'll be focused on memory failure handling in general, so
TDX will just be another participant.
>>
>> > >
>> > > Are you saying kvm_gmem_buggy_cleanup() will just set the HWpoison flag
>> > > on the parts of the folios in trouble?
>> >
>> > I was saying kvm_gmem_buggy_cleanup() can set a bool on the fd, similar to
>> > KVM_BUG_ON() setting vm_dead.
>>
>> Setting a bool on the fd is a possible option too. Comparing an
>> inode-level boolean and HWpoison, I still prefer HWpoison because
>>
>> 1. HWpoison gives us more information about which (sub)folio was
>> poisoned. We can think of the bool on the fd as an fd-wide
>> poisoning. If we don't know which subpage has an error, we're forced
>> to leak the entire fd when the inode is released, which could be a
>> huge amount of memory leaked.
>> 2. HWpoison is already checked on faults, so there is no need to add
>>    an extra check on a bool.
>> 3. For HugeTLB, HWpoison will have to be summarized/itemized on
>>    merge/split to handle regular non-TDX-related HWpoisons, so no
>>    additional code there.
>>
>> > After an invalidate, if gmem sees this, it needs to assume everything
>> > failed, and invalidate everything and poison all guest memory. The
>> > point was to have the simplest possible handling for a rare error.
>>
>> I agree a bool will probably result in fewer lines of code being changed
>> and could be a fair first cut, but I feel like we would very quickly
>> need another patch series to get more granular information and not have
>> to leak an entire fd's worth of memory.
>
> We will only leak an entire VM's worth of memory if there is a bug, the form of
> which I'm not sure of. The kernel doesn't usually have a lot of defensive code
> to handle bugs elsewhere, unless it's to help debugging. But especially for
> other platform software (BIOS, etc.), it should try to stay out of the job of
> maintaining code to work around unfixed bugs. And here we are working around
> *potential bugs*.
>
> So another *possible* solution is to expect TDX module/KVM to work. Kill the TD,
> return success to the invalidation, and hope that it doesn't do anything to
> those zombie mappings. It will likely work. Probably much more likely to work
> than some other warning cases in the kernel. As far as debugging, if strange
> crashes are observed after a splat, it can be a good hint.
>
> Unless Yan has some specific case to worry about that she has been holding on to
> that makes this error condition a more expected state. That could change things.
>
>>
>> Along these lines, Yan seems to prefer setting HWpoison on the entire
>> folio without going into the details of the exact subpages being
>> poisoned. I think this is a possible in-between solution that doesn't
>> require leaking the entire fd worth of memory, but it still leaks more
>> than just where the actual error happened.
>>
>> I'm willing to go with just setting HWpoison on the entire large folio
>> as a first cut and leak more memory than necessary (because if we don't
>> know which subpage it is, we are forced to leak everything to be safe).
>
> Leaking more memory than necessary in a bug case seems totally ok to me.
>
>>
>> However, this patch series needs a large page provider in guest_memfd, and
>> will only land after either THP or HugeTLB support lands in
>> guest_memfd.
>>
>> For now if you're testing on guest_memfd+HugeTLB,
>> folio_set_hugetlb_hwpoison() already exists, why not use it?
>>
>> > Although it's only a proposal. The TDX emergency shutdown option may
>> > be simpler still. But killing all TDs is not ideal. So I thought we
>> > could at least consider other options.
>> >
>> > If we have a solution where TDX needs to do something complicated
>> > because of its specialness, it may get NAKed.
>>
>> Using HWpoison is generic, since guest_memfd needs to handle HWpoison
>> for regular memory errors anyway. Even if it is not a final solution, it
>> should be good enough, if not for this patch series to merge, at least
>> for the next RFC of this patch series. :)
>
> Yes, maybe. If we have a normal, easy, non-imposing solution for handling the
> error then I won't object.
I believe we have 1 solution now, with 4 options for preventing the
memory from being re-used by the host:

1. Kill the TD, *and* one of the following:
   a. Kill the host
   b. HWpoison the memory
   c. fd bool, aka inode-wide HWpoison on error
   d. Leak the memory directly (not great; will mess up conversions and
      guest_memfd inode release)
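
For comparison, option (c) could be as small as the following
(hypothetical sketch; the struct, helper names, and use of
inode->i_private are all made up):

  /*
   * Hypothetical sketch for option (c): a single inode-wide error
   * bit, checked wherever the per-page HWpoison flag is checked
   * today.
   */
  struct kvm_gmem_inode_private {
          bool buggy;     /* set on TDX unmap failure */
  };

  static bool kvm_gmem_inode_buggy(struct inode *inode)
  {
          struct kvm_gmem_inode_private *priv = inode->i_private;

          return READ_ONCE(priv->buggy);
  }
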
To push along on this topic, is it okay for us to proceed with HWpoison
and find out along the way whether it turns out to be hard or imposing?