[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqzms9pjaki.fsf@ackerleytng-ctop.c.googlers.com>
Date: Mon, 30 Jun 2025 12:25:49 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Zhao, Yan Y" <yan.y.zhao@...el.com>
Cc: "Shutemov, Kirill" <kirill.shutemov@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com" <thomas.lendacky@....com>,
"tabba@...gle.com" <tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>,
"quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"Peng, Chao P" <chao.p.peng@...el.com>, "Du, Fan" <fan.du@...el.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"Li, Zhiquan1" <zhiquan1.li@...el.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:
> On Mon, 2025-06-30 at 19:13 +0800, Yan Zhao wrote:
>> > > ok! Lets go f/g. Unless Yan objects.
>> I'm ok with f/g. But I have two implementation specific questions:
>>
>> 1. How to set the HWPoison bit in TDX?
I was thinking to set the HWpoison flag based on page type. If regular
4K page, set the flag. If THP page (not (yet) supported by guest_memfd),
set the has_hwpoison flag, and if HugeTLB page, call
folio_set_hugetlb_hwpoison().
But if we go with Rick's suggestion below, then we don't have to figure
this out.
>> 2. Should we set this bit for non-guest-memfd pages (e.g. for S-EPT pages) ?
>
> Argh, I guess we can keep the existing ref count based approach for the other
> types of TDX owned pages?
>
Wait TDX can only use guest_memfd pages, right? Even if TDX can use
non-guest_memfd pages, why not also set HWpoison for non-guest_memfd
pages?
Either way I guess if we go with Rick's suggestion below, then we don't
have to figure the above out.
>>
>> TDX can't invoke memory_failure() on error of removing guest private pages or
>> S-EPT pages, because holding write mmu_lock is regarded as in atomic context.
>> As there's a mutex in memory_failure(),
>> "BUG: sleeping function called from invalid context at kernel/locking/mutex.c"
>> will be printed.
>>
>> If TDX invokes memory_failure_queue() instead, looks guest_memfd can invoke
>> memory_failure_queue_kick() to ensure HWPoison bit is set timely.
>> But which component could invoke memory_failure_queue_kick() for S-EPT pages?
>> KVM?
>
> Hmm, it only has queue of 10 pages per-cpu. If something goes wrong in the TDX
> module, I could see exceeding this during a zap operation. At which point, how
> much have we really handled it?
>
>
> But, at the risk of derailing the solution when we are close, some reflection
> has made me question whether this is all misprioritized. We are trying to handle
> a case where a TDX module bug may return an error when we try to release gmem
> pages. For that, this solution is feeling way too complex.
>
> If there is a TDX module bug, a simpler way to handle it would be to fix the
> bug. In the meantime the kernel can take simpler, more drastic efforts to
> reclaim the memory and ensure system stability.
>
> In the host kexec patches we need to handle a kexec while the TDX module is
> running. The solution is to simply wbinvd on each pCPU that might have entered
> the TDX module. After that, barring no new SEAMCALLs that could dirty
> memory, the pages are free to use by the next kernel. (at least on systems
> without the partial write errata)
>
> So for this we can do something similar. Have the arch/x86 side of TDX grow a
> new tdx_buggy_shutdown(). Have it do an all-cpu IPI to kick CPUs out of
> SEAMMODE, wbivnd, and set a "no more seamcalls" bool. Then any SEAMCALLs after
> that will return a TDX_BUGGY_SHUTDOWN error, or similar. All TDs in the system
> die. Zap/cleanup paths return success in the buggy shutdown case.
>
Do you mean that on unmap/split failure: there is a way to make 100%
sure all memory becomes re-usable by the rest of the host, using
tdx_buggy_shutdown(), wbinvd, etc?
If yes, then I'm onboard with this, and if we are 100% sure all memory
becomes re-usable by the host after all the extensive cleanup, then we
don't need to HWpoison anything.
> Does it fit? Or, can you guys argue that the failures here are actually non-
> special cases that are worth more complex recovery? I remember we talked about
> IOMMU patterns that are similar, but it seems like the remaining cases under
> discussion are about TDX bugs.
Powered by blists - more mailing lists