Message-ID: <aGJxU95VvQvQ3bj6@yzhao56-desk.sh.intel.com>
Date: Mon, 30 Jun 2025 19:13:23 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "quic_eberman@...cinc.com"
<quic_eberman@...cinc.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "tabba@...gle.com" <tabba@...gle.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"michael.roth@....com" <michael.roth@....com>, "vbabka@...e.cz"
<vbabka@...e.cz>, "Yamahata, Isaku" <isaku.yamahata@...el.com>, "Li,
Zhiquan1" <zhiquan1.li@...el.com>, "Annapurve, Vishal"
<vannapurve@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>,
"pgonda@...gle.com" <pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Fri, Jun 27, 2025 at 10:59:47AM -0700, Ackerley Tng wrote:
> "Edgecombe, Rick P" <rick.p.edgecombe@...el.com> writes:
>
> > On Thu, 2025-06-26 at 18:16 +0300, Shutemov, Kirill wrote:
> >> > > Please see my reply to Yan, I'm hoping y'all will agree to something
> >> > > between option f/g instead.
> >> >
> >> > I'm not sure about the HWPoison approach, but I'm not totally against
> >> > it. My bias is that all the MM concepts are tightly interlinked. It may
> >> > fit perfectly, but every new use needs to be checked for how it fits in
> >> > with the other MM users of it. Every time I've decided a page flag was
> >> > the perfect solution to my problem, I got informed otherwise. Let me try
> >> > to flag Kirill to this discussion. He might have some insights.
> >>
> >> We chatted with Rick about this.
> >>
> >> If I understand correctly, we are discussing the situation where the TDX
> >> module failed to return a page to the kernel.
> >>
> >> I think it is reasonable to use HWPoison for this case. We cannot
> >> guarantee that we will read back whatever we write to the page. The TDX
> >> module has creative ways to corrupt it.
> >>
> >> The memory is no longer functioning as memory. It matches the definition
> >> of HWPoison quite closely.
> >
> > ok! Let's go with f/g. Unless Yan objects.
I'm ok with f/g, but I have two implementation-specific questions:
1. How should TDX set the HWPoison bit?
2. Should we set this bit for non-guest_memfd pages too (e.g. for S-EPT pages)?
TDX can't invoke memory_failure() when removal of guest private pages or
S-EPT pages fails, because the write mmu_lock is held there, which is treated
as atomic context. Since memory_failure() takes a mutex, that would trigger
"BUG: sleeping function called from invalid context at kernel/locking/mutex.c".
If TDX invokes memory_failure_queue() instead, it looks like guest_memfd could
invoke memory_failure_queue_kick() to ensure the HWPoison bit gets set in a
timely manner. But which component would invoke memory_failure_queue_kick() for
S-EPT pages? KVM?
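Something like the following split is what I have in mind (tdx_queue_poison()
and tdx_flush_poison() are made-up names just for illustration, and the 0
passed as flags to memory_failure_queue() is only a placeholder):

#include <linux/mm.h>       /* memory_failure_queue(), memory_failure_queue_kick() */
#include <linux/cpumask.h>

/*
 * Called while the write mmu_lock is held (atomic context), e.g. when
 * removal of a guest private page or an S-EPT page fails: only queue,
 * never sleep.
 */
static void tdx_queue_poison(struct page *page)
{
        /*
         * memory_failure_queue() just pushes the pfn into a per-CPU kfifo
         * and schedules a work item, so it is fine in atomic context.
         */
        memory_failure_queue(page_to_pfn(page), 0);
}

/*
 * Called after the write mmu_lock has been dropped, so that the queued
 * entries are processed and the HWPoison bit is set promptly.
 */
static void tdx_flush_poison(void)
{
        int cpu;

        /*
         * memory_failure_queue_kick() drains the per-CPU queue of one CPU
         * and may sleep; kicking every online CPU is the simple (if heavy)
         * option when we don't know which CPU queued the entry.
         */
        for_each_online_cpu(cpu)
                memory_failure_queue_kick(cpu);
}

The question above is basically who would call the second half for S-EPT pages.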
> Following up as I think about this more: perhaps we don't need to check for
> HWpoison (or TDX unmap errors) on conversion.
Hmm, yes. The HWPoison bit is checked in __kvm_gmem_get_pfn() and __do_fault(),
so it looks like we don't need to check it on conversion for the purpose of
disallowing shared memory access.
My previous mail was based on another bit, and I was not aware of the HWPoison
check in __do_fault().
The conversion will then succeed without any HWPoison check during conversion,
though a later host access will produce the error log
"MCE: Killing ... due to hardware memory corruption fault at ...".
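For reference, this is roughly (paraphrased with made-up helper names, not the
exact upstream code) the kind of check those two paths do, so a poisoned page
can't come back in via either the private or the shared route:

#include <linux/mm.h>
#include <linux/page-flags.h>

/* What __kvm_gmem_get_pfn() effectively does on the private (gmem) path. */
static int gmem_refuse_poisoned(struct folio *folio)
{
        if (folio_test_hwpoison(folio))
                return -EHWPOISON;
        return 0;
}

/* What __do_fault() effectively does on the shared (host page table) path. */
static vm_fault_t host_refuse_poisoned(struct page *page)
{
        if (PageHWPoison(page))
                return VM_FAULT_HWPOISON;
        return 0;
}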
> On a high level, we don't need to check for HWpoison because conversion
> is about changing memory metadata, as in memory privacy status and
> struct folio sizes, and not touching memory contents at all. HWpoison
> means the memory and its contents shouldn't be used.
>
> Specifically for private-to-shared conversions where the TDX unmap error
> can happen, we will
>
> 1. HWpoison the page
> 2. Bug the TD
>
> This falsely successful conversion means the host (guest_memfd) will
> think the memory is shared while it may still be mapped in Secure-EPTs.
>
> I think that is okay because the only existing user (TD) stops using
> that memory, and no future users can use the memory:
>
> 1. The TD will be bugged by then. A non-running TD cannot touch memory
> that had the error on unmapping.
>
> 2. The page was not mapped into host page tables (since it was
> private). Even if it were mapped, it would be unmapped from host page
> tables (host page table unmaps don't fail). If the host tries to
> touch the memory, on the next fault, core-mm would notice that the
> page is poisoned and not fault it in.
>
> By the way, when we "bug the TD", can we assume that ALL vCPUs, not just
> the one that did the failed unmap, will stop running?
Right. All the vCPUs will be kicked out of non-root mode once the VM is bugged.
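To spell it out with a simplified sketch (tdx_sept_remove_page() is a made-up
name and the error handling is stripped down): on a SEAMCALL failure the unmap
path bugs the VM via KVM_BUG_ON(); kvm_vm_bugged() then raises KVM_REQ_VM_DEAD
on every vCPU, which kicks them all out of non-root mode, and vcpu_run()
refuses to re-enter the guest afterwards.

#include <linux/kvm_host.h>
#include <linux/mm.h>

static int tdx_sept_remove_page(struct kvm *kvm, struct page *page, u64 err)
{
        if (KVM_BUG_ON(err, kvm)) {
                /* Keep the page out of circulation, as in the earlier sketch. */
                memory_failure_queue(page_to_pfn(page), 0);
                return -EIO;
        }
        return 0;
}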
> I guess even if the other vCPUs don't stop running, the TD's vCPUs will
> access the page as shared, thinking the conversion succeeded, and keep
> hitting #VEs. If the TD accesses the page as private, it's fine since
> the page was not unmapped from Secure-EPTs due to the unmap failure, and
> the host cannot write to it (the host will see HWpoison on the next
> fault), so there's no host crash and the purpose of guest_memfd isn't
> defeated.
>
> If a guest_memfd with a HWpoisoned page is linked to a new, runnable
> TD, the new TD would need to fault in the page as private. When the page
> is faulted into the new TD, the fault will hit the HWpoison and
> userspace will get to know about it.
I'm OK with just checking HWPoison on the next fault or on dequeue from hugetlb.
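i.e. roughly something like this on the hugetlb dequeue side
(gmem_check_dequeued_folio() is a made-up name; whether
folio_test_has_hwpoisoned() is also needed to catch sub-page poison of huge
folios is an open question):

#include <linux/errno.h>
#include <linux/page-flags.h>
#include <linux/pagemap.h>

/*
 * Hypothetical check when guest_memfd dequeues a huge folio from its
 * backing pool: refuse a poisoned folio instead of handing it out again.
 */
static int gmem_check_dequeued_folio(struct folio *folio)
{
        if (folio_test_hwpoison(folio) ||
            folio_test_has_hwpoisoned(folio))   /* sub-page poison in a huge folio */
                return -EHWPOISON;
        return 0;
}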
> Yan, Rick, let me know what you think of this!