Message-ID: <f454a913de6233b3afaacac9a2284a8359e3a5b7.camel@intel.com>
Date: Mon, 12 May 2025 21:44:11 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "ackerleytng@...gle.com" <ackerleytng@...gle.com>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Shutemov, Kirill"
<kirill.shutemov@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Hansen, Dave"
<dave.hansen@...el.com>, "david@...hat.com" <david@...hat.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>, "tabba@...gle.com"
<tabba@...gle.com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Du, Fan"
<fan.du@...el.com>, "michael.roth@....com" <michael.roth@....com>,
"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "Li, Zhiquan1" <zhiquan1.li@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Yamahata,
Isaku" <isaku.yamahata@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "jroedel@...e.de"
<jroedel@...e.de>, "Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com"
<pgonda@...gle.com>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Mon, 2025-05-12 at 12:00 -0700, Ackerley Tng wrote:
> IIUC the question here is how we should handle failures in unmapping of
> private memory, which should be a rare occurrence.
>
> I think there are two options here
>
> 1. Fail on unmapping *private* memory
>
> 2. Don't fail on unmapping *private* memory, instead tell the owner of
> the memory that this memory is never to be used again.
>
> I think option 1 is better because it is more direct and provides timely
> feedback to the caller when the issue happens. There is also room to
> provide even more context about the address of the failure here.
>
> It does seem that, generally, unmapping memory does not support
> failing, but I think that applies to shared memory (even in the KVM MMU
> notifiers). Would it be possible to establish a new contract that, for
> private pages, unmapping can fail?
>
> The kernel/KVM-internal functions for unmapping GFNs can be modified to
> return error when unmapping private memory. Specifically, when
> KVM_FILTER_PRIVATE [1] is set, then the unmapping function can return an
> error and if not then the caller should not expect failures.
>
> IIUC the only places where private memory is unmapped now is via
> guest_memfd's truncate and (future) convert operations, so guest_memfd
> can handle those failures or return failure to userspace.
1. Private->shared memory conversion
2. Memslot deletion
3. kvm_gmem_invalidate_begin() callers
>
> Option 2 is possible too - but seems a little awkward. For conversion
> the general steps are to (a) unmap pages from either host, guest or both
> page tables (b) change shareability status in guest_memfd. It seems
> awkward to first let step (a) pass even though there was an error, and
> then proceed to (b) only to check somewhere (via refcount or otherwise)
> that there was an issue and the conversion needs to fail.
>
> Currently for shared to private conversions (I will be posting this 1G
> page support series, with conversions, soon), I check that refcounts
> equal a safe refcount before permitting the conversion (an error is
> returned to userspace on failure).
>
> For private to shared conversions, there is no check. At conversion
> time, when splitting pages, I just spin in the kernel waiting for any
> speculative refcounts to go away. The refcount check at conversion time
> is currently purely to ensure a safe merge process.
>
> It is possible to check all the refcounts of private pages (split or
> huge page) in the requested conversion range to handle unmapping
> failures, but that seems expensive to do for every conversion, for
> possibly many 4K pages, just to find a rare error case.
>
> Also, if we do this refcount check to find the error, there wouldn't be
> any way to tell whether an elevated refcount indicated an unmapping
> error or was merely speculative, so guest_memfd would just have to
> return -EAGAIN for private to shared conversions. That would make
> conversions complicated to handle in userspace, since the userspace VMM
> wouldn't know whether it should retry (for speculative refcounts) or
> give up (because of an unmapping error). Returning a distinct error on
> unmapping failure would let userspace handle the two cases differently.
>
> Regarding Option 2, another way to indicate an error could be to mark
> the page as poisoned, but then again that would overlap/shadow true
> memory poisoning.
>
> In summary, I think Option 1 is best, which is that we return error
> within the kernel, and the caller (for now only guest_memfd unmaps
> private memory) should handle the error.
When we get to huge pages we will have two error conditions on the KVM side:
1. Fail to split
2. A TDX module error
For TDX module errors, today we are essentially talking about bugs. The handling
in the case of a TDX module bug should be to bug the VM and to prevent the
memory from being freed back to the kernel. In that case the unmapping sort of
succeeded, albeit destructively.
So I'm not sure if userspace needs to know about the TDX module bugs (they are
going to find out anyway on the next KVM ioctl). But if we plumbed the error
code all the way through to guestmemfd, then I guess why not tell them.
On whether we should go to the trouble, could another option be to expose a
guestmemfd function that allows for "poisoning" the memory, and have the TDX bug
paths call it. This way guestmemfd could know specifically about unrecoverable
errors. For splitting failures, we can return an error to guestmemfd without
plumbing the error code all the way through.
Perhaps it might be worth doing a quick POC to see how bad plumbing the error
code all the way looks.