Message-ID: <diqz7c2lr6wg.fsf@ackerleytng-ctop.c.googlers.com>
Date: Mon, 12 May 2025 12:00:31 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: vannapurve@...gle.com, pbonzini@...hat.com, seanjc@...gle.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	rick.p.edgecombe@...el.com, dave.hansen@...el.com, kirill.shutemov@...el.com, 
	tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, 
	david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, 
	jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

Yan Zhao <yan.y.zhao@...el.com> writes:

>> <snip>
>>
>> Very likely because these operations simply don't fail.
>
> I think they are intentionally designed to be no-fail.
>
> e.g. in __iopt_area_unfill_domain(), no-fail is achieved by using a small backup
> buffer allocated on stack in case of kmalloc() failure.
>
>
>> >
>> > That's why we rely on increasing the folio ref count to reflect failures,
>> > which are due to unexpected SEAMCALL errors.
>>
>> TDX stack is adding a scenario where invalidation can fail, a cleaner
>> solution would be to propagate the result as an invalidation failure.
> Not sure if the Linux kernel accepts unmap failures.
>
>> Another option is to notify guest_memfd out of band to convey the
>> ranges that failed invalidation.
> Yes, this might be better. Something similar to holding the folio ref count to
> let guest_memfd know that a certain PFN cannot be re-assigned.
>
>> With in-place conversion supported, even if the refcount is raised for
>> such pages, they can still get used by the host if the guest_memfd is
>> unaware that the invalidation failed.
> I thought guest_memfd should check if folio ref count is 0 (or a base count)
> before conversion, splitting or re-assignment. Otherwise, why do you care if
> TDX holds the ref count? :)
>

IIUC the question here is how we should handle failures in unmapping of
private memory, which should be a rare occurrence.

I think there are two options here:

1. Fail on unmapping *private* memory

2. Don't fail on unmapping *private* memory; instead, tell the owner of
   the memory that this memory is never to be used again.

I think Option 1 is better because it is more direct and provides timely
feedback to the caller when the issue happens. There is also room to
provide more context here, such as the address at which the failure
occurred.

It does seem like, in general, unmapping memory is not allowed to fail,
but I think that convention applies to shared memory (even in KVM MMU
notifiers). Would it be possible to establish a new contract that
unmapping can fail for private pages?

The kernel/KVM-internal functions for unmapping GFNs can be modified to
return an error when unmapping private memory. Specifically, when
KVM_FILTER_PRIVATE [1] is set, the unmapping function can return an
error; when it is not set, the caller should not expect failures.
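
To illustrate, a rough sketch (not real kernel code: today
kvm_unmap_gfn_range() returns a "flush needed" bool rather than an
error, and unmap_private_may_fail() is a made-up stand-in for the
arch-specific, e.g. TDX, unmap path):

  /*
   * Sketch: allow failure only when the range is filtered to private
   * memory; shared memory keeps its existing no-fail contract.
   */
  static int kvm_unmap_gfn_range_filtered(struct kvm *kvm,
                                          struct kvm_gfn_range *range)
  {
          if (range->attr_filter & KVM_FILTER_PRIVATE) {
                  /* May fail, e.g. on unexpected SEAMCALL errors. */
                  return unmap_private_may_fail(kvm, range);
          }

          /* Shared memory: unmapping is expected to always succeed. */
          if (kvm_unmap_gfn_range(kvm, range))
                  kvm_flush_remote_tlbs(kvm);
          return 0;
  }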

IIUC the only places where private memory is unmapped now are
guest_memfd's truncate and (future) convert operations, so guest_memfd
can handle those failures or return failure to userspace.
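
guest_memfd could then do something like this (again just a sketch;
kvm_gmem_unmap_private() and gmem_set_shareability() are made-up names
for the unmap and shareability-tracking steps):

  /* Sketch of a conversion path that surfaces unmap failures. */
  static int kvm_gmem_convert_to_shared(struct file *file,
                                        pgoff_t start, pgoff_t end)
  {
          int ret;

          /* Unmap private pages first; this may now fail. */
          ret = kvm_gmem_unmap_private(file, start, end);
          if (ret)
                  return ret;     /* propagated to userspace */

          /* Only update shareability if unmapping fully succeeded. */
          return gmem_set_shareability(file, start, end);
  }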

Option 2 is possible too, but seems a little awkward. For conversion,
the general steps are to (a) unmap pages from host, guest, or both page
tables, and (b) change the shareability status in guest_memfd. It seems
awkward to let step (a) pass even though there was an error, and then
proceed to (b) only to discover somewhere (via refcount or otherwise)
that there was an issue and the conversion needs to fail.

Currently, for shared to private conversions (I will be posting this 1G
page support series, with conversions, soon), I check that refcounts
match a safe refcount before permitting the conversion (an error is
returned to userspace on failure).
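
That check is conceptually just this (sketch; gmem_safe_refs() is a
made-up helper for computing the expected/safe refcount):

  /* Refuse shared-to-private conversion while extra refs exist. */
  static int gmem_check_refs_for_private(struct folio *folio)
  {
          if (folio_ref_count(folio) != gmem_safe_refs(folio))
                  return -EAGAIN; /* reported to userspace */
          return 0;
  }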

For private to shared conversions, there is no such check. At conversion
time, when splitting pages, I just spin in the kernel waiting for any
speculative refcounts to go away. The refcount check at (shared to
private) conversion time is thus currently purely there to ensure a safe
merge process.
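
Conceptually the split path does something like the following (sketch,
reusing the made-up gmem_safe_refs() helper from above):

  /* Wait out transient/speculative refs before splitting. */
  while (folio_ref_count(folio) > gmem_safe_refs(folio))
          cond_resched();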

It is possible to check the refcounts of all private pages (split or
huge) in the requested conversion range to handle unmapping failures,
but that seems expensive to do for every conversion, for possibly many
4K pages, just to catch a rare error case.

Also, if we relied on this refcount check to find the error, there would
be no way to tell whether an elevated refcount indicates an unmapping
error or just a speculative reference, so guest_memfd would have to
return -EAGAIN for private to shared conversions in both cases. This
would make conversions complicated to handle in userspace, since the
userspace VMM wouldn't know whether it should retry (for speculative
refcounts) or give up (because of the unmapping error). Returning a
different error on unmapping failure would allow userspace to handle the
two cases differently.
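
With distinct errors, the VMM side could look something like this
(sketch; the GMEM_CONVERT_SHARED ioctl, the choice of EIO, and
handle_unmap_failure() are made up for illustration):

  for (;;) {
          if (ioctl(gmem_fd, GMEM_CONVERT_SHARED, &range) == 0)
                  break;          /* converted */
          if (errno == EAGAIN)
                  continue;       /* speculative refcount, retry */
          /* e.g. EIO: unmapping failed, the range must not be reused. */
          handle_unmap_failure(&range);
          break;
  }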

Regarding Option 2, another way to indicate an error could be to mark
the page as poisoned, but then again that would overlap/shadow true
memory poisoning.

In summary, I think Option 1 is best: return an error within the
kernel, and have the caller (for now, only guest_memfd unmaps private
memory) handle the error.

[1] https://github.com/torvalds/linux/blob/627277ba7c2398dc4f95cc9be8222bb2d9477800/include/linux/kvm_host.h#L260

>
>> >
>> > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
>> > > > that fails to be unmapped.
>> > > >
>> > > >
>> > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
>> > > >
>> > > >
>> > > > > >
>> > > > > >
>> > > > > > > Any guest_memfd range updates will result in invalidations/updates of
>> > > > > > > userspace, guest, IOMMU or any other page tables referring to
>> > > > > > > guest_memfd backed pfns. This story will become clearer once the
>> > > > > > > support for PFN range allocator for backing guest_memfd starts getting
>> > > > > > > discussed.
>> > > > > > Ok. It is indeed unclear right now to support such kind of memory.
>> > > > > >
>> > > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
>> > > > > > into private EPT until TDX connect.
>> > > > >
>> > > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
>> > > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
>> > > > > connect/Sev TIO, major difference would be that guest_memfd private
>> > > > > ranges will be mapped into IOMMU page tables.
>> > > > >
>> > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
>> > > > > have been discussions on not using page structs for private memory
>> > > > > ranges altogether [1] even with hugetlb allocator, which will simplify
>> > > > > seamless merge/split story for private hugepages to support memory
>> > > > > conversion. So I think the general direction we should head towards is
>> > > > > not relying on refcounts for guest_memfd private ranges and/or page
>> > > > > structs altogether.
>> > > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
>> > > > keep all necessary info.
>> > > >
>> > >
>> > > Story will become clearer once VM_PFNMAP'd memory support starts
>> > > getting discussed. In case of guest_memfd, there is flexibility to
>> > > store metadata for physical ranges within guest_memfd just like
>> > > shareability tracking.
>> > Ok.
>> >
>> > > >
>> > > > > I think the series [2] to work better with PFNMAP'd physical memory in
>> > > > > KVM is in the very right direction of not assuming page struct backed
>> > > > > memory ranges for guest_memfd as well.
>> > > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
>> > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
>> > > >
>> > > >
>> > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
>> > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
>> > > > >
>> > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
>> > > > > > driver is VFIO pci driver rather than guest_memfd.
>> > > > >
>> > > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
>> > > > > by VM_PFNMAP memory.
>> > > > >
>> > > > > >
>> > > > > >
>> > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
>> > >
