linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGtprH9hwj7BvSm4DgRkHmdPnmi-1-FMH5Z7xK1VBh=s4W8VYA@mail.gmail.com>
Date: Thu, 8 May 2025 07:10:19 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com, 
	ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, 
	david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, 
	jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > >
> > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > > >
> > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > >
> > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > >
> > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > conversion.
> > > > > > >
> > > > > > > We can include it in the next version If you think it's better.
> > > > > >
> > > > > > Ackerley is planning to send out a series for 1G Hugetlb support with
> > > > > > guest memfd soon, hopefully this week. Plus I don't see any reason to
> > > > > > hold extra refcounts in TDX stack so it would be good to clean up this
> > > > > > logic.
> > > > > >
> > > > > > >
> > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > than a pfn.
> > > > > > >
> > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > invalidation completes.
> > > > > > > A curious question:
> > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > to the page struct to hold ref count and map count?
> > > > > > >
> > > > > >
> > > > > > I imagine the needed support will match similar semantics as VM_PFNMAP
> > > > > > [1] memory. No need to maintain refcounts/map counts for such physical
> > > > > > memory ranges as all users will be notified when mappings are
> > > > > > changed/removed.
> > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > simultaneously?
> > > >
> > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > shared memory regions in order to support CoCo VM usecases.
> > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > to ensure that.
> > >
> > > Or do you believe that by eliminating the struct page, there would be no
> > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > response to a guest_memfd invalidation notification?
> >
> > Yes.
> >
> > >
> > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > no need to register mmu notifier. So why users like VFIO must register
> > > guest_memfd invalidation notification?
> >
> > VM_PFNMAP'd memory can't be long term pinned, so users of such memory
> > ranges will have to adopt mechanisms to get notified. I think it would
> Hmm, in current VFIO, it does not register any notifier for VM_PFNMAP'd memory.

I don't completely understand how VM_PFNMAP'd memory is used today for
VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
for normal memory backed by pfnmap is yet to materialize.

>
> > be easy to pursue new users of guest_memfd to follow this scheme.
> > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > hugepage support already needs the stance of: "Guest memfd owns all
> > long-term refcounts on private memory" as discussed at LPC [1].
> >
> > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > (slide 12)
> >
> > >
> > > Besides, how would guest_memfd handle potential unmap failures? e.g. what
> > > happens to prevent converting a private PFN to shared if there are errors when
> > > TDX unmaps a private PFN or if a device refuses to stop DMAing to a PFN.
> >
> > Users will have to signal such failures via the invalidation callback
> > results or other appropriate mechanisms. guest_memfd can relay the
> > failures up the call chain to the userspace.
> AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().

Very likely because these operations simply don't fail.

>
> That's why we rely on increasing folio ref count to reflect failure, which are
> due to unexpected SEAMCALL errors.

TDX stack is adding a scenario where invalidation can fail, a cleaner
solution would be to propagate the result as an invalidation failure.
Another option is to notify guest_memfd out of band to convey the
ranges that failed invalidation.

With in-place conversion supported, even if the refcount is raised for
such pages, they can still get used by the host if the guest_memfd is
unaware that the invalidation failed.

>
> > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > that fails to be unmapped.
> > >
> > >
> > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > >
> > >
> > > > >
> > > > >
> > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > guest_memfd backed pfns. This story will become clearer once the
> > > > > > support for PFN range allocator for backing guest_memfd starts getting
> > > > > > discussed.
> > > > > Ok. It is indeed unclear right now to support such kind of memory.
> > > > >
> > > > > Up to now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > into private EPT until TDX connect.
> > > >
> > > > There is a plan to use VM_PFNMAP memory for all of guest_memfd
> > > > shared/private ranges orthogonal to TDX connect usecase. With TDX
> > > > connect/Sev TIO, major difference would be that guest_memfd private
> > > > ranges will be mapped into IOMMU page tables.
> > > >
> > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > have been discussions on not using page structs for private memory
> > > > ranges altogether [1] even with hugetlb allocator, which will simplify
> > > > seamless merge/split story for private hugepages to support memory
> > > > conversion. So I think the general direction we should head towards is
> > > > not relying on refcounts for guest_memfd private ranges and/or page
> > > > structs altogether.
> > > It's fine to use PFN, but I wonder if there're counterparts of struct page to
> > > keep all necessary info.
> > >
> >
> > Story will become clearer once VM_PFNMAP'd memory support starts
> > getting discussed. In case of guest_memfd, there is flexibility to
> > store metadata for physical ranges within guest_memfd just like
> > shareability tracking.
> Ok.
>
> > >
> > > > I think the series [2] to work better with PFNMAP'd physical memory in
> > > > KVM is in the very right direction of not assuming page struct backed
> > > > memory ranges for guest_memfd as well.
> > > Note: Currently, VM_PFNMAP is usually used together with flag VM_IO. in KVM
> > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > >
> > >
> > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > >
> > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > driver is VFIO pci driver rather than guest_memfd.
> > > >
> > > > Not necessary. As I mentioned above guest_memfd ranges will be backed
> > > > by VM_PFNMAP memory.
> > > >
> > > > >
> > > > >
> > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> >