Message-ID: <CAGtprH8=-70DU2e52OJe=w0HfuW5Zg6wGHV32FWD_hQzYBa=fA@mail.gmail.com>
Date: Fri, 9 May 2025 07:20:30 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: pbonzini@...hat.com, seanjc@...gle.com, linux-kernel@...r.kernel.org, 
	kvm@...r.kernel.org, x86@...nel.org, rick.p.edgecombe@...el.com, 
	dave.hansen@...el.com, kirill.shutemov@...el.com, tabba@...gle.com, 
	ackerleytng@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, 
	david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, 
	jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

On Thu, May 8, 2025 at 8:22 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Thu, May 08, 2025 at 07:10:19AM -0700, Vishal Annapurve wrote:
> > On Wed, May 7, 2025 at 6:32 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > >
> > > On Wed, May 07, 2025 at 07:56:08AM -0700, Vishal Annapurve wrote:
> > > > On Wed, May 7, 2025 at 12:39 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > >
> > > > > On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> > > > > > On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > > >
> > > > > > > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > > > > > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > > > > >
> > > > > > > > > Sorry for the late reply, I was on leave last week.
> > > > > > > > >
> > > > > > > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > > > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > > > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in future, only invoking
> > > > > > > > > > > folio_ref_add() in the event of a removal failure.
> > > > > > > > > >
> > > > > > > > > > In my opinion, the above scheme can be deployed with this series
> > > > > > > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > > > > > > I initially intended to add a separate patch at the end of this series to
> > > > > > > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > > > > > > decided against it since it's not a must before guest_memfd supports in-place
> > > > > > > > > conversion.
> > > > > > > > >
> > > > > > > > > We can include it in the next version if you think it's better.
> > > > > > > >
> > > > > > > > Ackerley is planning to send out a series for 1G HugeTLB support with
> > > > > > > > guest_memfd soon, hopefully this week. Plus, I don't see any reason to
> > > > > > > > hold extra refcounts in the TDX stack, so it would be good to clean up
> > > > > > > > this logic.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > > > > > > page structs, but that problem can be solved in future possibly by
> > > > > > > > > With current TDX code, all memory must be backed by a page struct.
> > > > > > > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > > > > > > than a pfn.
> > > > > > > > >
> > > > > > > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > > > > > > invalidation completes.
> > > > > > > > > A curious question:
> > > > > > > > > To support memory not backed by page structs in future, is there any counterpart
> > > > > > > > > to the page struct to hold ref count and map count?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I imagine the needed support will follow semantics similar to VM_PFNMAP
> > > > > > > > [1] memory. There is no need to maintain refcounts/map counts for such
> > > > > > > > physical memory ranges, as all users will be notified when mappings are
> > > > > > > > changed/removed.
> > > > > > > So, it's possible to map such memory in both shared and private EPT
> > > > > > > simultaneously?
> > > > > >
> > > > > > No, guest_memfd will still ensure that userspace can only fault in
> > > > > > shared memory regions in order to support CoCo VM usecases.
> > > > > Before guest_memfd converts a PFN from shared to private, how does it ensure
> > > > > there are no shared mappings? e.g., in [1], it uses the folio reference count
> > > > > to ensure that.
> > > > >
> > > > > Or do you believe that by eliminating the struct page, there would be no
> > > > > GUP, thereby ensuring no shared mappings by requiring all mappers to unmap in
> > > > > response to a guest_memfd invalidation notification?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > As in Documentation/core-api/pin_user_pages.rst, long-term pinning users have
> > > > > no need to register an mmu notifier. So why must users like VFIO register for
> > > > > guest_memfd invalidation notifications?
> > > >
> > > > VM_PFNMAP'd memory can't be long-term pinned, so users of such memory
> > > > ranges will have to adopt mechanisms to get notified. I think it would
> > > Hmm, current VFIO does not register any notifier for VM_PFNMAP'd memory.
> >
> > I don't completely understand how VM_PFNMAP'd memory is used today for
> > VFIO. Maybe only MMIO regions are backed by pfnmap today and the story
> > for normal memory backed by pfnmap is yet to materialize.
> VFIO can fault in VM_PFNMAP'd memory that is not from MMIO regions. It works
> because it knows VM_PFNMAP'd memory is always pinned.
>
> Another example is udmabuf (drivers/dma-buf/udmabuf.c), which mmaps normal
> folios with the VM_PFNMAP flag without registering an mmu notifier, because
> those folios are pinned.
>

I might be wrongly throwing out some terminology here then.
The VM_PFNMAP flag can be set for memory backed by folios/page structs.
udmabuf seems to be working with pinned "folios" in the backend.

The goal is to get to a stage where guest_memfd is backed by PFN ranges
unmanaged by the kernel, which guest_memfd owns and distributes to
userspace, KVM and the IOMMU subject to shareability attributes. If the
shareability changes, the users will get notified and will have to
invalidate their mappings. guest_memfd will allow mmapping such ranges
with the VM_PFNMAP flag set by default in the VMAs to indicate the need
for special handling/lack of page structs.

As an intermediate stage, it makes sense to me to just not have
private memory backed by page structs and use a special "filemap" to
map file offsets to these private memory ranges. This step will also
need a similar contract with users (a rough sketch follows below):
   1) memory is pinned by guest_memfd
   2) users will get invalidation notifications on shareability changes
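
As a very rough illustration of that contract, the notifier side could
look something like the sketch below (all names are invented for the
example, not taken from any posted series):

#include <linux/list.h>
#include <linux/pagemap.h>

/*
 * Hypothetical interface sketch: users of guest_memfd-backed ranges
 * (KVM, IOMMU drivers, ...) register a notifier and must drop their
 * mappings when the shareability of a range changes.
 */
struct gmem_invalidate_notifier {
	/*
	 * Called before [start, end) changes shareability; return 0 once
	 * all mappings are gone, or an error so guest_memfd can refuse to
	 * re-assign the range (e.g. after a failed TDX unmap).
	 */
	int (*invalidate)(struct gmem_invalidate_notifier *n,
			  pgoff_t start, pgoff_t end);
	struct list_head list;
};

int gmem_register_invalidate_notifier(struct file *gmem_file,
				      struct gmem_invalidate_notifier *n);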

I am sure there is a lot of work here and many quirks to be addressed;
let's discuss this more once there is better context around. A few
related RFC series are planned to be posted in the near future.

> > >
> > > > be easy to persuade new users of guest_memfd to follow this scheme.
> > > > Irrespective of whether VM_PFNMAP'd support lands, guest_memfd
> > > > hugepage support already needs the stance of: "Guest memfd owns all
> > > > long-term refcounts on private memory" as discussed at LPC [1].
> > > >
> > > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
> > > > (slide 12)
> > > >
> > > > >
> > > > > Besides, how would guest_memfd handle potential unmap failures? e.g., what
> > > > > prevents converting a private PFN to shared if there are errors when TDX
> > > > > unmaps a private PFN, or if a device refuses to stop DMAing to a PFN?
> > > >
> > > > Users will have to signal such failures via the invalidation callback
> > > > results or other appropriate mechanisms. guest_memfd can relay the
> > > > failures up the call chain to userspace.
> > > AFAIK, operations that perform actual unmapping do not allow failure, e.g.
> > > kvm_mmu_unmap_gfn_range(), iopt_area_unfill_domains(),
> > > vfio_iommu_unmap_unpin_all(), vfio_iommu_unmap_unpin_reaccount().
> >
> > Very likely because these operations simply don't fail.
>
> I think they are intentionally designed to be no-fail.
>
> e.g., in __iopt_area_unfill_domain(), no-fail is achieved by using a small
> backup buffer allocated on the stack in case of kmalloc() failure.
>
>
> > >
> > > That's why we rely on increasing the folio ref count to reflect failures,
> > > which are due to unexpected SEAMCALL errors.
> >
> > The TDX stack is adding a scenario where invalidation can fail; a cleaner
> > solution would be to propagate the result as an invalidation failure.
> Not sure if the Linux kernel accepts unmap failures.
>
> > Another option is to notify guest_memfd out of band to convey the
> > ranges that failed invalidation.
> Yes, this might be better. Something similar to holding a folio ref count to
> let guest_memfd know that a certain PFN cannot be re-assigned.
>
> > With in-place conversion supported, even if the refcount is raised for
> > such pages, they can still get used by the host if the guest_memfd is
> > unaware that the invalidation failed.
> I thought guest_memfd should check whether the folio ref count is 0 (or at a
> base count) before conversion, splitting or re-assignment. Otherwise, why do
> you care if TDX holds the ref count? :)
>

The soon-to-be-posted RFC series by Ackerley currently checks for safe
private page refcounts only when folio splitting is needed, not for
every private-to-shared conversion. A simple solution would be for
guest_memfd to check safe page refcounts on each private-to-shared
conversion even if a split is not required, but this will need to be
reworked once either of the stages discussed above lands and page
structs are no longer around.
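
For illustration, a minimal sketch of such a per-conversion check could
be the following (the helper name and the exact base refcount are
assumptions, not from the series):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical sketch: before a private-to-shared conversion, verify that
 * nobody beyond guest_memfd itself still references the backing folio, so
 * that an elevated refcount left behind by a failed unmap (e.g. a SEAMCALL
 * error) blocks re-use of the page.
 */
static bool gmem_folio_safe_to_convert(struct folio *folio)
{
	const int base_refs = 1;	/* the filemap/guest_memfd reference */

	return folio_ref_count(folio) == base_refs && !folio_mapped(folio);
}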

>
> > >
> > > > > Currently, guest_memfd can rely on page ref count to avoid re-assigning a PFN
> > > > > that fails to be unmapped.
> > > > >
> > > > >
> > > > > [1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Any guest_memfd range updates will result in invalidations/updates of
> > > > > > > > userspace, guest, IOMMU or any other page tables referring to
> > > > > > > > guest_memfd-backed PFNs. This story will become clearer once support
> > > > > > > > for a PFN range allocator backing guest_memfd starts getting
> > > > > > > > discussed.
> > > > > > > Ok. It is indeed unclear right now how to support such kind of memory.
> > > > > > >
> > > > > > > As of now, we don't anticipate TDX will allow any mapping of VM_PFNMAP memory
> > > > > > > into private EPT until TDX Connect.
> > > > > >
> > > > > > There is a plan to use VM_PFNMAP memory for all guest_memfd
> > > > > > shared/private ranges, orthogonal to the TDX Connect usecase. With TDX
> > > > > > Connect/SEV-TIO, the major difference would be that guest_memfd private
> > > > > > ranges will be mapped into IOMMU page tables.
> > > > > >
> > > > > > Irrespective of whether/when VM_PFNMAP memory support lands, there
> > > > > > have been discussions on not using page structs for private memory
> > > > > > ranges altogether [1], even with the hugetlb allocator, which would
> > > > > > simplify the seamless merge/split story for private hugepages needed
> > > > > > to support memory conversion. So I think the general direction we
> > > > > > should head towards is not relying on refcounts and/or page structs
> > > > > > for guest_memfd private ranges altogether.
> > > > > It's fine to use PFNs, but I wonder if there are counterparts of struct page to
> > > > > keep all the necessary info.
> > > > >
> > > >
> > > > The story will become clearer once VM_PFNMAP'd memory support starts
> > > > getting discussed. In the case of guest_memfd, there is flexibility to
> > > > store metadata for physical ranges within guest_memfd itself, just like
> > > > shareability tracking.
> > > Ok.
> > >
> > > > >
> > > > > > I think the series [2], which makes KVM work better with PFNMAP'd
> > > > > > physical memory, is headed in the right direction of not assuming
> > > > > > page-struct-backed memory ranges for guest_memfd as well.
> > > > > Note: Currently, VM_PFNMAP is usually used together with the VM_IO flag. In KVM,
> > > > > hva_to_pfn_remapped() only applies to "vma->vm_flags & (VM_IO | VM_PFNMAP)".
> > > > >
> > > > >
> > > > > > [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> > > > > > [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> > > > > >
> > > > > > > And even in that scenario, the memory is only for private MMIO, so the backend
> > > > > > > driver is the VFIO PCI driver rather than guest_memfd.
> > > > > >
> > > > > > Not necessarily. As I mentioned above, guest_memfd ranges will be backed
> > > > > > by VM_PFNMAP memory.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
> > > >
