Message-ID: <aBsNsZsWuVl4uo0j@yzhao56-desk.sh.intel.com>
Date: Wed, 7 May 2025 15:37:21 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: <pbonzini@...hat.com>, <seanjc@...gle.com>,
	<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
	<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
	<kirill.shutemov@...el.com>, <tabba@...gle.com>, <ackerleytng@...gle.com>,
	<quic_eberman@...cinc.com>, <michael.roth@....com>, <david@...hat.com>,
	<vbabka@...e.cz>, <jroedel@...e.de>, <thomas.lendacky@....com>,
	<pgonda@...gle.com>, <zhiquan1.li@...el.com>, <fan.du@...el.com>,
	<jun.miao@...el.com>, <ira.weiny@...el.com>, <isaku.yamahata@...el.com>,
	<xiaoyao.li@...el.com>, <binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
 pages

On Tue, May 06, 2025 at 06:18:55AM -0700, Vishal Annapurve wrote:
> On Mon, May 5, 2025 at 11:07 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > On Mon, May 05, 2025 at 10:08:24PM -0700, Vishal Annapurve wrote:
> > > On Mon, May 5, 2025 at 5:56 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > >
> > > > Sorry for the late reply; I was on leave last week.
> > > >
> > > > On Tue, Apr 29, 2025 at 06:46:59AM -0700, Vishal Annapurve wrote:
> > > > > On Mon, Apr 28, 2025 at 5:52 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> > > > > > So, we plan to remove folio_ref_add()/folio_put_refs() in the future,
> > > > > > only invoking folio_ref_add() in the event of a removal failure.
> > > > >
> > > > > In my opinion, the above scheme can be deployed with this series
> > > > > itself. guest_memfd will not take away memory from TDX VMs without an
> > > > I initially intended to add a separate patch at the end of this series to
> > > > implement invoking folio_ref_add() only upon a removal failure. However, I
> > > > decided against it since it isn't required until guest_memfd supports
> > > > in-place conversion.
> > > >
> > > > We can include it in the next version if you think it's better.
> > >
> > > Ackerley is planning to send out a series for 1G HugeTLB support with
> > > guest_memfd soon, hopefully this week. Plus, I don't see any reason to
> > > hold extra refcounts in the TDX stack, so it would be good to clean up
> > > this logic.
> > >
> > > >
> > > > > invalidation. folio_ref_add() will not work for memory not backed by
> > > > > page structs, but that problem can be solved in the future, possibly by
> > > > With current TDX code, all memory must be backed by a page struct.
> > > > Both tdh_mem_page_add() and tdh_mem_page_aug() require a "struct page *" rather
> > > > than a pfn.
> > > >
> > > > > notifying guest_memfd of certain ranges being in use even after
> > > > > invalidation completes.
> > > > A curious question:
> > > > To support memory not backed by page structs in the future, is there any
> > > > counterpart to the page struct to hold the refcount and mapcount?
> > > >
> > >
> > > I imagine the needed support will follow semantics similar to VM_PFNMAP
> > > [1] memory: no need to maintain refcounts/mapcounts for such physical
> > > memory ranges, as all users will be notified when mappings are
> > > changed/removed.
> > So, is it possible to map such memory into both the shared and private EPT
> > simultaneously?
> 
> No, guest_memfd will still ensure that userspace can only fault in
> shared memory regions in order to support CoCo VM usecases.
Before guest_memfd converts a PFN from shared to private, how does it ensure
there are no shared mappings? For example, in [1] it uses the folio reference
count to ensure that.

Or is the idea that, with struct page eliminated, there can be no GUP, so the
absence of shared mappings is guaranteed by requiring all mappers to unmap in
response to a guest_memfd invalidation notification?
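
For concreteness, that "all mappers unmap on notification" model would look
roughly like the existing mmu_interval_notifier pattern. A minimal sketch,
assuming a hypothetical guest_memfd user (my_unmap_range() and my_register()
are made-up names for illustration, not actual API):

#include <linux/mmu_notifier.h>

/* Hypothetical page-table teardown for the user's own mappings. */
static void my_unmap_range(unsigned long start, unsigned long end);

/*
 * Instead of pinning pages, every user of the memory registers an
 * interval notifier and tears down its own mappings when invalidated.
 */
static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	/* Record the sequence so racing faults can detect the update. */
	mmu_interval_set_seq(mni, cur_seq);
	my_unmap_range(range->start, range->end);
	return true;
}

static const struct mmu_interval_notifier_ops my_ops = {
	.invalidate = my_invalidate,
};

static struct mmu_interval_notifier my_mni;

static int my_register(struct mm_struct *mm, unsigned long start,
		       unsigned long length)
{
	return mmu_interval_notifier_insert(&my_mni, mm, start, length,
					    &my_ops);
}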

As Documentation/core-api/pin_user_pages.rst notes, long-term pinning users
have no need to register an MMU notifier. So why must users like VFIO register
for guest_memfd invalidation notifications?

Besides, how would guest_memfd handle potential unmap failures? For example,
what prevents converting a private PFN to shared if TDX hits an error while
unmapping the private PFN, or if a device refuses to stop DMAing to it?

Currently, guest_memfd can rely on the page refcount to avoid re-assigning a
PFN that failed to be unmapped.
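
A minimal sketch of the "ref only on removal failure" scheme mentioned
earlier in the thread (tdx_sept_remove_page() and the surrounding function
are illustrative names, not actual TDX/KVM API):

/*
 * Take a folio reference only when removal from the S-EPT fails, so
 * that guest_memfd's conversion path, which refuses folios with
 * unexpected references (cf. [1]), never re-assigns the PFN.
 */
static void tdx_remove_private_page(struct kvm *kvm, gfn_t gfn,
				    struct page *page)
{
	if (tdx_sept_remove_page(kvm, gfn, page)) {
		/* Deliberately leak the page: pin it forever. */
		folio_get(page_folio(page));
		pr_err_ratelimited("failed to unmap GFN 0x%llx\n", gfn);
	}
}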


[1] https://lore.kernel.org/all/20250328153133.3504118-5-tabba@google.com/


> >
> >
> > > Any guest_memfd range updates will result in invalidations/updates of
> > > userspace, guest, IOMMU, or any other page tables referring to
> > > guest_memfd-backed PFNs. This story will become clearer once support
> > > for a PFN range allocator backing guest_memfd starts getting
> > > discussed.
> > OK. It is indeed unclear right now how to support that kind of memory.
> >
> > As of now, we don't anticipate that TDX will allow any mapping of VM_PFNMAP
> > memory into the private EPT until TDX Connect.
> 
> There is a plan to use VM_PFNMAP memory for all guest_memfd
> shared/private ranges, orthogonal to the TDX Connect usecase. With TDX
> Connect/SEV-TIO, the major difference would be that guest_memfd private
> ranges will be mapped into IOMMU page tables.
> 
> Irrespective of whether/when VM_PFNMAP memory support lands, there
> have been discussions on not using page structs for private memory
> ranges altogether [1], even with the hugetlb allocator, which will
> simplify the seamless merge/split story for private hugepages needed to
> support memory conversion. So I think the general direction we should
> head towards is not relying on refcounts for guest_memfd private
> ranges, and/or not relying on page structs altogether.
It's fine to use PFNs, but I wonder whether there are counterparts of struct
page to keep all the necessary info.
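
Purely as a hypothetical illustration of what such a counterpart could look
like (no such structure exists today), the state that the struct page
refcount/mapcount currently carries would have to move into a per-range
descriptor owned by guest_memfd:

/* Hypothetical per-range bookkeeping replacing struct page fields. */
struct gmem_pfn_range {
	unsigned long	pfn_start;
	unsigned long	nr_pages;
	bool		private;	/* shared vs. private state    */
	bool		unmap_failed;	/* poisoned: never re-assign   */
	refcount_t	users;		/* mappers still holding range */
};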

 
> I think the series [2], which makes KVM work better with PFNMAP'd
> physical memory, is very much in the right direction of not assuming
> page-struct-backed memory ranges for guest_memfd as well.
Note: currently, VM_PFNMAP is usually used together with the VM_IO flag; in
KVM, hva_to_pfn_remapped() only applies when "vma->vm_flags & (VM_IO |
VM_PFNMAP)".
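
For reference, the check in question looks roughly like this in
virt/kvm/kvm_main.c:hva_to_pfn() (simplified excerpt; exact code varies by
kernel version):

	vma = vma_lookup(current->mm, addr);
	if (vma && vma->vm_flags & (VM_IO | VM_PFNMAP))
		/* Follow the PTE manually instead of doing GUP. */
		r = hva_to_pfn_remapped(vma, addr, write_fault,
					writable, &pfn);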


> [1] https://lore.kernel.org/all/CAGtprH8akKUF=8+RkX_QMjp35C0bU1zxGi4v1Zm5AWCw=8V8AQ@mail.gmail.com/
> [2] https://lore.kernel.org/linux-arm-kernel/20241010182427.1434605-1-seanjc@google.com/
> 
> > And even in that scenario, the memory is only for private MMIO, so the
> > backend driver is the VFIO PCI driver rather than guest_memfd.
> 
> Not necessarily. As I mentioned above, guest_memfd ranges will be backed
> by VM_PFNMAP memory.
> 
> >
> >
> > > [1] https://elixir.bootlin.com/linux/v6.14.5/source/mm/memory.c#L6543
