linux-kernel - Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

Message-ID: <CAGtprH9R5AjnuOHsmAOzXL8rwE=yTJbQN=7kk6rfxmriB9okKQ@mail.gmail.com>
Date: Fri, 20 Jun 2025 11:06:34 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Ackerley Tng <ackerleytng@...gle.com>, pbonzini@...hat.com, seanjc@...gle.com, 
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org, 
	rick.p.edgecombe@...el.com, dave.hansen@...el.com, kirill.shutemov@...el.com, 
	tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com, 
	david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com, 
	pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com, 
	jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com, 
	xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages

On Thu, Jun 19, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@...el.com> wrote:
>
> On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@...el.com> writes:
> >
> > > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> > >> Hi Yan,
> > >>
> > >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> > >> series [1], we took into account conversion failures too. The steps are
> > >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> > >> series from GitHub [2] because the steps for conversion changed in two
> > >> separate patches.)
> > > ...
> > >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > >
> > > Hi Ackerley,
> > > Thanks for providing this branch.
> >
> > Here's the WIP branch [1], which I initially wasn't intending to make
> > super public since it's not even RFC standard yet and I didn't want to
> > add to the many guest_memfd in-flight series, but since you referred to
> > it, [2] is a v2 of the WIP branch :)
> >
> > [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> > [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
> Thanks. [2] works. TDX huge pages have now been successfully rebased on top of [2].
>
>
> > This WIP branch has selftests that test 1G aka HugeTLB page support with
> > TDX huge page EPT mappings [7]:
> >
> > 1. "KVM: selftests: TDX: Test conversion to private at different
> >    sizes". This uses the fact that the TDX module returns an error if
> >    the page is faulted into the guest at a different level from the
> >    accept level, to check the level at which the page was faulted in.
> > 2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
> >    private_mem_conversions_test for use with TDs. This test does
> >    multi-vCPU conversions and we use this to check for issues to do with
> >    conversion races.
> > 3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
> >    private and shared memory". Adds a selftest similar to/on top of
> >    guest_memfd_conversions_test that does conversions via MapGPA.
> >
> > Full list of selftests I usually run from tools/testing/selftests/kvm:
> > + ./guest_memfd_test
> > + ./guest_memfd_conversions_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
> > + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
> > + ./x86/private_mem_conversions_test.sh
> > + ./set_memory_region_test
> > + ./x86/private_mem_kvm_exits_test
> > + ./x86/tdx_vm_test
> > + ./x86/tdx_upm_test
> > + ./x86/tdx_shared_mem_test
> > + ./x86/tdx_gmem_private_and_shared_test
> >
> > As an overview for anyone who might be interested in this WIP branch:
> >
> > 1.  I started with upstream's kvm/next
> > 2.  Applied TDX selftests series [3]
> > 3.  Applied guest_memfd mmap series [4]
> > 4.  Applied conversions (sub)series and HugeTLB (sub)series [5]
> > 5.  Added some fixes for 2 of the earlier series (as labeled in commit
> >     message)
> > 6.  Updated guest_memfd conversions selftests to work with TDX
> > 7.  Applied 2M EPT series [6] with some hacks
> > 8.  Some patches to make guest_memfd mmap return huge-page-aligned
> >     userspace address
> > 9.  Selftests for guest_memfd conversion with TDX 2M EPT
> >
> > [3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> > [4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> > [5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
> > [6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
> > [7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
> Thanks.
> We noticed that it's not easy for TDX initial memory regions to use the
> in-place conversion version of guest_memfd, because
> - tdh_mem_page_add() requires simultaneous access to shared source memory and
>   private target memory.
> - shared-to-private in-place conversion first unmaps the shared memory and tests
>   if any extra folio refcount is held before the conversion is allowed.
>
> Therefore, though tdh_mem_page_add() actually supports in-place add, see [8],
> we can't store the initial content in the mmap-ed VA of the in-place conversion
> version of guest_memfd.
>
> So, I modified QEMU to work around this issue by adding an extra anonymous
> backend to hold source pages in shared memory, with the target private PFN
> allocated from guest_memfd with GUEST_MEMFD_FLAG_SUPPORT_SHARED set.

Yeah, this scheme of using a different memory backing for the initial
payload makes sense to me.
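
To make the constraint concrete, here is a toy userspace model in Python.
It is only a sketch, not kernel or QEMU code: the Folio class, the baseline
refcount of 1, and every function name below are made up for illustration.
It assumes the initial add keeps a reference on the shared source while the
target is converted to private, which is what collides with the refcount
check when source and target are the same folio.

```python
from dataclasses import dataclass


@dataclass
class Folio:
    """Toy stand-in for a page/folio (hypothetical, not a kernel structure)."""
    refcount: int = 1          # assumed baseline reference
    shared_mapped: bool = True
    private: bool = False


EXPECTED_REFS = 1  # assumed refcount when nobody else holds the folio


def convert_to_private(folio: Folio) -> bool:
    """Shared-to-private in-place conversion: unmap first, then check refs.

    This toy does not model error rollback.
    """
    folio.shared_mapped = False
    if folio.refcount > EXPECTED_REFS:
        return False           # an extra reference blocks the conversion
    folio.private = True
    return True


def initial_add(src: Folio, dst: Folio) -> bool:
    """Model of the initial add: src must stay accessible as shared memory
    while dst is private, so src is pinned for the duration."""
    src.refcount += 1          # pin the source
    try:
        if not convert_to_private(dst):
            return False
        return src.shared_mapped and dst.private
    finally:
        src.refcount -= 1      # unpin the source


# In-place: source and target are the same folio, so the pin blocks conversion.
same = Folio()
in_place_ok = initial_add(same, same)

# Separate backing: anonymous source, guest_memfd target.
separate_ok = initial_add(Folio(), Folio())

print(in_place_ok, separate_ok)
```

In this model the in-place case fails because the pin on the source is also
an extra reference on the target, while the separate-backing case succeeds,
which matches the reasoning for the anonymous backend above.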