Message-ID: <aFPGPVbzo92t565h@yzhao56-desk.sh.intel.com>
Date: Thu, 19 Jun 2025 16:11:41 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: <vannapurve@...gle.com>, <pbonzini@...hat.com>, <seanjc@...gle.com>,
<linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>, <x86@...nel.org>,
<rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
<kirill.shutemov@...el.com>, <tabba@...gle.com>, <quic_eberman@...cinc.com>,
<michael.roth@....com>, <david@...hat.com>, <vbabka@...e.cz>,
<jroedel@...e.de>, <thomas.lendacky@....com>, <pgonda@...gle.com>,
<zhiquan1.li@...el.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge
pages
On Thu, Jun 05, 2025 at 03:35:50PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@...el.com> writes:
>
> > On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
> >> Hi Yan,
> >>
> >> While working on the 1G (aka HugeTLB) page support for guest_memfd
> >> series [1], we took into account conversion failures too. The steps are
> >> in kvm_gmem_convert_range(). (It might be easier to pull the entire
> >> series from GitHub [2] because the steps for conversion changed in two
> >> separate patches.)
> > ...
> >> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Hi Ackerley,
> > Thanks for providing this branch.
>
> Here's the WIP branch [1], which I initially wasn't intending to make
> super public since it's not even RFC standard yet and I didn't want to
> add to the many guest_memfd in-flight series, but since you referred to
> it, [2] is a v2 of the WIP branch :)
>
> [1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
> [2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
Thanks. [2] works. The TDX huge pages series has now been successfully rebased on top of [2].
> This WIP branch has selftests that test 1G aka HugeTLB page support with
> TDX huge page EPT mappings [7]:
>
> 1. "KVM: selftests: TDX: Test conversion to private at different
> sizes". This uses the fact that TDX module will return error if the
> page is faulted into the guest at a different level from the accept
> level to check the level that the page was faulted in.
> 2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
> private_mem_conversions_test for use with TDs. This test does
> multi-vCPU conversions and we use this to check for issues to do with
> conversion races.
> 3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
> private and shared memory". Adds a selftest similar to/on top of
> guest_memfd_conversions_test that does conversions via MapGPA.
>
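(A side note for anyone reading along: the level-check trick in test 1 works
because TDG.MEM.PAGE.ACCEPT takes the requested mapping level in the low bits
of RCX, and the TDX module fails the call with a size-mismatch status when the
S-EPT mapping level differs. A rough guest-side sketch; tdg_mem_page_accept()
stands in for whatever ACCEPT wrapper the selftest harness provides, so treat
the names as hypothetical:

/*
 * Probe whether a GPA is mapped at 2M by accepting it at that level.
 * tdg_mem_page_accept() is a hypothetical wrapper for the
 * TDG.MEM.PAGE.ACCEPT TDCALL; per the TDX module ABI, RCX carries the
 * GPA with the requested level in bits 2:0 (0 = 4K, 1 = 2M).
 */
#define ACCEPT_LEVEL_4K	0
#define ACCEPT_LEVEL_2M	1

static bool gpa_mapped_at_2m(uint64_t gpa)
{
	uint64_t err = tdg_mem_page_accept(gpa | ACCEPT_LEVEL_2M);

	/* A size-mismatch status means the page was faulted into the
	 * guest at a different level than the accept level. */
	return err != TDX_PAGE_SIZE_MISMATCH;
}
)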
> Full list of selftests I usually run from tools/testing/selftests/kvm:
> + ./guest_memfd_test
> + ./guest_memfd_conversions_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
> + ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
> + ./x86/private_mem_conversions_test.sh
> + ./set_memory_region_test
> + ./x86/private_mem_kvm_exits_test
> + ./x86/tdx_vm_test
> + ./x86/tdx_upm_test
> + ./x86/tdx_shared_mem_test
> + ./x86/tdx_gmem_private_and_shared_test
>
> As an overview for anyone who might be interested in this WIP branch:
>
> 1. I started with upstream's kvm/next
> 2. Applied TDX selftests series [3]
> 3. Applied guest_memfd mmap series [4]
> 4. Applied conversions (sub)series and HugeTLB (sub)series [5]
> 5. Added some fixes for 2 of the earlier series (as labeled in commit
> message)
> 6. Updated guest_memfd conversions selftests to work with TDX
> 7. Applied 2M EPT series [6] with some hacks
> 8. Some patches to make guest_memfd mmap return huge-page-aligned
> userspace address
> 9. Selftests for guest_memfd conversion with TDX 2M EPT
>
> [3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> [4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> [5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
> [6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
> [7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
Thanks.
We noticed that it's not easy for TDX initial memory regions to use the
in-place conversion version of guest_memfd, because
- tdh_mem_page_add() requires simultaneous access to the shared source memory
and the private target memory.
- shared-to-private in-place conversion first unmaps the shared memory and
checks that no extra folio refcount is held before the conversion is allowed.
Therefore, even though tdh_mem_page_add() itself supports in-place add (see
[8]), we can't stage the initial contents in the mmap-ed VA of the in-place
conversion version of guest_memfd: by the time the target page is private, the
shared mapping that held those contents has already been unmapped.
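To make the ordering problem concrete, the conversion gate has roughly this
shape (purely illustrative; the names below are made up, not the actual
helpers in the conversion series):

/*
 * Illustrative sketch of the shared->private conversion gate described
 * above. unmap_shared_range() and expected_ref_count() are placeholder
 * names, not the real helpers.
 */
static int convert_to_private(struct folio *folio)
{
	/* The shared mapping goes away first, so any initial contents
	 * staged through the guest_memfd mmap are unreachable by the
	 * time the page becomes private... */
	unmap_shared_range(folio);

	/* ...and holding an extra reference on the folio (e.g. to keep
	 * it alive as the tdh_mem_page_add() source) makes the
	 * conversion fail instead. */
	if (folio_ref_count(folio) > expected_ref_count(folio))
		return -EAGAIN;

	return 0;
}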
So, I modified QEMU to work around this issue by adding an extra anonymous
backend to hold the source pages in shared memory, while the target private
PFNs are allocated from a guest_memfd created with
GUEST_MEMFD_FLAG_SUPPORT_SHARED set.
The goal is to test whether kvm_gmem_populate() works for TDX huge pages.
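On the userspace side the workaround boils down to something like the below.
The uAPI structs follow the TDX base series as I read it, so please
double-check them against your headers; error handling is trimmed:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/*
 * Stage the initial contents in a plain anonymous buffer (which stays
 * shared), while the private PFNs for @gpa come from a guest_memfd
 * created with GUEST_MEMFD_FLAG_SUPPORT_SHARED. This keeps a readable
 * shared source alive for tdh_mem_page_add() even though the
 * guest_memfd mmap of the target range is already unmapped.
 */
static int tdx_add_initial_region(int vcpu_fd, void *contents, size_t size,
				  uint64_t gpa)
{
	struct kvm_tdx_init_mem_region region = {};
	struct kvm_tdx_cmd cmd = {};
	void *src;

	src = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return -1;
	memcpy(src, contents, size);

	region.source_addr = (uint64_t)src;
	region.gpa = gpa;
	region.nr_pages = size / 4096;

	cmd.id = KVM_TDX_INIT_MEM_REGION;
	cmd.data = (uint64_t)&region;

	return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}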
This testing exposed a bug in kvm_gmem_populate(), which has been fixed in the
following patch.
commit 5f33ed7ca26f00a61c611d2d1fbc001a7ecd8dca
Author: Yan Zhao <yan.y.zhao@...el.com>
Date: Mon Jun 9 03:01:21 2025 -0700
Bug fix: Reduce max_order when GFN is not aligned
Fix the warning hit in kvm_gmem_populate().
"WARNING: CPU: 7 PID: 4421 at arch/x86/kvm/../../../virt/kvm/guest_memfd.c:
2496 kvm_gmem_populate+0x4a4/0x5b0"
The GFN passed to kvm_gmem_populate() may have an offset so it may not be
aligned to folio order. In this case, reduce the max_order to decrease the
mapping level.
Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 4b8047020f17..af7943c0a8ba 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -2493,7 +2493,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		}
 		folio_unlock(folio);
 
-		WARN_ON(!IS_ALIGNED(gfn, 1 << max_order));
+		while (!IS_ALIGNED(gfn, 1 << max_order))
+			max_order--;
 
 		npages_to_populate = min(npages - i, 1 << max_order);
 		npages_to_populate = private_npages_to_populate(
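The loop just clamps max_order to the alignment of the starting GFN. A
standalone illustration (plain userspace C, with IS_ALIGNED() open-coded):

#include <stdio.h>

/* Same logic as the fix above: drop max_order until gfn is aligned to
 * (1 << max_order) pages. */
static int clamp_order(unsigned long gfn, int max_order)
{
	while (gfn & ((1UL << max_order) - 1))
		max_order--;
	return max_order;
}

int main(void)
{
	/* An offset gfn inside a 2M (order-9) folio must drop to 4K. */
	printf("%d\n", clamp_order(0x201, 9));	/* prints 0 */
	/* A 2M-aligned gfn keeps the full order. */
	printf("%d\n", clamp_order(0x200, 9));	/* prints 9 */
	return 0;
}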
[8] https://cdrdv2-public.intel.com/839195/intel-tdx-module-1.5-abi-spec-348551002.pdf
"In-Place Add: It is allowed to set the TD page HPA in R8 to the same address as
the source page HPA in R9. In this case the source page is converted to be a TD
private page".
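At the SEAMCALL wrapper level, in-place add just means passing the same page
as both the target and the source, roughly as below (the tdh_mem_page_add()
signature matches the wrapper as I read the current code, so verify against
your tree):

	u64 entry, level_state, err;

	/* Regular add: distinct private target and shared source pages;
	 * the contents are copied into the private page. */
	err = tdh_mem_page_add(&kvm_tdx->td, gpa, target_page, source_page,
			       &entry, &level_state);

	/* In-place add per [8]: R8 == R9, so the source page itself is
	 * converted into the TD private page, with no copy. */
	err = tdh_mem_page_add(&kvm_tdx->td, gpa, page, page,
			       &entry, &level_state);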