Message-ID: <diqzecvxizp5.fsf@ackerleytng-ctop.c.googlers.com>
Date: Thu, 05 Jun 2025 15:35:50 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: vannapurve@...gle.com, pbonzini@...hat.com, seanjc@...gle.com,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
rick.p.edgecombe@...el.com, dave.hansen@...el.com, kirill.shutemov@...el.com,
tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com,
david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com,
pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com,
jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com,
xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Yan Zhao <yan.y.zhao@...el.com> writes:
> On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> Hi Yan,
>>
>> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> series [1], we took into account conversion failures too. The steps are
>> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> series from GitHub [2] because the steps for conversion changed in two
>> separate patches.)
> ...
>> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>
> Hi Ackerley,
> Thanks for providing this branch.
Here's the WIP branch [1]. I initially wasn't intending to make it very
public, since it's not up to RFC standard yet and I didn't want to add
to the many guest_memfd in-flight series, but since you referred to it,
[2] is a v2 of the WIP branch :)
[1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
This WIP branch has selftests that exercise 1G (aka HugeTLB) page
support together with TDX huge page EPT mappings [7]:
1. "KVM: selftests: TDX: Test conversion to private at different
   sizes". This uses the fact that the TDX module returns an error if a
   page is faulted into the guest at a level different from the accept
   level, which lets the test check the level the page was actually
   faulted in at.
2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
   private_mem_conversions_test for use with TDs. This test does
   multi-vCPU conversions, and we use it to check for issues related to
   conversion races.
3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
private and shared memory". Adds a selftest similar to/on top of
guest_memfd_conversions_test that does conversions via MapGPA.
Full list of selftests I usually run from tools/testing/selftests/kvm:
+ ./guest_memfd_test
+ ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
+ ./x86/private_mem_conversions_test.sh
+ ./set_memory_region_test
+ ./x86/private_mem_kvm_exits_test
+ ./x86/tdx_vm_test
+ ./x86/tdx_upm_test
+ ./x86/tdx_shared_mem_test
+ ./x86/tdx_gmem_private_and_shared_test
As an overview for anyone who might be interested in this WIP branch:
1. I started with upstream's kvm/next
2. Applied TDX selftests series [3]
3. Applied guest_memfd mmap series [4]
4. Applied conversions (sub)series and HugeTLB (sub)series [5]
5. Added some fixes for 2 of the earlier series (as labeled in the
   commit messages)
6. Updated guest_memfd conversions selftests to work with TDX
7. Applied 2M EPT series [6] with some hacks
8. Some patches to make guest_memfd mmap return huge-page-aligned
   userspace addresses
9. Selftests for guest_memfd conversion with TDX 2M EPT
[3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
[4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
[7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
>
> I'm now trying to make TD huge pages work on this branch and would like to
> report the errors I encountered during this process early.
>
> 1. symbol arch_get_align_mask() is not available when KVM is compiled as module.
> I currently workaround it as follows:
>
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -102,8 +102,13 @@ static unsigned long kvm_gmem_get_align_mask(struct file *file,
> void *priv;
>
> inode = file_inode(file);
> - if (!kvm_gmem_has_custom_allocator(inode))
> - return arch_get_align_mask(file, flags);
> + if (!kvm_gmem_has_custom_allocator(inode)) {
> + page_size = 1 << PAGE_SHIFT;
> + return PAGE_MASK & (page_size - 1);
> + }
>
>
Thanks, will fix in the next revision.
> 2. Bug of Sleeping function called from invalid context
>
> [ 193.523469] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:325
> [ 193.539885] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3332, name: guest_memfd_con
> [ 193.556235] preempt_count: 1, expected: 0
> [ 193.564518] RCU nest depth: 0, expected: 0
> [ 193.572866] 3 locks held by guest_memfd_con/3332:
> [ 193.581800] #0: ff16f8ec217e4438 (sb_writers#14){.+.+}-{0:0}, at: __x64_sys_fallocate+0x46/0x80
> [ 193.598252] #1: ff16f8fbd85c8310 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fallocate+0x9e/0x310 [kvm]
> [ 193.616706] #2: ff3189b5e4f65018 (&(kvm)->mmu_lock){++++}-{3:3}, at: kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
> [ 193.635790] Preemption disabled at:
> [ 193.635793] [<ffffffffc0850c6f>] kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
>
> This is because add_to_invalidated_kvms() invokes kzalloc() inside kvm->mmu_lock,
> which is a kind of spinlock.
>
> I worked around it as follows.
>
> static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
> pgoff_t start, pgoff_t end,
> @@ -1261,13 +1268,13 @@ static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
> KVM_MMU_LOCK(kvm);
> kvm_mmu_invalidate_begin(kvm);
>
> - if (invalidated_kvms) {
> - ret = add_to_invalidated_kvms(invalidated_kvms, kvm);
> - if (ret) {
> - kvm_mmu_invalidate_end(kvm);
> - goto out;
> - }
> - }
> }
>
>
> @@ -1523,12 +1530,14 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> }
>
> out:
> - list_for_each_entry_safe(entry, tmp, &invalidated_kvms, list) {
> - kvm_gmem_do_invalidate_end(entry->kvm);
> - list_del(&entry->list);
> - kfree(entry);
> - }
> + list_for_each_entry(gmem, gmem_list, entry)
> + kvm_gmem_do_invalidate_end(gmem->kvm);
>
> filemap_invalidate_unlock(inode->i_mapping);
>
>
I fixed this in WIP series v2 by grouping splitting with
unmapping. Please see this commit [8]; the commit message includes an
explanation of what's done.
[8] https://github.com/googleprodkernel/linux-cc/commit/fd27635e5209b5e45a628d7fcf42a17a2b3c7e78
> Will let you know more findings later.
>
> Thanks
> Yan