Message-ID: <diqzecvxizp5.fsf@ackerleytng-ctop.c.googlers.com>
Date: Thu, 05 Jun 2025 15:35:50 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: vannapurve@...gle.com, pbonzini@...hat.com, seanjc@...gle.com,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org, x86@...nel.org,
rick.p.edgecombe@...el.com, dave.hansen@...el.com, kirill.shutemov@...el.com,
tabba@...gle.com, quic_eberman@...cinc.com, michael.roth@....com,
david@...hat.com, vbabka@...e.cz, jroedel@...e.de, thomas.lendacky@....com,
pgonda@...gle.com, zhiquan1.li@...el.com, fan.du@...el.com,
jun.miao@...el.com, ira.weiny@...el.com, isaku.yamahata@...el.com,
xiaoyao.li@...el.com, binbin.wu@...ux.intel.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH 08/21] KVM: TDX: Increase/decrease folio ref for huge pages
Yan Zhao <yan.y.zhao@...el.com> writes:
> On Wed, Jun 04, 2025 at 01:02:54PM -0700, Ackerley Tng wrote:
>> Hi Yan,
>>
>> While working on the 1G (aka HugeTLB) page support for guest_memfd
>> series [1], we took into account conversion failures too. The steps are
>> in kvm_gmem_convert_range(). (It might be easier to pull the entire
>> series from GitHub [2] because the steps for conversion changed in two
>> separate patches.)
> ...
>> [2] https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>
> Hi Ackerley,
> Thanks for providing this branch.
Here's the WIP branch [1]. I initially wasn't intending to make it very
public, since it's not up to RFC standard yet and I didn't want to add
to the many guest_memfd in-flight series, but since you referred to it,
[2] is a v2 of the WIP branch :)
[1] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept
[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept-v2
This WIP branch has selftests that exercise 1G (aka HugeTLB) page
support together with TDX huge page EPT mappings [7]:
1. "KVM: selftests: TDX: Test conversion to private at different
   sizes". This uses the fact that the TDX module returns an error if a
   page is faulted into the guest at a level different from the accept
   level, which lets the test check the level the page was actually
   faulted in at.
2. "KVM: selftests: Test TDs in private_mem_conversions_test". Updates
   private_mem_conversions_test for use with TDs. This test does
   multi-vCPU conversions, and we use it to check for issues related to
   conversion races.
3. "KVM: selftests: TDX: Test conversions when guest_memfd used for
private and shared memory". Adds a selftest similar to/on top of
guest_memfd_conversions_test that does conversions via MapGPA.
Full list of selftests I usually run from tools/testing/selftests/kvm:
+ ./guest_memfd_test
+ ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_conversions_test
+ ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_hugetlb_reporting_test
+ ./x86/private_mem_conversions_test.sh
+ ./set_memory_region_test
+ ./x86/private_mem_kvm_exits_test
+ ./x86/tdx_vm_test
+ ./x86/tdx_upm_test
+ ./x86/tdx_shared_mem_test
+ ./x86/tdx_gmem_private_and_shared_test
As an overview for anyone who might be interested in this WIP branch:
1. I started with upstream's kvm/next
2. Applied TDX selftests series [3]
3. Applied guest_memfd mmap series [4]
4. Applied conversions (sub)series and HugeTLB (sub)series [5]
5. Added some fixes for 2 of the earlier series (as labeled in the
   commit messages)
6. Updated guest_memfd conversions selftests to work with TDX
7. Applied 2M EPT series [6] with some hacks
8. Some patches to make guest_memfd mmap return huge-page-aligned
   userspace addresses
9. Selftests for guest_memfd conversion with TDX 2M EPT
[3] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
[4] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[5] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[6] https://lore.kernel.org/all/Z%2FOMB7HNO%2FRQyljz@yzhao56-desk.sh.intel.com/
[7] https://lore.kernel.org/all/20250424030033.32635-1-yan.y.zhao@intel.com/
>
> I'm now trying to make TD huge pages work on this branch and would like to
> report the errors I encountered during this process early.
>
> 1. symbol arch_get_align_mask() is not available when KVM is compiled as module.
> I currently workaround it as follows:
>
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -102,8 +102,13 @@ static unsigned long kvm_gmem_get_align_mask(struct file *file,
> void *priv;
>
> inode = file_inode(file);
> - if (!kvm_gmem_has_custom_allocator(inode))
> - return arch_get_align_mask(file, flags);
> + if (!kvm_gmem_has_custom_allocator(inode)) {
> + page_size = 1 << PAGE_SHIFT;
> + return PAGE_MASK & (page_size - 1);
> + }
>
>
Thanks, will fix in the next revision.
> 2. Bug of Sleeping function called from invalid context
>
> [ 193.523469] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:325
> [ 193.539885] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3332, name: guest_memfd_con
> [ 193.556235] preempt_count: 1, expected: 0
> [ 193.564518] RCU nest depth: 0, expected: 0
> [ 193.572866] 3 locks held by guest_memfd_con/3332:
> [ 193.581800] #0: ff16f8ec217e4438 (sb_writers#14){.+.+}-{0:0}, at: __x64_sys_fallocate+0x46/0x80
> [ 193.598252] #1: ff16f8fbd85c8310 (mapping.invalidate_lock#4){++++}-{4:4}, at: kvm_gmem_fallocate+0x9e/0x310 [kvm]
> [ 193.616706] #2: ff3189b5e4f65018 (&(kvm)->mmu_lock){++++}-{3:3}, at: kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
> [ 193.635790] Preemption disabled at:
> [ 193.635793] [<ffffffffc0850c6f>] kvm_gmem_invalidate_begin_and_zap+0x17f/0x260 [kvm]
>
> This is because add_to_invalidated_kvms() invokes kzalloc() inside kvm->mmu_lock,
> which is a kind of spinlock.
>
> I worked around it as follows.
>
> static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
> pgoff_t start, pgoff_t end,
> @@ -1261,13 +1268,13 @@ static int kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
> KVM_MMU_LOCK(kvm);
> kvm_mmu_invalidate_begin(kvm);
>
> - if (invalidated_kvms) {
> - ret = add_to_invalidated_kvms(invalidated_kvms, kvm);
> - if (ret) {
> - kvm_mmu_invalidate_end(kvm);
> - goto out;
> - }
> - }
> }
>
>
> @@ -1523,12 +1530,14 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> }
>
> out:
> - list_for_each_entry_safe(entry, tmp, &invalidated_kvms, list) {
> - kvm_gmem_do_invalidate_end(entry->kvm);
> - list_del(&entry->list);
> - kfree(entry);
> - }
> + list_for_each_entry(gmem, gmem_list, entry)
> + kvm_gmem_do_invalidate_end(gmem->kvm);
>
> filemap_invalidate_unlock(inode->i_mapping);
>
>
I fixed this in WIP series v2 by grouping splitting with
unmapping. Please see this commit [8]; the commit message includes an
explanation of what's done.
[8] https://github.com/googleprodkernel/linux-cc/commit/fd27635e5209b5e45a628d7fcf42a17a2b3c7e78
> Will let you know more findings later.
>
> Thanks
> Yan