Message-ID: <c5c964cd-947e-43ff-9c79-18c1555aea8e@redhat.com>
Date: Mon, 14 Jul 2025 18:02:58 +0200
From: David Hildenbrand <david@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>, Yan Zhao <yan.y.zhao@...el.com>
Cc: Michael Roth <michael.roth@....com>, pbonzini@...hat.com,
 kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 rick.p.edgecombe@...el.com, kai.huang@...el.com, adrian.hunter@...el.com,
 reinette.chatre@...el.com, xiaoyao.li@...el.com, tony.lindgren@...el.com,
 binbin.wu@...ux.intel.com, dmatlack@...gle.com, isaku.yamahata@...el.com,
 ira.weiny@...el.com, vannapurve@...gle.com, ackerleytng@...gle.com,
 tabba@...gle.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from
 kvm_gmem_populate()

On 14.07.25 17:46, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Yan Zhao wrote:
>> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
>>> On Fri, Jul 11, 2025, Michael Roth wrote:
>>>> On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
>>>>> Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
>>>>> log:
>>>>>
>>>>> Problem
>>>>> ===
>>>>> ...
>>>>> (2)
>>>>> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
>>>>> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
>>>>> - filemap invalidation lock --> mm->mmap_lock
>>>>>
>>>>> However, in future code, the shared filemap invalidation lock will be held
>>>>> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
>>>>> - mm->mmap_lock --> filemap invalidation lock
>>>>
>>>> I wouldn't expect kvm_gmem_fault_shared() to trigger for the
>>>> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
>>>
>>> Irrespective of shared faults, I think the API could do with a bit of cleanup
>>> now that TDX has landed, i.e. now that we can see a bit more of the picture.
>>>
>>> As is, I'm pretty sure TDX is broken with respect to hugepage support, because
>>> kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
>>> with one page at a time.  So that needs to be changed.  I assume it's already
>> In TDX RFC v1, we dealt with multiple pages at a time :)
>> https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/
>>
>>> addressed in one of the many upcoming series, but it still shows a flaw in the API.
>>>
>>> Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
>>> seems pretty straightforward, and would provide consistent ABI for all vendor
>>> flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
>>> obvious downside is that struct-page becomes a requirement for SNP, but
>>> that seems like an acceptable tradeoff.
>>>
>>> The below could be tweaked to batch get_user_pages() into an array of pointers,
>>> but given that both SNP and TDX can only operate on one 4KiB page at a time, and
>>> that hugepage support doesn't yet exist, trying to super optimize the hugepage
>>> case straightaway doesn't seem like a pressing concern.
>>
>>> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>>> 				struct file *file, gfn_t gfn, void __user *src,
>>> 				kvm_gmem_populate_cb post_populate, void *opaque)
>>> {
>>> 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
>>> 	struct page *src_page = NULL;
>>> 	bool is_prepared = false;
>>> 	struct folio *folio;
>>> 	int ret, max_order;
>>> 	kvm_pfn_t pfn;
>>>
>>> 	if (src) {
>>> 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
>> get_user_pages_fast()?
>>
>> get_user_pages() can't pass the assertion of mmap_assert_locked().
> 
> Oh, I forgot get_user_pages() requires mmap_lock to already be held.  I would
> prefer to not use a fast variant, so that userspace isn't required to prefault
> (and pin?) the source.
> 
> So get_user_pages_unlocked()?

Yes, but likely we really want get_user_pages_fast(), which will fall
back to GUP-slow (+take the lock) in case it doesn't find what it
needs in the page tables.

get_user_pages_fast_only() would be the variant that doesn't fall back to
GUP-slow.

-- 
Cheers,

David / dhildenb

