Message-ID: <diqzh62ezgdh.fsf@ackerleytng-ctop.c.googlers.com>
Date: Wed, 23 Apr 2025 15:02:02 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: tabba@...gle.com, quic_eberman@...cinc.com, roypat@...zon.co.uk,
jgg@...dia.com, peterx@...hat.com, david@...hat.com, rientjes@...gle.com,
fvdl@...gle.com, jthoughton@...gle.com, seanjc@...gle.com,
pbonzini@...hat.com, zhiquan1.li@...el.com, fan.du@...el.com,
jun.miao@...el.com, isaku.yamahata@...el.com, muchun.song@...ux.dev,
erdemaktas@...gle.com, vannapurve@...gle.com, qperret@...gle.com,
jhubbard@...dia.com, willy@...radead.org, shuah@...nel.org,
brauner@...nel.org, bfoster@...hat.com, kent.overstreet@...ux.dev,
pvorel@...e.cz, rppt@...nel.org, richard.weiyang@...il.com,
anup@...infault.org, haibo1.xu@...el.com, ajones@...tanamicro.com,
vkuznets@...hat.com, maciej.wieczor-retman@...el.com, pgonda@...gle.com,
oliver.upton@...ux.dev, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kvm@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct
HugeTLB page

Yan Zhao <yan.y.zhao@...el.com> writes:

> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> +/*
>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> + */
>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> +							     pgoff_t index)
>> +{
>> +	struct kvm_gmem_hugetlb *hgmem;
>> +	pgoff_t aligned_index;
>> +	struct folio *folio;
>> +	int nr_pages;
>> +	int ret;
>> +
>> +	hgmem = kvm_gmem_hgmem(inode);
>> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> +	if (IS_ERR(folio))
>> +		return folio;
>> +
>> +	nr_pages = 1UL << huge_page_order(hgmem->h);
>> +	aligned_index = round_down(index, nr_pages);
> Maybe there's a gap here.
>
> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> corresponding GFN is not 2M/1G aligned.

Thanks for looking into this.

In 1G page support for guest_memfd, the offset and size are always
aligned to the hugepage size requested at guest_memfd creation time. It
is true, though, that when binding to a memslot, slot->base_gfn and
slot->npages may not be hugepage aligned.
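
To make that concrete (made-up numbers): since gfn = slot->base_gfn +
(index - slot->gmem.pgoff), with slot->gmem.pgoff == 0 and
slot->base_gfn == 0x201, the 2M-aligned file index 0x200 (2M == 512 4K
pages) corresponds to gfn == 0x201 + 0x200 == 0x401, which is not
2M-aligned, even though guest_memfd backs that index with a full huge
folio.
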
>
> However, TDX requires that private huge pages be 2M aligned in GFN.
>

IIUC, other factors also contribute to determining the mapping level in
the guest page tables, like lpage_info and .private_max_mapping_level()
in kvm_x86_ops.

If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
will track that and disallow faulting the misaligned head and tail of
the slot into the guest page tables at the larger page sizes.
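
For reference, here is roughly what x86's kvm_alloc_memslot_metadata()
in arch/x86/kvm/x86.c does today when a memslot is created (sketched
from memory for each huge page level, not verbatim):

	/*
	 * If either end of the memslot is not aligned to this level's
	 * page size, mark the first/last lpage_info entry as
	 * disallow_lpage so the MMU never installs a huge mapping that
	 * spans the slot boundary.
	 */
	if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
		linfo[0].disallow_lpage = 1;
	if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
		linfo[lpages - 1].disallow_lpage = 1;

The fault path then rejects any level with disallow_lpage set when
computing the maximum mapping level, so a misaligned slot cannot be
faulted in at 2M/1G regardless of how guest_memfd allocated the backing
folio.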

Hence, I think it is okay to leave it to KVM to fault pages into the
guest correctly: guest_memfd will just maintain the invariant that
offset and size are hugepage aligned, but will not require that
slot->base_gfn and slot->npages are hugepage aligned. This behavior is
consistent with other guest backing memory, like regular shmem or
HugeTLB.

>> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
>> +						 aligned_index,
>> +						 htlb_alloc_mask(hgmem->h));
>> +	WARN_ON(ret);
>> +
>>  	spin_lock(&inode->i_lock);
>>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
>>  	spin_unlock(&inode->i_lock);
>>
>> -	return page_folio(requested_page);
>> +	return folio;
>> +}