lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqzh602jdk8.fsf@ackerleytng-ctop.c.googlers.com>
Date: Thu, 26 Jun 2025 16:19:51 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	x86@...nel.org, linux-fsdevel@...r.kernel.org
Cc: aik@....com, ajones@...tanamicro.com, akpm@...ux-foundation.org, 
	amoorthy@...gle.com, anthony.yznaga@...cle.com, anup@...infault.org, 
	aou@...s.berkeley.edu, bfoster@...hat.com, binbin.wu@...ux.intel.com, 
	brauner@...nel.org, catalin.marinas@....com, chao.p.peng@...el.com, 
	chenhuacai@...nel.org, dave.hansen@...el.com, david@...hat.com, 
	dmatlack@...gle.com, dwmw@...zon.co.uk, erdemaktas@...gle.com, 
	fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, haibo1.xu@...el.com, 
	hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com, 
	jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com, 
	jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com, 
	kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev, 
	kirill.shutemov@...el.com, liam.merwick@...cle.com, 
	maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org, 
	mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au, 
	muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es, 
	oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com, 
	paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk, 
	peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com, 
	quic_cvanscha@...cinc.com, quic_eberman@...cinc.com, 
	quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com, 
	quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com, 
	rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk, 
	rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com, 
	steven.sistare@...cle.com, suzuki.poulose@....com, tabba@...gle.com, 
	thomas.lendacky@....com, usama.arif@...edance.com, vannapurve@...gle.com, 
	vbabka@...e.cz, viro@...iv.linux.org.uk, vkuznets@...hat.com, 
	wei.w.wang@...el.com, will@...nel.org, willy@...radead.org, 
	xiaoyao.li@...el.com, yan.y.zhao@...el.com, yilun.xu@...el.com, 
	yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd

Ackerley Tng <ackerleytng@...gle.com> writes:

> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> [...]

At the guest_memfd upstream call today (2025-06-26), we talked about
when to merge folios with respect to conversions.

Just want to call out that in this RFCv2, we managed to get conversions
working with merges happening as soon as possible.

"As soon as possible" means merges happen as long as shareability is all
private (or all meaningless) within an aligned hugepage range. We try to
merge after every conversion request and on truncation. On truncation,
shareability becomes meaningless.

On explicit truncation (e.g. fallocate(PUNCH_HOLE)), truncation can fail
if there are unexpected refcounts (because we can't merge with
unexpected refcounts). Explicit truncation will succeed only if
refcounts are expected, and merge is performed before finally removing
from filemap.

On truncation caused by file close or inode release, guest_memfd may not
hold the last refcount on the folio. Only in this case, we defer merging
to the folio_put() callback, and because the callback can be called from
atomic context, the merge is further deferred to be performed by a
kernel worker.

Deferment of merging is already minimized so that most of the
restructuring is synchronous with some userspace-initiated action
(conversion or explicit truncation). The only deferred merge is when the
file is closed, and in that case there's no way to reject/fail this file
close.

(There are possible optimizations here - Yan suggested [1] checking if
the folio_put() was called from interrupt context - I have not tried
implementing that yet)


I did propose an explicit guest_memfd merge ioctl, but since RFCv2
works, I was thinking to to have the merge ioctl be a separate
optimization/project/patch series if it turns out that merging
as-soon-as-possible is an inefficient strategy, or if some VM use cases
prefer to have an explicit merge ioctl.


During the call, Michael also brought up that SNP adds some constraints
with respect to guest accepting pages/levels.

Could you please expand on that? Suppose for an SNP guest,

1. Guest accepted a page at 2M level
2. Guest converts a 4K sub page to shared
3. guest_memfd requests unmapping of the guest-requested 4K range
   (the rest of the 2M remains mapped into stage 2 page tables)
4. guest_memfd splits the huge page to 4K pages (the 4K is set to
   SHAREABILITY_ALL, the rest of the 2M is still SHAREABILITY_GUEST)

Can the SNP guest continue to use the rest of the 2M page or must it
re-accept all the pages at 4K?

And for the reverse:

1. Guest accepted a 2M range at 4K
2. guest_memfd merges the full 2M range to a single 2M page

Must the SNP guest re-accept at 2M for the guest to continue
functioning, or will the SNP guest continue to work (just with poorer
performance than if the memory was accepted at 2M)?

[1] https://lore.kernel.org/all/aDfT35EsYP%2FByf7Z@yzhao56-desk.sh.intel.com/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ