Message-ID: <20250826162019.GD2130239@nvidia.com>
Date: Tue, 26 Aug 2025 13:20:19 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: pratyush@...nel.org, jasonmiu@...gle.com, graf@...zon.com,
changyuanl@...gle.com, rppt@...nel.org, dmatlack@...gle.com,
rientjes@...gle.com, corbet@....net, rdunlap@...radead.org,
ilpo.jarvinen@...ux.intel.com, kanie@...ux.alibaba.com,
ojeda@...nel.org, aliceryhl@...gle.com, masahiroy@...nel.org,
akpm@...ux-foundation.org, tj@...nel.org, yoann.congal@...le.fr,
mmaurer@...gle.com, roman.gushchin@...ux.dev, chenridong@...wei.com,
axboe@...nel.dk, mark.rutland@....com, jannh@...gle.com,
vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com,
joel.granados@...nel.org, rostedt@...dmis.org,
anna.schumaker@...cle.com, song@...nel.org, zhangguopeng@...inos.cn,
linux@...ssschuh.net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, linux-mm@...ck.org,
gregkh@...uxfoundation.org, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, rafael@...nel.org, dakr@...nel.org,
bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com,
Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com,
aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com, ptyadav@...zon.de, lennart@...ttering.net,
brauner@...nel.org, linux-api@...r.kernel.org,
linux-fsdevel@...r.kernel.org, saeedm@...dia.com,
ajayachandra@...dia.com, parav@...dia.com, leonro@...dia.com,
witu@...dia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd
On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> + /*
> + * Most of the space should be taken by preserved folios. So take its
> + * size, plus a page for other properties.
> + */
> + fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> + if (!fdt) {
> + err = -ENOMEM;
> + goto err_unpin;
> + }
This doesn't seem to have any versioning scheme; it really should have
one.
> + err = fdt_property_placeholder(fdt, "folios", preserved_size,
> + (void **)&preserved_folios);
> + if (err) {
> + pr_err("Failed to reserve folios property in FDT: %s\n",
> + fdt_strerror(err));
> + err = -ENOMEM;
> + goto err_free_fdt;
> + }
Yuk.
This really wants some luo helpers:
'luo alloc array'
'luo restore array'
'luo free array'
These would get a linearized list of pages in the vmap to hold the
array, allocate some structure to record the page list, and return the
u64 phys_addr of the top of that structure to store in whatever
serialization needs it.
Getting fdt to allocate the array inside the FDT is just not going to
work for anything of size.
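Something like this, just to sketch the shape (all names here are
hypothetical, not real code):

        /*
         * Back a flat array with preserved pages, vmap'd for the
         * caller, and hand back the phys_addr of the page-list
         * descriptor for the caller to store in its serialization.
         */
        struct luo_array {
                void *vaddr;       /* vmap of the linearized pages */
                u64 desc_phys;     /* phys of the page-list descriptor */
                size_t size;
        };

        int luo_alloc_array(struct luo_array *arr, size_t size);
        int luo_restore_array(struct luo_array *arr, u64 desc_phys, size_t size);
        void luo_free_array(struct luo_array *arr);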
> + for (; i < nr_pfolios; i++) {
> + const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> + phys_addr_t phys;
> + u64 index;
> + int flags;
> +
> + if (!pfolio->foliodesc)
> + continue;
> +
> + phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> + folio = kho_restore_folio(phys);
> + if (!folio) {
> + pr_err("Unable to restore folio at physical address: %llx\n",
> + phys);
> + goto put_file;
> + }
> + index = pfolio->index;
> + flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> +
> + /* Set up the folio for insertion. */
> + /*
> + * TODO: Should find a way to unify this and
> + * shmem_alloc_and_add_folio().
> + */
> + __folio_set_locked(folio);
> + __folio_set_swapbacked(folio);
>
> + ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> + if (ret) {
> + pr_err("shmem: failed to charge folio index %d: %d\n",
> + i, ret);
> + goto unlock_folio;
> + }
[..]
> + folio_add_lru(folio);
> + folio_unlock(folio);
> + folio_put(folio);
> + }
Probably some consolidation will be needed to make this less
duplicated..
But overall I think just using the memfd_luo_preserved_folio as the
serialization is entirely fine; I don't think this needs anything more
complicated.
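From the quoted restore loop the entry is presumably just a packed
PFN/flags word plus an index, i.e. something like (inferred, not the
actual definition):

        struct memfd_luo_preserved_folio {
                u64 foliodesc;     /* PFN | flags, see PRESERVED_FOLIO_PFN/_FLAGS */
                u64 index;         /* page cache index in the memfd */
        };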
What it does need is an alternative to the FDT with versioning.
Which seems to me to be entirely fine as:
struct memfd_luo_v0 {
        __aligned_u64 size;
        __aligned_u64 pos;
        __aligned_u64 folios;
};
struct memfd_luo_v0 memfd_luo_v0 = {
        .size = size,
        .pos = file->f_pos,
        .folios = folios,
};
luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
Which also shows that the actual data needing to be serialized comes
from more than one struct and has to be marshaled in code, somehow,
into a single struct.
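e.g., pulling the pieces together with the hypothetical
luo_alloc_array() from above (still just a sketch):

        struct luo_array arr;
        int err;

        err = luo_alloc_array(&arr, nr_pfolios * sizeof(struct memfd_luo_preserved_folio));
        if (err)
                return err;
        /* ... fill arr.vaddr with the preserved folio entries ... */

        struct memfd_luo_v0 memfd_luo_v0 = {
                .size = size,              /* from the inode */
                .pos = file->f_pos,        /* from the struct file */
                .folios = arr.desc_phys,   /* from the array helper */
        };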
Then I imagine a fairly simple forwards/backwards story. If something
new is needed that is non-optional, let's say you compress the folios
list to optimize holes:
struct memfd_luo_v1 {
        __aligned_u64 size;
        __aligned_u64 pos;
        __aligned_u64 folios_list_with_holes;
};
Obviously a v0 kernel cannot parse this, but in this case a v1-aware
kernel could optionally duplicate and write out the v0 format as well:
luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
Then the rule is fairly simple: when the successor kernel goes to
deserialize, it asks luo for the versions it supports:
if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
        restore_v1(&memfd_luo_v1);
else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
        restore_v0(&memfd_luo_v0);
else
        luo_failure("Do not understand this");
luo core just manages this list of versioned data per serialized
object, with at most one blob per version for each object.
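i.e. conceptually something like (hypothetical, just to illustrate the
bookkeeping):

        /* One blob per (object, version) pair: */
        struct luo_object_version {
                struct list_head node;     /* linked off the serialized object */
                u32 version;
                u64 data_phys;             /* preserved copy of the versioned struct */
                u32 data_len;
        };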
Jason