Message-ID: <20250826162019.GD2130239@nvidia.com>
Date: Tue, 26 Aug 2025 13:20:19 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: pratyush@...nel.org, jasonmiu@...gle.com, graf@...zon.com,
changyuanl@...gle.com, rppt@...nel.org, dmatlack@...gle.com,
rientjes@...gle.com, corbet@....net, rdunlap@...radead.org,
ilpo.jarvinen@...ux.intel.com, kanie@...ux.alibaba.com,
ojeda@...nel.org, aliceryhl@...gle.com, masahiroy@...nel.org,
akpm@...ux-foundation.org, tj@...nel.org, yoann.congal@...le.fr,
mmaurer@...gle.com, roman.gushchin@...ux.dev, chenridong@...wei.com,
axboe@...nel.dk, mark.rutland@....com, jannh@...gle.com,
vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com,
joel.granados@...nel.org, rostedt@...dmis.org,
anna.schumaker@...cle.com, song@...nel.org, zhangguopeng@...inos.cn,
linux@...ssschuh.net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, linux-mm@...ck.org,
gregkh@...uxfoundation.org, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, rafael@...nel.org, dakr@...nel.org,
bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com,
Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com,
aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com, ptyadav@...zon.de, lennart@...ttering.net,
brauner@...nel.org, linux-api@...r.kernel.org,
linux-fsdevel@...r.kernel.org, saeedm@...dia.com,
ajayachandra@...dia.com, parav@...dia.com, leonro@...dia.com,
witu@...dia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd
On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> + /*
> + * Most of the space should be taken by preserved folios. So take its
> + * size, plus a page for other properties.
> + */
> + fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> + if (!fdt) {
> + err = -ENOMEM;
> + goto err_unpin;
> + }
This doesn't seem to have any versioning scheme; it really should have
one.
> + err = fdt_property_placeholder(fdt, "folios", preserved_size,
> + (void **)&preserved_folios);
> + if (err) {
> + pr_err("Failed to reserve folios property in FDT: %s\n",
> + fdt_strerror(err));
> + err = -ENOMEM;
> + goto err_free_fdt;
> + }
Yuk.
This really wants some luo helpers:
'luo alloc array'
'luo restore array'
'luo free array'
These would get a linearized list of pages in the vmap to hold the
array, allocate some structure to record the page list, and return the
u64 phys_addr of the top of that structure to store in whatever
serialization needs it.
Getting fdt to allocate the array inside the FDT is just not going to
work for anything of size.
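Something like this, just to sketch the shape (all names here are
hypothetical, not real code):

        /*
         * Back a flat array with preserved pages, vmap'd for the
         * caller, and hand back the phys_addr of the page-list
         * descriptor for the caller to store in its serialization.
         */
        struct luo_array {
                void *vaddr;       /* vmap of the linearized pages */
                u64 desc_phys;     /* phys of the page-list descriptor */
                size_t size;
        };

        int luo_alloc_array(struct luo_array *arr, size_t size);
        int luo_restore_array(struct luo_array *arr, u64 desc_phys, size_t size);
        void luo_free_array(struct luo_array *arr);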
> + for (; i < nr_pfolios; i++) {
> + const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> + phys_addr_t phys;
> + u64 index;
> + int flags;
> +
> + if (!pfolio->foliodesc)
> + continue;
> +
> + phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> + folio = kho_restore_folio(phys);
> + if (!folio) {
> + pr_err("Unable to restore folio at physical address: %llx\n",
> + phys);
> + goto put_file;
> + }
> + index = pfolio->index;
> + flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> +
> + /* Set up the folio for insertion. */
> + /*
> + * TODO: Should find a way to unify this and
> + * shmem_alloc_and_add_folio().
> + */
> + __folio_set_locked(folio);
> + __folio_set_swapbacked(folio);
>
> + ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> + if (ret) {
> + pr_err("shmem: failed to charge folio index %d: %d\n",
> + i, ret);
> + goto unlock_folio;
> + }
[..]
> + folio_add_lru(folio);
> + folio_unlock(folio);
> + folio_put(folio);
> + }
Probably some consolidation will be needed to make this less
duplicated..
But overall I think just using the memfd_luo_preserved_folio as the
serialization is entirely fine; I don't think this needs anything more
complicated.
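From the quoted restore loop the entry is presumably just a packed
PFN/flags word plus an index, i.e. something like (inferred, not the
actual definition):

        struct memfd_luo_preserved_folio {
                u64 foliodesc;     /* PFN | flags, see PRESERVED_FOLIO_PFN/_FLAGS */
                u64 index;         /* page cache index in the memfd */
        };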
What it does need is an alternative to the FDT with versioning.
Which seems to me to be entirely fine as:
struct memfd_luo_v0 {
        __aligned_u64 size;
        __aligned_u64 pos;
        __aligned_u64 folios;
};
struct memfd_luo_v0 memfd_luo_v0 = {
        .size = size,
        .pos = file->f_pos,
        .folios = folios,
};
luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
Which also shows that the actual data needing to be serialized comes
from more than one struct and has to be marshaled in code, somehow,
into a single struct.
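e.g., pulling the pieces together with the hypothetical
luo_alloc_array() from above (still just a sketch):

        struct luo_array arr;
        int err;

        err = luo_alloc_array(&arr, nr_pfolios * sizeof(struct memfd_luo_preserved_folio));
        if (err)
                return err;
        /* ... fill arr.vaddr with the preserved folio entries ... */

        struct memfd_luo_v0 memfd_luo_v0 = {
                .size = size,              /* from the inode */
                .pos = file->f_pos,        /* from the struct file */
                .folios = arr.desc_phys,   /* from the array helper */
        };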
Then I imagine a fairly simple forwards/backwards story. If something
new is needed that is non-optional, let's say you compress the folios
list to optimize holes:
struct memfd_luo_v1 {
        __aligned_u64 size;
        __aligned_u64 pos;
        __aligned_u64 folios_list_with_holes;
};
Obviously a v0 kernel cannot parse this, but in this case a v1-aware
kernel could optionally duplicate and write out the v0 format as well:
luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
Then the rule is fairly simple: when the successor kernel goes to
deserialize, it asks luo for the versions it supports:
if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
        restore_v1(&memfd_luo_v1);
else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
        restore_v0(&memfd_luo_v0);
else
        luo_failure("Do not understand this");
luo core just manages this list of versioned data per serialized
object, with at most one blob per version for each object.
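i.e. conceptually something like (hypothetical, just to illustrate the
bookkeeping):

        /* One blob per (object, version) pair: */
        struct luo_object_version {
                struct list_head node;     /* linked off the serialized object */
                u32 version;
                u64 data_phys;             /* preserved copy of the versioned struct */
                u32 data_len;
        };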
Jason