Message-ID: <CAF8kJuPaSQN04M-pvpFTjjpzk3pfHNhpx+mCkvWpZOs=0TF3gg@mail.gmail.com>
Date: Fri, 29 Aug 2025 12:18:43 -0700
From: Chris Li <chrisl@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Pasha Tatashin <pasha.tatashin@...een.com>, pratyush@...nel.org, jasonmiu@...gle.com, 
	graf@...zon.com, changyuanl@...gle.com, rppt@...nel.org, dmatlack@...gle.com, 
	rientjes@...gle.com, corbet@....net, rdunlap@...radead.org, 
	ilpo.jarvinen@...ux.intel.com, kanie@...ux.alibaba.com, ojeda@...nel.org, 
	aliceryhl@...gle.com, masahiroy@...nel.org, akpm@...ux-foundation.org, 
	tj@...nel.org, yoann.congal@...le.fr, mmaurer@...gle.com, 
	roman.gushchin@...ux.dev, chenridong@...wei.com, axboe@...nel.dk, 
	mark.rutland@....com, jannh@...gle.com, vincent.guittot@...aro.org, 
	hannes@...xchg.org, dan.j.williams@...el.com, david@...hat.com, 
	joel.granados@...nel.org, rostedt@...dmis.org, anna.schumaker@...cle.com, 
	song@...nel.org, zhangguopeng@...inos.cn, linux@...ssschuh.net, 
	linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, linux-mm@...ck.org, 
	gregkh@...uxfoundation.org, tglx@...utronix.de, mingo@...hat.com, 
	bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, 
	rafael@...nel.org, dakr@...nel.org, bartosz.golaszewski@...aro.org, 
	cw00.choi@...sung.com, myungjoo.ham@...sung.com, yesanishhere@...il.com, 
	Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com, 
	aleksander.lobakin@...el.com, ira.weiny@...el.com, 
	andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de, 
	bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com, 
	stuart.w.hayes@...il.com, ptyadav@...zon.de, lennart@...ttering.net, 
	brauner@...nel.org, linux-api@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	saeedm@...dia.com, ajayachandra@...dia.com, parav@...dia.com, 
	leonro@...dia.com, witu@...dia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd

On Tue, Aug 26, 2025 at 9:20 AM Jason Gunthorpe <jgg@...dia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>
> > +     /*
> > +      * Most of the space should be taken by preserved folios. So take its
> > +      * size, plus a page for other properties.
> > +      */
> > +     fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > +     if (!fdt) {
> > +             err = -ENOMEM;
> > +             goto err_unpin;
> > +     }
>
> This doesn't seem to have any versioning scheme, it really should..
>
> > +     err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +                                    (void **)&preserved_folios);
> > +     if (err) {
> > +             pr_err("Failed to reserve folios property in FDT: %s\n",
> > +                    fdt_strerror(err));
> > +             err = -ENOMEM;
> > +             goto err_free_fdt;
> > +     }
>
> Yuk.
>
> This really wants some luo helper
>
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'

Yes, that would be a step forward.
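
A rough sketch of what those helpers might look like (all names and
signatures here are hypothetical, not from the patch):

	/*
	 * Allocate a linear array of @count elements of @elm_size backed
	 * by preservable pages; fills @phys with the physical handle to
	 * be stored in the serialized state.
	 */
	void *luo_alloc_array(size_t elm_size, size_t count, u64 *phys);

	/* In the successor kernel: recover the mapping from the handle. */
	void *luo_restore_array(u64 phys, size_t elm_size, size_t count);

	/* Unpreserve and free the array, e.g. on prepare() failure. */
	void luo_free_array(void *arr, size_t elm_size, size_t count);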

Another idea is to have a middle layer that manages the life cycle of
the preserved memory for you, kind of like a slab allocator for
preserved memory. It would allow bulk free: if there is an error
during live update prepare(), you need to free all previously
allocated memory anyway. If some preserved memory needs to stay long
after the live update kernel boots up, a special flag could mark it so
that it does not go into the free_all pool.
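
To sketch such a middle layer (again, every name here is made up for
illustration):

	struct luo_pool;			/* tracks all allocations */

	#define LUO_POOL_KEEP	BIT(0)		/* survives free_all() */

	struct luo_pool *luo_pool_create(void);
	void *luo_pool_alloc(struct luo_pool *pool, size_t size,
			     unsigned int flags);

	/* On prepare() failure: bulk-free everything not marked KEEP. */
	void luo_pool_free_all(struct luo_pool *pool);

A failed prepare() would then be a single luo_pool_free_all() call
instead of unwinding every allocation by hand.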
>
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
>
> Getting fdt to allocate the array inside the fdt is just not going to
> work for anything of size.
>
> > +     for (; i < nr_pfolios; i++) {
> > +             const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> > +             phys_addr_t phys;
> > +             u64 index;
> > +             int flags;
> > +
> > +             if (!pfolio->foliodesc)
> > +                     continue;
> > +
> > +             phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> > +             folio = kho_restore_folio(phys);
> > +             if (!folio) {
> > +                     pr_err("Unable to restore folio at physical address: %llx\n",
> > +                            phys);
> > +                     goto put_file;
> > +             }
> > +             index = pfolio->index;
> > +             flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> > +
> > +             /* Set up the folio for insertion. */
> > +             /*
> > +              * TODO: Should find a way to unify this and
> > +              * shmem_alloc_and_add_folio().
> > +              */
> > +             __folio_set_locked(folio);
> > +             __folio_set_swapbacked(folio);
> >
> > +             ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> > +             if (ret) {
> > +                     pr_err("shmem: failed to charge folio index %d: %d\n",
> > +                            i, ret);
> > +                     goto unlock_folio;
> > +             }
>
> [..]
>
> > +             folio_add_lru(folio);
> > +             folio_unlock(folio);
> > +             folio_put(folio);
> > +     }
>
> Probably some consolidation will be needed to make this less
> duplicated..
>
> But overall I think just using the memfd_luo_preserved_folio as the
> serialization is entirely fine, I don't think this needs anything more
> complicated.
>
> What it does need is an alternative to the FDT with versioning.
>
> Which seems to me to be entirely fine as:
>
>  struct memfd_luo_v0 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios;
>  };
>
>  struct memfd_luo_v0 memfd_luo_v0 = {.size = size, .pos = file->f_pos, .folios = folios};
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>
> Which also shows the actual data needing to be serialized comes from
> more than one struct and has to be marshaled in code, somehow, to a
> single struct.
>
> Then I imagine a fairly simple forwards/backwards story. If something
> new is needed that is non-optional, let's say you compress the folios
> list to optimize holes:
>
>  struct memfd_luo_v1 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios_list_with_holes;
>  };
>
> Obviously a v0 kernel cannot parse this, but in this case a v1 aware
> kernel could optionally duplicate and write out the v0 format as well:
>
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

Question: do we have an FDT node hierarchy that matches the memfd C
structure hierarchy? Otherwise all the C structs will be lumped into
one FDT node. Maybe one FDT node for all the C structs is fine, but
then there is a risk of overflowing the 4K buffer limit on the FDT
node.

I would like the versioning to be independent of FDT.

FDT at the top level sounds OK: not ideal, but workable. But we are
getting deeper and deeper into complex internal data structures. Do we
still want every data structure to be referenced by an FDT identifier?
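
For example, with the libfdt read-write API, giving each preserved
object its own subnode would look roughly like this (the node and
property names are made up for illustration, and error handling is
omitted):

	int parent, node;

	parent = fdt_add_subnode(fdt, 0, "memfd");
	node = fdt_add_subnode(fdt, parent, "file-0");
	fdt_setprop_u64(fdt, node, "size", size);
	fdt_setprop_u64(fdt, node, "pos", pos);
	fdt_setprop_u64(fdt, node, "folios", folios_phys);

Every extra level of the C structure hierarchy would need its own
subnode like this, which is where the 4K buffer concern comes from.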

> Then the rule is fairly simple, when the successor kernel goes to
> deserialize it asks luo for the versions it supports:
>
>  if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
>     restore_v1(&memfd_luo_v1)
>  else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
>     restore_v0(&memfd_luo_v0)
>  else
>     luo_failure("Do not understand this");
>
> luo core just manages this list of versioned data per serialized
> object. There is only one version per object.

Obviously, this can be done.

Is that the approach you want to expand to every other C struct as
well? See the FDT node complexity above.
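
To make the scaling question concrete, here is roughly what every
other component would end up hand-rolling with this approach (the
"foo" names are invented for illustration):

	/* Yet another component repeats the same dance, per version. */
	struct foo_luo_v0 {
		__aligned_u64 state_phys;
		__aligned_u64 nr_entries;
	};

	struct foo_luo_v0 v0 = {
		.state_phys = phys,
		.nr_entries = nr_entries,
	};
	luo_store_object(&v0, sizeof(v0), foo_id, /*version=*/0);

Multiply that by every subsystem that wants to be preserved, and by
every version bump in each of them.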

I am getting the feeling that we are hand-crafting screws to build an
airplane. Can it be done? Of course. Does it scale well? I am not
sure. Many developers are currently hand-crafting this kind of screw
for the different components of the airplane.

We need a machine that can stamp out screws to our specifications,
faster. I want such a machine. Other developers might want one as
well.

The initial discussion of the idea of such a machine has been rather
discouraging. There are huge communication barriers because of the
fixation on hand-crafted screws. I understand that exploring such
machine ideas might distract an engineer from hand-crafting more
screws, but one of them might realize: oh, I want such a machine as
well.

At this stage, do you see exploring such a machine idea as beneficial
or harmful to the project? If the idea is considered harmful, we
should stop discussing it at all and go back to building more batches
of hand-crafted screws, which the next critical component is waiting
for.

Also, suppose such a machine can produce screws to your specification,
but they have a different look and feel than the hand-crafted ones,
and we can stamp them out faster. Would you consider putting such a
machined screw on the most critical component of your engine?

Best Regards,

Chris
