Message-ID: <20250903150157.GH470103@nvidia.com>
Date: Wed, 3 Sep 2025 12:01:57 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Pratyush Yadav <pratyush@...nel.org>
Cc: Pasha Tatashin <pasha.tatashin@...een.com>, jasonmiu@...gle.com,
graf@...zon.com, changyuanl@...gle.com, rppt@...nel.org,
dmatlack@...gle.com, rientjes@...gle.com, corbet@....net,
rdunlap@...radead.org, ilpo.jarvinen@...ux.intel.com,
kanie@...ux.alibaba.com, ojeda@...nel.org, aliceryhl@...gle.com,
masahiroy@...nel.org, akpm@...ux-foundation.org, tj@...nel.org,
yoann.congal@...le.fr, mmaurer@...gle.com, roman.gushchin@...ux.dev,
chenridong@...wei.com, axboe@...nel.dk, mark.rutland@....com,
jannh@...gle.com, vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com,
joel.granados@...nel.org, rostedt@...dmis.org,
anna.schumaker@...cle.com, song@...nel.org, zhangguopeng@...inos.cn,
linux@...ssschuh.net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, linux-mm@...ck.org,
gregkh@...uxfoundation.org, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, rafael@...nel.org, dakr@...nel.org,
bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com,
Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com,
aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com, lennart@...ttering.net,
brauner@...nel.org, linux-api@...r.kernel.org,
linux-fsdevel@...r.kernel.org, saeedm@...dia.com,
ajayachandra@...dia.com, parav@...dia.com, leonro@...dia.com,
witu@...dia.com
Subject: Re: [PATCH v3 29/30] luo: allow preserving memfd
On Wed, Sep 03, 2025 at 04:10:37PM +0200, Pratyush Yadav wrote:
> > So, it could be useful, but I wouldn't use it for memfd, the vmalloc
> > approach is better and we shouldn't optimize for sparseness, which
> > should never happen.
>
> I disagree. I think we are re-inventing the same data format with minor
> variations. I think we should define extensible fundamental data formats
> first, and then use those as the building blocks for the rest of our
> serialization logic.
page, vmalloc, and slab seem to me to be the fundamental units of memory
management in Linux, so they should get KHO support.

If you want to preserve a known-sized array you use vmalloc and then
write out the per-item entries. If it is a dictionary/sparse array then
you write an index with each item too. This is all trivial and doesn't
really need more abstraction in and of itself, IMHO.
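
Concretely, the two layouts I have in mind look roughly like the below.
This is only an illustration; none of these struct names exist anywhere:

#include <stdint.h>

/* Known-sized array: one vmalloc'd region, a count followed by the
 * items themselves. */
struct dense_array {
	uint64_t nr_items;
	uint64_t item[];		/* e.g. preserved phys addrs */
};

/* Dictionary/sparse array: each entry also records its own index into
 * the logical array. */
struct sparse_entry {
	uint64_t index;
	uint64_t value;
};

struct sparse_array {
	uint64_t nr_entries;
	struct sparse_entry entry[];
};
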
> cases can then build on top of it. For example, the preservation bitmaps
> can get rid of their linked list logic and just use KHO array to hold
> and retrieve its bitmaps. It will make the serialization simpler.
I don't think the bitmaps should; the serialization here is very
special because it is not actually preserved. It only exists while the
new kernel runs in scratch memory and is instantly freed once the
allocators start up.
> I also don't get why you think sparseness "should never happen". For
> memfd for example, you say in one of your other emails that "And again
> in real systems we expect memfd to be fully populated too." Which
> systems and use cases do you have in mind? Why do you think people won't
> want a sparse memfd?
memfd should principally be used to back VM memory, and I expect VM
memory to be fully populated. Why would it be sparse?
> All in all, I think KHO array is going to prove useful and will make
> serialization for subsystems easier. I think sparseness will also prove
> useful but it is not a hill I want to die on. I am fine with starting
> with a non-sparse array if people really insist. But I do think we
> should go with KHO array as a base instead of re-inventing the linked
> list of pages again and again.
The two main advantages I see to the KHO array design vs vmalloc are
that it should be a bit faster, as it doesn't establish a vmap, and that
it handles unknown-size lists much better.

Are these important considerations? IDK.
As I said to Chris, I think we should see more examples of what we
actually need before assuming any particular data structure is the best
choice.

So I'd rather stick to simpler open-coded things and go back and improve
them later than start out building the wrong shared data structure.
How about having at least three luo clients that show meaningful benefit
before proposing something beyond the fundamental page, vmalloc, and
slab things?
> What do you mean by "data per version"? I think there should be only one
> version of the serialized object. Multiple versions of the same thing
> will get ugly real quick.
If you want to support backwards/forwards compatibility then you
probably should support multiple versions as well. Otherwise it could
become quite hard to do downgrades.
Ideally I'd want to remove the upstream code for obsolete versions
fairly quickly so I'd imagine kernels will want to generate both
versions during the transition period and then eventually newer
kernels will only accept the new version.
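
In other words, something vaguely like this on the writer side during
the transition window (entirely made up, nothing here is existing
LUO/KHO API):

#include <stddef.h>
#include <stdint.h>

struct blob {
	uint32_t version;
	size_t len;
	const void *data;
};

/* Transition kernels carry writers for both formats; the restoring
 * kernel picks the newest version it understands. */
static int emit_v1(struct blob *out) { /* old layout */ return 0; }
static int emit_v2(struct blob *out) { /* new layout */ return 0; }

static int (* const emitters[])(struct blob *) = { emit_v2, emit_v1 };

Later kernels would just delete emit_v1 and refuse anything older than
v2 on restore.
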
I've argued before that supporting the extended matrix of any kernel
version to any other kernel version should lie with the distro/CSP
making the kernel fork. They know what their upgrade sequence will be,
so they can manage any missing versions to make it work.
Upstream should support something like v6.1 to v6.2 only, or something
similarly well constrained. I think this is a reasonable trade-off to
get subsystem maintainers to even accept this stuff at all.
> Other than that, I think this could work well. I am guessing luo_object
> stores the version and gives us a way to query it on the other side. I
> think if we are letting LUO manage supported versions, it should be
> richer than just a list of strings. I think it should include an ops
> structure for deserializing each version. That would encapsulate the
> versioning more cleanly.
Yeah, sounds about right
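
Something along these lines is what I'd picture, where each supported
version brings its own deserialize op (all the names below are invented
for illustration, this is not existing LUO code):

#include <stddef.h>
#include <stdint.h>

struct luo_version_ops {
	uint32_t version;
	int (*deserialize)(const void *data, size_t len);
};

static int memfd_restore_v1(const void *data, size_t len) { return 0; }
static int memfd_restore_v2(const void *data, size_t len) { return 0; }

/* LUO would match the version recorded in the incoming object against
 * this table and call the right op. */
static const struct luo_version_ops memfd_versions[] = {
	{ .version = 2, .deserialize = memfd_restore_v2 },
	{ .version = 1, .deserialize = memfd_restore_v1 },
};
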
Jason