Message-ID: <mafs0bjm9lig8.fsf@kernel.org>
Date: Tue, 14 Oct 2025 15:29:59 +0200
From: Pratyush Yadav <pratyush@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Pasha Tatashin <pasha.tatashin@...een.com>, Pratyush Yadav
<pratyush@...nel.org>, jasonmiu@...gle.com, graf@...zon.com,
changyuanl@...gle.com, rppt@...nel.org, dmatlack@...gle.com,
rientjes@...gle.com, corbet@....net, rdunlap@...radead.org,
ilpo.jarvinen@...ux.intel.com, kanie@...ux.alibaba.com,
ojeda@...nel.org, aliceryhl@...gle.com, masahiroy@...nel.org,
akpm@...ux-foundation.org, tj@...nel.org, yoann.congal@...le.fr,
mmaurer@...gle.com, roman.gushchin@...ux.dev, chenridong@...wei.com,
axboe@...nel.dk, mark.rutland@....com, jannh@...gle.com,
vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com, joel.granados@...nel.org,
rostedt@...dmis.org, anna.schumaker@...cle.com, song@...nel.org,
zhangguopeng@...inos.cn, linux@...ssschuh.net,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
linux-mm@...ck.org, gregkh@...uxfoundation.org, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, rafael@...nel.org, dakr@...nel.org,
bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com,
Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com,
aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com, lennart@...ttering.net, brauner@...nel.org,
linux-api@...r.kernel.org, linux-fsdevel@...r.kernel.org,
saeedm@...dia.com, ajayachandra@...dia.com, parav@...dia.com,
leonro@...dia.com, witu@...dia.com, hughd@...gle.com,
skhawaja@...gle.com, chrisl@...nel.org, steven.sistare@...cle.com
Subject: Re: [PATCH v4 00/30] Live Update Orchestrator
On Fri, Oct 10 2025, Jason Gunthorpe wrote:
> On Thu, Oct 09, 2025 at 07:50:12PM -0400, Pasha Tatashin wrote:
>> > This can look something like:
>> >
>> > hugetlb_luo_preserve_folio(folio, ...);
>> >
>> > Nice and simple.
>> >
>> > Compare this with the new proposed API:
>> >
>> > liveupdate_fh_global_state_get(h, &hugetlb_data);
>> > // This will now update the serialized state.
>> > hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>> > liveupdate_fh_global_state_put(h);
>> >
>> > We do the same thing but in a very complicated way.
>> >
>> > - When the system-wide preserve happens, the hugetlb subsystem gets a
>> > callback to serialize. It converts its runtime global state to
>> > serialized state since now it knows no more FDs will be added.
>> >
>> > With the new API, this doesn't need to be done since each FD prepare
>> > already updates serialized state.
>> >
>> > - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>> > anything in LUO. This is the same as with the new API.
>> >
>> > - If some hugetlb FDs are not restored after liveupdate and the finish
>> > event is triggered, the subsystem gets its finish() handler called and
>> > it can free things up.
>> >
>> > I don't get how that would work with the new API.
>>
>> The new API isn't more complicated; it codifies the common pattern of
>> "create on first use, destroy on last use" into a reusable helper,
>> saving each file handler from having to reinvent the same reference
>> counting and locking scheme. But, as you point out, subsystems provide
>> more control: specifically, they handle full creation/freeing
>> themselves instead of relying on file handlers for that.
>
> I'd say hugetlb *should* be doing the more complicated thing. We
> should not have global static data for luo floating around the kernel;
> this is too easily abused in bad ways.
Not sure how much difference this makes in practice, but I get your
point.
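
FWIW, my mental model of the proposed helper is roughly the below. The
internals are completely made up by me; it is only meant to show the
"create on first use, destroy on last use" refcounting that each file
handler would otherwise have to open-code:

#include <linux/mutex.h>
#include <linux/slab.h>

struct luo_fh_global_state {
        struct mutex lock;
        unsigned int users;     /* outstanding get()s */
        size_t size;            /* size of the handler-private state */
        void *data;             /* handler-private serialized state */
};

int liveupdate_fh_global_state_get(struct luo_fh_global_state *st,
                                   void **datap)
{
        int ret = 0;

        mutex_lock(&st->lock);
        if (!st->users) {
                /* First user: create the handler's global state. */
                st->data = kzalloc(st->size, GFP_KERNEL);
                if (!st->data) {
                        ret = -ENOMEM;
                        goto out;
                }
        }
        st->users++;
        *datap = st->data;
out:
        mutex_unlock(&st->lock);
        return ret;
}

void liveupdate_fh_global_state_put(struct luo_fh_global_state *st)
{
        mutex_lock(&st->lock);
        if (!--st->users) {
                /* Last user gone: destroy the global state. */
                kfree(st->data);
                st->data = NULL;
        }
        mutex_unlock(&st->lock);
}

The handler only touches its own data between get() and put(), and LUO
owns the lifecycle and locking.
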
>
> The above "complicated" sequence forces the caller to have a fd
> session handle, and "hides" the global state inside luo so the
> subsystem can't just randomly reach into it whenever it likes.
>
> This is a deliberate and violent way to force clean coding practices
> and good layering.
>
> Not sure why hugetlb pools would need another xarray??
I'm not sure either. I used it to demonstrate my point of keeping
runtime state and serialized state separate from each other.
>
> 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
> frozen, can't add/remove PFNs.
Doesn't that circumvent LUO's state machine? The idea with the state
machine was to have clear points in time when the system goes into the
"limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
event. With what you propose, the first FD being preserved implicitly
triggers the prepare event, and the same goes for the unprepare/cancel
operations.

I am wondering if it is better to do it the other way round: prepare all
files first, and then prepare the hugetlb subsystem at the
LIVEUPDATE_PREPARE event. At that point it already knows which pages to
mark preserved, so the serialization can be done in one go.
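
To make that concrete, the subsystem's prepare step could look roughly
like the below. All the hugetlb_luo_* names and the pool iterator are
invented for illustration; only vmalloc_array(), bitmap_zalloc() and
folio_pfn() are real:

#include <linux/bitmap.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Invented serialized state for the hugetlb pool. */
struct hugetlb_luo_ser {
        unsigned long *pfns;            /* every folio in the (now frozen) pool */
        unsigned long nr;
        unsigned long *user_preserved;  /* bit set: some FD preserved this folio */
};

static int hugetlb_luo_prepare(struct hugetlb_luo_ser *ser)
{
        unsigned long n = hugetlb_luo_pool_size();      /* invented */
        struct folio *folio;
        unsigned long i = 0;

        ser->pfns = vmalloc_array(n, sizeof(*ser->pfns));
        ser->user_preserved = bitmap_zalloc(n, GFP_KERNEL);
        if (!ser->pfns || !ser->user_preserved) {
                vfree(ser->pfns);
                bitmap_free(ser->user_preserved);
                return -ENOMEM;
        }

        /*
         * This runs at the LIVEUPDATE_PREPARE event, i.e. after every FD
         * has been prepared, so we already know which folios the memfds
         * need and can freeze and serialize the pool in one go.
         */
        hugetlb_luo_for_each_pool_folio(folio) {        /* invented iterator */
                ser->pfns[i] = folio_pfn(folio);
                if (hugetlb_luo_fd_wants(folio))        /* marked during FD prepare */
                        __set_bit(i, ser->user_preserved);
                i++;
        }
        ser->nr = i;

        return 0;
}

Unprepare/cancel would then just free the list and bitmap and unfreeze
the pool.
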
> 2) Require the users of hugetlb memory, like memfd, to
> preserve/restore the folios they are using (using their hugetlb order)
> 3) Just before kexec, run over the PFN list and mark a bit if the folio
> was preserved by KHO or not. Make sure everything gets KHO
> preserved.
"just before kexec" would need a callback from LUO. I suppose a
subsystem is the place for that callback. I wrote my email under the
(wrong) impression that we were replacing subsystems.
That makes me wonder: how is the subsystem-level callback supposed to
access the global data? I suppose it can use the liveupdate_file_handler
directly, but that is a bit strange since, technically, the subsystem
and the file handler are two different entities.
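
For example, in your scheme the "just before kexec" walk could look
something like the below, reusing the made-up struct from the earlier
sketch and assuming kho_preserve_folio() roughly as in the KHO series.
The hugetlb_luo_shared pointer is exactly the kind of global you want to
avoid, so the open question is really where it should hang off instead:

#include <linux/kexec_handover.h>

/* Set up at prepare time; see the sketch above. */
static struct hugetlb_luo_ser *hugetlb_luo_shared;

static int hugetlb_luo_pre_kexec(void)
{
        struct hugetlb_luo_ser *ser = hugetlb_luo_shared;
        unsigned long i;
        int err;

        for (i = 0; i < ser->nr; i++) {
                /* Folios some FD already KHO-preserved keep their bit set. */
                if (test_bit(i, ser->user_preserved))
                        continue;

                /* Preserve the rest so the whole pool survives the kexec. */
                err = kho_preserve_folio(pfn_folio(ser->pfns[i]));
                if (err)
                        return err;
        }

        return 0;
}
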
Also, as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
I'm not sure how that would map onto this shared global data. memfd and
guest_memfd will likely have different liveupdate_file_handlers but
would share data from the same subsystem. Maybe that's a problem to
solve later...
>
> Restore puts the PFNs that were not preserved directly into the free
> pool; the end user of the folio, like the memfd, restores and
> eventually frees the other folios normally.
Yeah, on the restore side this idea works fine, I think.
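
I imagine the restore path in the new kernel as something like the
below, again with made-up names and assuming kho_restore_folio() roughly
as in KHO:

#include <linux/pfn.h>

static void hugetlb_luo_restore_pool(struct hugetlb_luo_ser *ser)
{
        unsigned long i;

        for (i = 0; i < ser->nr; i++) {
                struct folio *folio;

                /*
                 * Folios an FD preserved are restored (and eventually
                 * freed) by their owner, e.g. the memfd.
                 */
                if (test_bit(i, ser->user_preserved))
                        continue;

                /* Everything else goes straight back into the free pool. */
                folio = kho_restore_folio(PFN_PHYS(ser->pfns[i]));
                if (folio)
                        hugetlb_luo_free_to_pool(folio);        /* invented */
        }
}
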
>
> It is simple and fits nicely into the infrastructure here, where the
> first time you trigger a global state it builds the PFN list and does
> the freezing, and the lifecycle and locking for this operation are
> directly managed by luo.
>
> The memfd, when it knows it has hugetlb folios inside it, would
> trigger this.
>
> Jason
--
Regards,
Pratyush Yadav