Message-ID: <mafs0y0owd187.fsf@kernel.org>
Date: Mon, 27 Oct 2025 12:37:44 +0100
From: Pratyush Yadav <pratyush@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Pratyush Yadav <pratyush@...nel.org>, Pasha Tatashin
<pasha.tatashin@...een.com>, jasonmiu@...gle.com, graf@...zon.com,
changyuanl@...gle.com, rppt@...nel.org, dmatlack@...gle.com,
rientjes@...gle.com, corbet@....net, rdunlap@...radead.org,
ilpo.jarvinen@...ux.intel.com, kanie@...ux.alibaba.com,
ojeda@...nel.org, aliceryhl@...gle.com, masahiroy@...nel.org,
akpm@...ux-foundation.org, tj@...nel.org, yoann.congal@...le.fr,
mmaurer@...gle.com, roman.gushchin@...ux.dev, chenridong@...wei.com,
axboe@...nel.dk, mark.rutland@....com, jannh@...gle.com,
vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com, joel.granados@...nel.org,
rostedt@...dmis.org, anna.schumaker@...cle.com, song@...nel.org,
zhangguopeng@...inos.cn, linux@...ssschuh.net,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
linux-mm@...ck.org, gregkh@...uxfoundation.org, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, rafael@...nel.org, dakr@...nel.org,
bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com,
Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com,
aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com, lennart@...ttering.net, brauner@...nel.org,
linux-api@...r.kernel.org, linux-fsdevel@...r.kernel.org,
saeedm@...dia.com, ajayachandra@...dia.com, parav@...dia.com,
leonro@...dia.com, witu@...dia.com, hughd@...gle.com,
skhawaja@...gle.com, chrisl@...nel.org, steven.sistare@...cle.com
Subject: Re: [PATCH v4 00/30] Live Update Orchestrator

On Mon, Oct 20 2025, Jason Gunthorpe wrote:
> On Tue, Oct 14, 2025 at 03:29:59PM +0200, Pratyush Yadav wrote:
>> > 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
>> > frozen, can't add/remove PFNs.
>>
>> Doesn't that circumvent LUO's state machine? The idea with the state
>> machine was to have clear points in time when the system goes into the
>> "limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
>> event.
>
> I wouldn't get too invested in the FSM, it is there but it doesn't
> mean every luo client has to be focused on it.

Having each subsystem maintain its own state machine sounds like a bad
idea to me. It can get tricky to manage, both for us and for our users.
>
>> With what you propose, the first FD being preserved implicitly
>> triggers the prepare event. Same thing for unprepare/cancel operations.
>
> Yes, this is easy to write and simple to manage.
>
>> I am wondering if it is better to do it the other way round: prepare all
>> files first, and then prepare the hugetlb subsystem at
>> LIVEUPDATE_PREPARE event. At that point it already knows which pages to
>> mark preserved so the serialization can be done in one go.
>
> I think this would be slower and more complex?
>
>> > 2) Require the users of hugetlb memory, like memfd, to
>> > preserve/restore the folios they are using (using their hugetlb order)
>> > 3) Just before kexec run over the PFN list and mark a bit if the folio
>> > was preserved by KHO or not. Make sure everything gets KHO
>> > preserved.
>>
>> "just before kexec" would need a callback from LUO. I suppose a
>> subsystem is the place for that callback. I wrote my email under the
>> (wrong) impression that we were replacing subsystems.
>
> The file descriptors path should have luo client ops that have all
> the required callbacks. This is probably an existing op.
>
>> That makes me wonder: how is the subsystem-level callback supposed to
>> access the global data? I suppose it can use the liveupdate_file_handler
>> directly, but it is kind of strange since technically the subsystem and
>> file handler are two different entities.
>
> If we need such things we would need a way to link these together, but
> I'm wondering if we really don't..
>
>> Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
>> I'm not sure how that would map with this shared global data. memfd and
>> guest_memfd will likely have different liveupdate_file_handler but would
>> share data from the same subsystem. Maybe that's a problem to solve for
>> later...
>
> On preserve memfd should call into hugetlb to activate it as a hugetlb
> page provider and preserve it too.

From what I understand, the main problem you want to solve is that the
life cycle of the global data should be tied to the file descriptors.
And since everything should have an FD anyway, can't we directly tie the
subsystems to file handlers? The subsystem gets a "preserve" callback
when the first FD that uses it gets preserved. It gets an "unpreserve"
callback when the last FD goes away. The rest of the state machine
(prepare, cancel, etc.) stays the same.

I think this gives us a clean abstraction with a LUO-managed lifetime.
It also works for both the guest_memfd and memfd cases, since both can
have hugetlb as their underlying subsystem. For example:

static const struct liveupdate_file_ops memfd_luo_file_ops = {
	.preserve	= memfd_luo_preserve,
	.unpreserve	= memfd_luo_unpreserve,
	[...]
	.subsystem	= &luo_hugetlb_subsys,
};
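
The subsystem object itself could be something along these lines (a
rough sketch with made-up names, just to show the shape I have in
mind):

struct liveupdate_subsystem {
	const char *name;
	/* Called when the first preserved FD starts using this subsystem. */
	int (*preserve)(struct liveupdate_subsystem *subsys);
	/* Called when the last preserved FD using this subsystem goes away. */
	void (*unpreserve)(struct liveupdate_subsystem *subsys);
	/* Number of preserved FDs using this subsystem. Managed by LUO. */
	unsigned int users;
};
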
And then luo_{un,}preserve_file() can keep a refcount for the subsystem
and preserve or unpreserve the subsystem as needed. LUO can manage the
locking for these callbacks too.
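
A minimal sketch of the refcounting in the LUO core (again hypothetical,
assuming a luo_lock mutex in the core and the struct above):

static int luo_subsystem_get(struct liveupdate_subsystem *subsys)
{
	int ret = 0;

	mutex_lock(&luo_lock);
	if (!subsys->users) {
		/* First user: preserve the subsystem's global state. */
		ret = subsys->preserve(subsys);
		if (ret)
			goto unlock;
	}
	subsys->users++;
unlock:
	mutex_unlock(&luo_lock);
	return ret;
}

static void luo_subsystem_put(struct liveupdate_subsystem *subsys)
{
	mutex_lock(&luo_lock);
	/* Last user gone: unpreserve the subsystem's global state. */
	if (!--subsys->users)
		subsys->unpreserve(subsys);
	mutex_unlock(&luo_lock);
}

luo_preserve_file() would call luo_subsystem_get() before the file
handler's ->preserve(), and luo_unpreserve_file() would call
luo_subsystem_put() after ->unpreserve().
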
--
Regards,
Pratyush Yadav