Message-ID: <mafs0o6qaltb2.fsf@kernel.org>
Date: Mon, 13 Oct 2025 17:23:13 +0200
From: Pratyush Yadav <pratyush@...nel.org>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: Pratyush Yadav <pratyush@...nel.org>, jasonmiu@...gle.com,
graf@...zon.com, changyuanl@...gle.com, rppt@...nel.org,
dmatlack@...gle.com, rientjes@...gle.com, corbet@....net,
rdunlap@...radead.org, ilpo.jarvinen@...ux.intel.com,
kanie@...ux.alibaba.com, ojeda@...nel.org, aliceryhl@...gle.com,
masahiroy@...nel.org, akpm@...ux-foundation.org, tj@...nel.org,
yoann.congal@...le.fr, mmaurer@...gle.com, roman.gushchin@...ux.dev,
chenridong@...wei.com, axboe@...nel.dk, mark.rutland@....com,
jannh@...gle.com, vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com, joel.granados@...nel.org,
rostedt@...dmis.org, anna.schumaker@...cle.com, song@...nel.org,
zhangguopeng@...inos.cn, linux@...ssschuh.net,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
linux-mm@...ck.org, gregkh@...uxfoundation.org, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, rafael@...nel.org, dakr@...nel.org,
bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com,
Jonathan.Cameron@...wei.com, quic_zijuhu@...cinc.com,
aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com, lennart@...ttering.net, brauner@...nel.org,
linux-api@...r.kernel.org, linux-fsdevel@...r.kernel.org,
saeedm@...dia.com, ajayachandra@...dia.com, jgg@...dia.com,
parav@...dia.com, leonro@...dia.com, witu@...dia.com,
hughd@...gle.com, skhawaja@...gle.com, chrisl@...nel.org,
steven.sistare@...cle.com
Subject: Re: [PATCH v4 00/30] Live Update Orchestrator
On Thu, Oct 09 2025, Pasha Tatashin wrote:
> On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@...nel.org> wrote:
>>
>> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>>
>> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
>> > <pasha.tatashin@...een.com> wrote:
>> >>
>> [...]
>> > 4. New File-Lifecycle-Bound Global State
>> > ----------------------------------------
>> > A new mechanism for managing global state was proposed, designed to be
>> > tied to the lifecycle of the preserved files themselves. This would
>> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
>> > global state that is only relevant when one or more of its FDs are
>> > being managed by LUO.
>>
>> Is this going to replace LUO subsystems? If yes, then why? The global
>> state will likely need to have its own lifecycle just like the FDs, and
>> subsystems are a simple and clean abstraction to control that. I get the
>> idea of only "activating" a subsystem when one or more of its FDs are
>> participating in LUO, but we can do that while keeping subsystems
>> around.
>
> Thanks for the feedback. The FLB Global State is not replacing the LUO
> subsystems. On the contrary, it's a higher-level abstraction that is
> itself implemented as a LUO subsystem. The goal is to provide a
> solution for a pattern that emerged during the PCI and IOMMU
> discussions.
Okay, makes sense then. I thought we were removing the subsystems idea.
I didn't follow the PCI and IOMMU discussions that closely.
Side note: I see a dependency forming between subsystems. For example,
the FLB subsystem probably wants to make sure all its dependent
subsystems (like LUO files) go through their callbacks before its own
callback runs. Maybe in the current implementation any order works, but
in general, if it manages data belonging to other subsystems, it should
be serialized after them.
Same with the hugetlb subsystem, for example. At prepare or freeze time,
it would probably be a good idea for the files' callbacks to finish
first. I would imagine most subsystems would want to run after files.
With the current registration mechanism, the order depends on when each
subsystem is registered, which is hard to control. Maybe we should have
a global list of subsystems where the order can be specified manually?
Not sure if that is a good idea, just throwing it out there off the top
of my head.
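To make that concrete, here is a userspace toy of what explicit
ordering could look like: subsystems register with a priority, lowest
first, so callback order stops depending on init order. All names and
the priority scheme here are invented for illustration, not actual LUO
code:

```c
#include <assert.h>
#include <string.h>

/* Toy model: subsystems carry an explicit priority so that callback
 * order no longer depends on registration (init) order. Lower value
 * runs first, so "files" would run before subsystems that manage data
 * on their behalf. All identifiers are invented. */
struct luo_subsystem {
	const char *name;
	int prio;			/* lower value = earlier callbacks */
	struct luo_subsystem *next;
};

static struct luo_subsystem *luo_subsystems;

/* Insert into the global list, keeping it sorted by prio. */
static void luo_register_subsystem(struct luo_subsystem *s)
{
	struct luo_subsystem **p = &luo_subsystems;

	while (*p && (*p)->prio <= s->prio)
		p = &(*p)->next;
	s->next = *p;
	*p = s;
}

/* Record the order prepare() callbacks would run in. */
static int luo_callback_order(const char **names, int max)
{
	int n = 0;
	struct luo_subsystem *s;

	for (s = luo_subsystems; s && n < max; s = s->next)
		names[n++] = s->name;
	return n;
}
```

With this, registering "flb", "luo-files", "hugetlb" in any order still
yields files first and FLB last.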
>
> You can see the WIP implementation here, which shows it registering as
> a subsystem named "luo-fh-states-v1-struct":
> https://github.com/soleen/linux/commit/94e191aab6b355d83633718bc4a1d27dda390001
>
> The existing subsystem API is a low-level tool that provides for the
> preservation of a raw 8-byte handle. It doesn't provide locking, nor
> is it explicitly tied to the lifecycle of any higher-level object like
> a file handler. The new API is designed to solve a more specific
> problem: allowing global components (like IOMMU or PCI) to
> automatically track when resources relevant to them are added to or
> removed from preservation. If HugeTLB requires a subsystem, it can
> still use it, but I suspect it might benefit from FLB Global State as
> well.
Hmm, right. Let me see how I can make use of it.
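For reference, the shape I have in mind for hugetlb is: track preserved
folios in cheap runtime state as FDs are preserved, then flatten into
serialized state once at prepare time, when no more FDs can be added. A
userspace toy of that split (all names invented; plain arrays stand in
for the xarray and for preserved memory):

```c
#include <assert.h>

/* Toy model of hugetlb's two-phase state: preserved folios are tracked
 * in runtime data as FDs get preserved, and only flattened into a
 * serialized blob once, from the subsystem's prepare() callback. All
 * identifiers are invented for illustration. */
#define MAX_FOLIOS 64

static unsigned long runtime_folios[MAX_FOLIOS];	/* stand-in for an xarray */
static int nr_runtime;

static unsigned long serialized_folios[MAX_FOLIOS];	/* stand-in for preserved memory */
static int nr_serialized;

/* Called per-FD at preserve time: cheap, runtime-only bookkeeping. */
static void hugetlb_luo_preserve_folio(unsigned long pfn)
{
	runtime_folios[nr_runtime++] = pfn;
}

/* Called once at system-wide prepare: no more FDs will be added, so
 * flatten the runtime state into serialized form. Returns the number
 * of serialized entries. */
static int hugetlb_luo_prepare(void)
{
	int i;

	for (i = 0; i < nr_runtime; i++)
		serialized_folios[i] = runtime_folios[i];
	nr_serialized = nr_runtime;
	return nr_serialized;
}
```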
>
>> Here is how I imagine the proposed API would compare against subsystems
>> with hugetlb as an example (hugetlb support is still WIP, so I'm still
>> not clear on specifics, but this is how I imagine it will work):
>>
>> - Hugetlb subsystem needs to track its huge page pools and which pages
>> are allocated and free. This is its global state. The pools get
>> reconstructed after kexec. Post-kexec, the free pages are ready for
>> allocation from other "regular" files and the pages used in LUO files
>> are reserved.
>>
>> - Pre-kexec, when a hugetlb FD is preserved, it marks the FD as
>> preserved in hugetlb's global data structure tracking this. This is
>> runtime data (say, an xarray), and _not_ serialized data. The reason
>> is that more FDs are likely to come, so there is no point in wasting
>> time serializing just yet.
>>
>> This can look something like:
>>
>> hugetlb_luo_preserve_folio(folio, ...);
>>
>> Nice and simple.
>>
>> Compare this with the new proposed API:
>>
>> liveupdate_fh_global_state_get(h, &hugetlb_data);
>> // This will have to update serialized state now.
>> hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>> liveupdate_fh_global_state_put(h);
>>
>> We do the same thing but in a very complicated way.
>>
>> - When the system-wide preserve happens, the hugetlb subsystem gets a
>> callback to serialize. It converts its runtime global state to
>> serialized state since now it knows no more FDs will be added.
>>
>> With the new API, this doesn't need to be done since each FD prepare
>> already updates serialized state.
>>
>> - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>> anything in LUO. This is same as new API.
>>
>> - If some hugetlb FDs are not restored after liveupdate and the finish
>> event is triggered, the subsystem gets its finish() handler called and
>> it can free things up.
>>
>> I don't get how that would work with the new API.
>
> The new API isn't more complicated; it codifies the common pattern of
> "create on first use, destroy on last use" into a reusable helper,
> saving each file handler from having to reinvent the same reference
> counting and locking scheme. But, as you point out, subsystems provide
> more control; specifically, they handle full creation and freeing
> instead of relying on file handlers for that.
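Right, so the helper is essentially this pattern, if I understand it
correctly. A userspace toy model (invented names, and with the locking
you mention omitted for brevity):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of "create on first use, destroy on last use": the global
 * state comes into existence when the first preserved FD needs it and
 * goes away with the last one. Identifiers are invented; the real API
 * would also provide locking around these transitions. */
struct flb_state {
	int refcount;
	int private_data;	/* stand-in for the owner's global state */
};

static struct flb_state *flb_state;

/* First get allocates the state; later gets just take a reference. */
static struct flb_state *flb_state_get(void)
{
	if (!flb_state)
		flb_state = calloc(1, sizeof(*flb_state));
	flb_state->refcount++;
	return flb_state;
}

/* Last put frees the state, so it only exists while FDs reference it. */
static void flb_state_put(void)
{
	if (--flb_state->refcount == 0) {
		free(flb_state);
		flb_state = NULL;
	}
}
```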
>
>> My point is, I see subsystems working perfectly fine here and I don't
>> get how the proposed API is any better.
>>
>> Am I missing something?
>
> No, I don't think you are. Your analysis is correct that this is
> achievable with subsystems. The goal of the new API is to make that
> specific, common use case simpler.
Right. Thanks for clarifying.
>
> Pasha
--
Regards,
Pratyush Yadav