Message-ID: <CALzav=dhuoaS73ikufCf2D11Vq=jfMceYv0abdMxOdaHzmVR0g@mail.gmail.com>
Date: Thu, 26 Jun 2025 09:24:35 -0700
From: David Matlack <dmatlack@...gle.com>
To: Pratyush Yadav <pratyush@...nel.org>
Cc: Christian Brauner <brauner@...nel.org>, Pasha Tatashin <pasha.tatashin@...een.com>, jasonmiu@...gle.com,
graf@...zon.com, changyuanl@...gle.com, rppt@...nel.org, rientjes@...gle.com,
corbet@....net, rdunlap@...radead.org, ilpo.jarvinen@...ux.intel.com,
kanie@...ux.alibaba.com, ojeda@...nel.org, aliceryhl@...gle.com,
masahiroy@...nel.org, akpm@...ux-foundation.org, tj@...nel.org,
yoann.congal@...le.fr, mmaurer@...gle.com, roman.gushchin@...ux.dev,
chenridong@...wei.com, axboe@...nel.dk, mark.rutland@....com,
jannh@...gle.com, vincent.guittot@...aro.org, hannes@...xchg.org,
dan.j.williams@...el.com, david@...hat.com, joel.granados@...nel.org,
rostedt@...dmis.org, anna.schumaker@...cle.com, song@...nel.org,
zhangguopeng@...inos.cn, linux@...ssschuh.net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, linux-mm@...ck.org, gregkh@...uxfoundation.org,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, rafael@...nel.org,
dakr@...nel.org, bartosz.golaszewski@...aro.org, cw00.choi@...sung.com,
myungjoo.ham@...sung.com, yesanishhere@...il.com, Jonathan.Cameron@...wei.com,
quic_zijuhu@...cinc.com, aleksander.lobakin@...el.com, ira.weiny@...el.com,
andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de,
bhelgaas@...gle.com, wagi@...nel.org, djeffery@...hat.com,
stuart.w.hayes@...il.com
Subject: Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@...nel.org> wrote:
>
> On Wed, Jun 25 2025, David Matlack wrote:
>
> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@...nel.org> wrote:
> >> >
> >> > While I agree that a filesystem offers superior introspection and
> >> > integration with standard tools, building this complex, stateful
> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> > direct and explicit way to command the state machine and manage these
> >> > complex lifecycle and dependency rules.
> >>
> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> but...
> >>
> >> You're using a character device that's tied to devtmpfs. In other words,
> >> you're already using a filesystem interface. Literally the whole code
> >> here is built on top of filesystem APIs. So this argument is just very
> >> wrong imho. If you can build it on top of a character device using VFS
> >> interfaces you can do it as a minimal filesystem.
> >>
> >> You're free to define the filesystem interface any way you like it. We
> >> have a ton of examples there. All your ioctls would just be tied to the
> >> filesystem instance instead of the /dev/somethingsomething character
> >> device. The state machine could just be implemented the same way.
> >>
> >> One of my points is that with an fs interface you can have easy state
> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> machines running as services or some networking services or whatever.
> >> You could just bind-mount an instance of kexecfs into the service and
> >> the service can persist state into the instance and easily recover it
> >> after kexec.
> >
> > This approach sounds worth exploring more. It would avoid the need for
> > a centralized daemon to mediate the preservation and restoration of
> > all file descriptors.
>
> One of the jobs of the centralized daemon is to decide the _policy_ of
> who gets to preserve things and more importantly, make sure the right
> party unpreserves the right FDs after a kexec. I don't see how this
> interface fixes this problem. You would still need a way to identify
> which kexecfs instance belongs to who and enforce that. The kernel
> probably shouldn't be the one doing this kind of policy so you still
> need some userspace component to make those decisions.

The main benefit I see in kexecfs is that it avoids needing to send
all FDs over UDS to/from liveupdated, and therefore the need for
dynamic cross-process communication (e.g. RPCs).

Instead, something just needs to set up a kexecfs for each VM when it
is created, and give the same kexecfs back to each VM after kexec.
Then VMs are free to save/restore any FDs in that kexecfs without
cross-process communication or transferring file descriptors.

Policy can be enforced by controlling access to kexecfs mounts. This
naturally fits into the standard architecture of running untrusted VMs
(e.g. using chroots and containers to enforce security and isolation).

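For illustration, here is a minimal userspace sketch of the FD passing
that a per-VM kexecfs would make unnecessary: handing a file descriptor
to another process over a Unix domain socket (SCM_RIGHTS). It is not
code from this series. The socketpair stands in for a VM's connection
to liveupdated, the pipe stands in for an FD to be preserved, and the
"vm0-fd" tag is a made-up label. It needs Python 3.9+ on a Unix system.

```python
import os
import socket


def send_fd(sock, fd, tag):
    """Send one file descriptor plus a short tag over a Unix domain socket."""
    socket.send_fds(sock, [tag], [fd])


def recv_fd(sock):
    """Receive one file descriptor and its tag; returns (tag, fd)."""
    data, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return data, fds[0]


def demo():
    # Assumption: a socketpair stands in for the VM <-> daemon connection.
    vm_end, daemon_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Assumption: a pipe stands in for an FD the VM wants preserved.
    r, w = os.pipe()

    send_fd(vm_end, w, b"vm0-fd")        # hypothetical tag
    tag, received = recv_fd(daemon_end)

    # The received descriptor refers to the same pipe as `w`.
    os.write(received, b"hello")
    payload = os.read(r, 5)

    for fd in (r, w, received):
        os.close(fd)
    vm_end.close()
    daemon_end.close()
    return tag, payload
```

Every preserved FD needs a round trip like this in each direction
(before and after kexec), plus some protocol to say which FD is which.
That is the dynamic cross-process communication referred to above.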
>
> >
> > I'm not sure that we can get rid of the machine-wide state machine
> > though, as there is some kernel state that will necessarily cross
> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> > need /dev/liveupdate for that.
>
> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> are more or less equally expressive/powerful. Most of the ioctl
> operations can be translated to a VFS operation and vice versa.
>
> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> both would create a live update session which auto closes when the FD is
> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> replaced with a fd_preserve file where you write() the FD number.
> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> replaced by a "state" file where you can read() or write() the state.
>
> I think the main benefit of the VFS-based interface is ease of use.
> There already exist a bunch of utilities and libraries that we can use to
> interact with files. When we have ioctls, we would need to write
> everything ourselves. For example, instead of
> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> easier to do.
>
> As for downsides, I think we might end up with a bit more boilerplate
> code, but beyond that I am not sure.

I agree we can more or less get to the same end state with either
approach. I also don't think we have to choose one or the other:
kexecfs is something that we can build on top of this series. For
example, kexecfs could be a new kernel subsystem that registers with
LUO.
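
As a rough sketch of the file-per-operation mapping quoted above (an
"fd_preserve" file that takes an FD number, and a "state" file that can
be read or written), userspace access might look like the following.
This is an assumption-laden illustration, not an existing interface:
kexecfs does not exist, the "prepared" state string is made up, and a
temporary directory of plain files stands in for a mounted instance, so
none of the kernel-side semantics are modelled.

```python
import os
import tempfile

# Hypothetical file names, taken from the mapping discussed in the thread.
FD_PRESERVE = "fd_preserve"
STATE = "state"


def preserve_fd(mount, fd):
    """Request preservation of `fd` by writing its number to fd_preserve."""
    with open(os.path.join(mount, FD_PRESERVE), "w") as f:
        f.write(f"{fd}\n")


def set_state(mount, state):
    """Write a state string to the state file (the write() side)."""
    with open(os.path.join(mount, STATE), "w") as f:
        f.write(state + "\n")


def get_state(mount):
    """Read the current state (the equivalent of "cat state")."""
    with open(os.path.join(mount, STATE)) as f:
        return f.read().strip()


def demo():
    # Assumption: a temp directory of regular files stands in for a mount.
    with tempfile.TemporaryDirectory() as mount:
        set_state(mount, "prepared")     # hypothetical state name
        preserve_fd(mount, 7)            # arbitrary FD number
        with open(os.path.join(mount, FD_PRESERVE)) as f:
            recorded = f.read().strip()
        return get_state(mount), recorded
```

With a real filesystem the kernel would interpret the written FD number
against the writer's FD table. Here the number is only stored, which is
enough to show how little tooling the file-based form needs.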