linux-kernel - Re: [RFC v2 10/16] luo: luo

Message-ID: <20250625-akrobatisch-libellen-352997eb08ef@brauner>
Date: Wed, 25 Jun 2025 11:36:40 +0200
From: Christian Brauner <brauner@...nel.org>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: pratyush@...nel.org, jasonmiu@...gle.com, graf@...zon.com, 
	changyuanl@...gle.com, rppt@...nel.org, dmatlack@...gle.com, rientjes@...gle.com, 
	corbet@....net, rdunlap@...radead.org, ilpo.jarvinen@...ux.intel.com, 
	kanie@...ux.alibaba.com, ojeda@...nel.org, aliceryhl@...gle.com, masahiroy@...nel.org, 
	akpm@...ux-foundation.org, tj@...nel.org, yoann.congal@...le.fr, mmaurer@...gle.com, 
	roman.gushchin@...ux.dev, chenridong@...wei.com, axboe@...nel.dk, mark.rutland@....com, 
	jannh@...gle.com, vincent.guittot@...aro.org, hannes@...xchg.org, 
	dan.j.williams@...el.com, david@...hat.com, joel.granados@...nel.org, rostedt@...dmis.org, 
	anna.schumaker@...cle.com, song@...nel.org, zhangguopeng@...inos.cn, linux@...ssschuh.net, 
	linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, linux-mm@...ck.org, 
	gregkh@...uxfoundation.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, 
	dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com, rafael@...nel.org, 
	dakr@...nel.org, bartosz.golaszewski@...aro.org, cw00.choi@...sung.com, 
	myungjoo.ham@...sung.com, yesanishhere@...il.com, Jonathan.Cameron@...wei.com, 
	quic_zijuhu@...cinc.com, aleksander.lobakin@...el.com, ira.weiny@...el.com, 
	andriy.shevchenko@...ux.intel.com, leon@...nel.org, lukas@...ner.de, bhelgaas@...gle.com, 
	wagi@...nel.org, djeffery@...hat.com, stuart.w.hayes@...il.com, ptyadav@...zon.de
Subject: Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface

> > I'm not sure why people are so in love with character device based apis.
> > It's terrible. It glues everything to devtmpfs which isn't namespacable
> > in any way. It's terrible to delegate and extremely restrictive in terms
> > of extensibility if you need additional device entries (aka the loop
> > driver folly).
> >
> > One stupid question: I probably have asked this before and just swapped
> > out that I a) asked this already and b) received an explanation. But why
> > isn't this a singleton simple in-memory filesystem with a flat
> > hierarchy?
> 
> Hi Christian,
> 
> Thank you for the detailed feedback and for raising this important

I don't know about detailed but no problem.

> design question. I appreciate the points you've made about the
> benefits of a filesystem-based API.
> 
> I have thought thoroughly about this and explored various alternatives
> before settling on the ioctl-based interface. This design isn't a
> sudden decision but is based on ongoing conversations that have been
> happening for over two years at LPC, as well as incorporating direct
> feedback I received on LUOv1 at LSF/MM.

Well, Mike mentioned that ultimately you want to interface this with
systemd? And we certainly have never been privy to any of these
uapi design conversations. Which is usually not a good sign...

> 
> The choice for an ioctl-based character device was ultimately driven
> by the specific lifecycle and dependency management requirements of
> the live update process. While a filesystem API offers great
> advantages in visibility and hierarchy, filesystems are not typically
> designed to be state machines with the complex lifecycle, dependency,
> and ownership tracking that LUO needs to manage.
> 
> Let me elaborate on the key aspects that led to the current design:
> 
> 1. session based lifecycle management: The preservation of an FD is
> tied to the open instance of /dev/liveupdate. If a userspace agent
> opens /dev/liveupdate, registers several FDs for preservation, and
> then crashes or exits before the prepare phase is triggered, all FDs
> it registered are automatically unregistered. This "session-scoped"
> behavior is crucial to prevent leaking preserved resources into the
> next kernel if the controlling process fails. This is naturally
> handled by the open() and release() file operations on a character
> device. It's not immediately obvious how a similar automatic,
> session-based cleanup would be implemented with a singleton
> filesystem.

fwiw

fd_context = fsopen("kexecfs")
fsconfig(fd_context, FSCONFIG_CMD_CREATE, ...)
fd_mnt = fsmount(fd_context, ...)

This gets you a private kexecfs instance that's never visible anywhere
in the filesystem hierarchy. When the fd is closed everything gets auto
cleaned up by the kernel. No need to umount or anything.
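
A slightly fuller sketch of that sequence in C, under a few assumptions:
the kernel provides the new mount API (Linux 5.2 or later), the caller
has CAP_SYS_ADMIN in the relevant user namespace, and raw syscall(2) is
used so nothing depends on glibc wrappers. "kexecfs" is only a proposal
in this thread, so the filesystem type is a parameter, and the helper
name private_fs_instance() is made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>   /* FSOPEN_CLOEXEC, FSCONFIG_CMD_CREATE, FSMOUNT_CLOEXEC */
#include <sys/syscall.h>   /* SYS_fsopen, SYS_fsconfig, SYS_fsmount */
#include <unistd.h>

/*
 * Hypothetical helper: create a detached instance of filesystem type
 * `fstype` and return an fd for its root, or -1 with errno set.
 * The mount is never attached to the filesystem hierarchy.
 */
int private_fs_instance(const char *fstype)
{
	int fd_context, fd_mnt, saved;

	/* Open a filesystem configuration context. */
	fd_context = (int)syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (fd_context < 0)
		return -1;

	/* Create the superblock from the (here: default) configuration. */
	if (syscall(SYS_fsconfig, fd_context, FSCONFIG_CMD_CREATE,
		    NULL, NULL, 0) < 0) {
		saved = errno;
		close(fd_context);
		errno = saved;
		return -1;
	}

	/* Turn the context into a detached mount. */
	fd_mnt = (int)syscall(SYS_fsmount, fd_context, FSMOUNT_CLOEXEC, 0);

	/* The context fd is no longer needed; preserve errno across close(). */
	saved = errno;
	close(fd_context);
	errno = saved;

	return fd_mnt;
}
```

With an existing type such as "tmpfs", the returned fd can be used as the
dirfd for openat() and friends, and closing it without ever calling
move_mount() dissolves the instance.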

> 2. state machine: LUO is fundamentally a state machine (NORMAL ->
> PREPARED -> FROZEN -> UPDATED -> NORMAL). As part of this, it provides
> a crucial guarantee: any resource that was successfully preserved but
> not explicitly reclaimed by userspace in the new kernel by the time
> the FINISH event is triggered will be automatically cleaned up and its
> memory released. This prevents leaks of unreclaimed resources and is
> managed by the orchestrator, which is a concept that doesn't map
> cleanly onto standard VFS semantics.

I'm not following this. See above. And also any umount can trivially
just destroy whatever resource is still left in the filesystem.

> 
> 3. dependency tracking: Unlike normal files, preserved resources for
> live update have strong, often complex interdependencies. For example,
> a kvmfd might depend on a guestmemfd; an iommufd can depend on vfiofd,
> eventfd, memfd, and kvmfd. LUO's current design provides explicit
> callback points (prepare, freeze) where these dependencies can be
> validated and tracked by the participating subsystems. If a dependency
> is not met when we are about to freeze, we can fail the entire
> operation and return an error to userspace. The cancel callback
> further allows this complex dependency graph to be unwound safely. A
> filesystem interface based on linkat() or unlink() doesn't inherently
> provide these critical, ordered points for dependency verification and
> rollback.
> 
> While I agree that a filesystem offers superior introspection and
> integration with standard tools, building this complex, stateful
> orchestration logic on top of VFS seemed to be forcing a square peg
> into a round hole. The ioctl interface, while more opaque, provides a
> direct and explicit way to command the state machine and manage these
> complex lifecycle and dependency rules.

I'm not going to argue that you have to switch to this kexecfs idea
but...

You're using a character device that's tied to devtmpfs. In other words,
you're already using a filesystem interface. Literally the whole code
here is built on top of filesystem APIs. So this argument is just very
wrong imho. If you can build it on top of a character device using VFS
interfaces you can do it as a minimal filesystem.

You're free to define the filesystem interface any way you like it. We
have a ton of examples there. All your ioctls would just be tied to the
filesystem instance instead of the /dev/somethingsomething character
device. The state machine could just be implemented the same way.
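
To illustrate that the state machine itself doesn't care which kind of
file the commands arrive on, here is a toy userspace model of the
transitions quoted above (NORMAL -> PREPARED -> FROZEN -> UPDATED ->
NORMAL). All names are hypothetical and not taken from the patch series;
the KEXEC event and the cancel edges back to NORMAL are assumptions made
for the sake of the example.

```c
#include <stdbool.h>

/* Hypothetical model, not the LUO implementation. */
enum model_state {
	MODEL_NORMAL,
	MODEL_PREPARED,
	MODEL_FROZEN,
	MODEL_UPDATED,
};

enum model_event {
	MODEL_EV_PREPARE,
	MODEL_EV_FREEZE,
	MODEL_EV_KEXEC,   /* assumption: the reboot into the new kernel */
	MODEL_EV_FINISH,
	MODEL_EV_CANCEL,  /* assumption: allowed from PREPARED and FROZEN */
};

/*
 * Apply one event. Returns true and updates *state if the transition is
 * valid; returns false and leaves *state untouched otherwise. The same
 * function could sit behind an ioctl on a character device or behind an
 * ioctl on a file of a filesystem instance.
 */
bool model_step(enum model_state *state, enum model_event ev)
{
	switch (*state) {
	case MODEL_NORMAL:
		if (ev == MODEL_EV_PREPARE) { *state = MODEL_PREPARED; return true; }
		break;
	case MODEL_PREPARED:
		if (ev == MODEL_EV_FREEZE) { *state = MODEL_FROZEN; return true; }
		if (ev == MODEL_EV_CANCEL) { *state = MODEL_NORMAL; return true; }
		break;
	case MODEL_FROZEN:
		if (ev == MODEL_EV_KEXEC)  { *state = MODEL_UPDATED; return true; }
		if (ev == MODEL_EV_CANCEL) { *state = MODEL_NORMAL; return true; }
		break;
	case MODEL_UPDATED:
		if (ev == MODEL_EV_FINISH) { *state = MODEL_NORMAL; return true; }
		break;
	}
	return false;
}
```

Nothing in model_step() depends on how the event was delivered, which is
the point: the entry point changes, the transition table doesn't.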

One of my points is that with an fs interface you can have easy state
serialization on a per-service level. IOW, you have a bunch of virtual
machines running as services or some networking services or whatever.
You could just bind-mount an instance of kexecfs into the service and
the service can persist state into the instance and easily recover it
after kexec.

But anyway, you seem to be set on the ioctl() interface, fine.