linux-kernel - Re: [RFC] Another take at restarting FUSE servers

Message-ID: <CAOQ4uxjZ0B5TwV+HiWsUpBuFuZJZ_e4Bm_QfNn4crDoVAfkA9Q@mail.gmail.com>
Date: Wed, 5 Nov 2025 11:21:14 +0100
From: Amir Goldstein <amir73il@...il.com>
To: Luis Henriques <luis@...lia.com>
Cc: "Darrick J. Wong" <djwong@...nel.org>, Bernd Schubert <bernd@...ernd.com>, "Theodore Ts'o" <tytso@....edu>, 
	Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bschubert@....com>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Kevin Chen <kchen@....com>
Subject: Re: [RFC] Another take at restarting FUSE servers

On Tue, Nov 4, 2025 at 3:52 PM Luis Henriques <luis@...lia.com> wrote:
>
> On Tue, Nov 04 2025, Amir Goldstein wrote:
>
> > On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@...lia.com> wrote:
> >>
> >> On Tue, Sep 16 2025, Amir Goldstein wrote:
> >>
> >> > On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@...nel.org> wrote:
> >> >>
> >> >> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
> >> >> > On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@...ernd.com> wrote:
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On 9/15/25 09:07, Amir Goldstein wrote:
> >> >> > > > On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >> >> > > >>
> >> >> > > >> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >> >> > > >>>
> >> >> > > >>>
> >> >> > > >>> On 9/12/25 13:41, Amir Goldstein wrote:
> >> >> > > >>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@...ernd.com> wrote:
> >> >> > > >>>>>
> >> >> > > >>>>>
> >> >> > > >>>>>
> >> >> > > >>>>> On 8/1/25 12:15, Luis Henriques wrote:
> >> >> > > >>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> >> >> > > >>>>>>
> >> >> > > >>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> >> >> > > >>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> >> >> > > >>>>>>>>>
> >> >> > > >>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> >> >> > > >>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
> >> >> > > >>>>>>>>> to clear the condition that caused the failure in the first place, but I
> >> >> > > >>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
> >> >> > > >>>>>>>>> aren't totally crazy.
> >> >> > > >>>>>>>>
> >> >> > > >>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
> >> >> > > >>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
> >> >> > > >>>>>>>> is supposed to happen with respect to open files, metadata and data
> >> >> > > >>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
> >> >> > > >>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
> >> >> > > >>>>>>>> potentially to be out of sync, right?
> >> >> > > >>>>>>>>
> >> >> > > >>>>>>>> What are the recovery semantics that we hope to be able to provide?
> >> >> > > >>>>>>>
> >> >> > > >>>>>>> <echoing what we said on the ext4 call this morning>
> >> >> > > >>>>>>>
> >> >> > > >>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
> >> >> > > >>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> >> >> > > >>>>>>> would initiate GETATTR requests on all the cached inodes to validate
> >> >> > > >>>>>>> that they still exist; and then resend all the unacknowledged requests
> >> >> > > >>>>>>> that were pending at the time.  It might be the case that you have to do
> >> >> > > >>>>>>> that in the reverse order; I only know enough about the design of fuse
> >> >> > > >>>>>>> to suspect that to be true.
> >> >> > > >>>>>>>
> >> >> > > >>>>>>> Anyhow once those are complete, I think we can resume operations with
> >> >> > > >>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
> >> >> > > >>>>>>> fuse_make_bad'd, which effectively revokes them.
> >> >> > > >>>>>>
> >> >> > > >>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> >> >> > > >>>>>> but probably GETATTR is a better option.
> >> >> > > >>>>>>
> >> >> > > >>>>>> So, are you currently working on any of this?  Are you implementing this
> >> >> > > >>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
> >> >> > > >>>>>> look at fuse2fs too.
> >> >> > > >>>>>
> >> >> > > >>>>> Sorry for joining the discussion late, I was totally occupied, day and
> >> >> > > >>>>> night. Added Kevin to CC, who is going to work on recovery on our
> >> >> > > >>>>> DDN side.
> >> >> > > >>>>>
> >> >> > > >>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
> >> >> > > >>>>> server restart we want kernel to recover inodes and their lookup count.
> >> >> > > >>>>> Now inode recovery might be hard, because we currently only have a
> >> >> > > >>>>> 64-bit node-id - which is used by most fuse applications as a memory
> >> >> > > >>>>> pointer.
> >> >> > > >>>>>
> >> >> > > >>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
> >> >> > > >>>>> outstanding requests. And that ends up in most cases in sending requests
> >> >> > > >>>>> with invalid node-IDs, that are cast and might provoke random memory
> >> >> > > >>>>> access on restart. Kind of the same issue why fuse nfs export or
> >> >> > > >>>>> open_by_handle_at doesn't work well right now.
> >> >> > > >>>>>
> >> >> > > >>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> >> >> > > >>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
> >> >> > > >>>>> And then FUSE_REVALIDATE_FH on server restart.
> >> >> > > >>>>> The file handles could be stored into the fuse inode and also used for
> >> >> > > >>>>> NFS export.
> >> >> > > >>>>>
> >> >> > > >>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
> >> >> > > >>>>> Adding Amir to CC.
> >> >> > > >>>>
> >> >> > > >>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> >> >> > > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >> >> > > >>>
> >> >> > > >>> Thanks for the reference Amir! I even had been in that thread.
> >> >> > > >>>
> >> >> > > >>>>
> >> >> > > >>>>>
> >> >> > > >>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> >> >> > > >>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
> >> >> > > >>>>> Any objections against that?
> >> >> > > >>
> >> >> > > >> What if you actually /can/ reuse a nodeid after a restart?  Consider
> >> >> > > >> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
> >> >> > > >> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> >> >> > > >> didn't delete it, obviously.
> >> >> > > >
> >> >> > > > FUSE_LOOKUP_HANDLE is a contract.
> >> >> > > > If fuse4fs can reuse nodeid after restart then by all means, it should sign
> >> >> > > > this contract, otherwise there is no way for client to know that the
> >> >> > > > nodeids are persistent.
> >> >> > > > If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> >> >> > > > API trivial.
> >> >> > > >
> >> >> > > >>
> >> >> > > >> I suppose you could just ask for refreshed stat information and either
> >> >> > > >> the server gives it to you and the fuse_inode lives; or the server
> >> >> > > >> returns ENOENT and then we mark it bad.  But I'd have to see code
> >> >> > > >> patches to form a real opinion.
> >> >> > > >>
> >> >> > > >
> >> >> > > > You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> >> >> > > > where fuse_instance_id can be its start time or random number.
> >> >> > > > for auto invalidate, or maybe the fuse_instance_id should be
> >> >> > > > a native part of FUSE protocol so that client knows to only invalidate
> >> >> > > > attr cache in case of fuse_instance_id change?
> >> >> > > >
> >> >> > > > In any case, instead of a storm of revalidate messages after
> >> >> > > > server restart, do it lazily on demand.
> >> >> > >
> >> >> > > For a network file system, probably. For fuse4fs or other block
> >> >> > > based file systems, not sure. Darrick has the example of fsck.
> >> >> > > Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
> >> >> > > fuse-server gets restarted, fsck'ed and some files get removed.
> >> >> > > Now reading these inodes would still work - wouldn't it
> >> >> > > be better to invalidate the cache before going into operation
> >> >> > > again?
> >> >> >
> >> >> > Forgive me, I was making a wrong assumption that fuse4fs
> >> >> > was using ext4 filehandle as nodeid, but of course it does not.
> >> >>
> >> >> Well now that you mention it, there /is/ a risk of shenanigans like
> >> >> that.  Consider:
> >> >>
> >> >> 1) fuse4fs mount an ext4 filesystem
> >> >> 2) crash the fuse4fs server
> >> >> <fuse4fs server restart stalls...>
> >> >> 3) e2fsck -fy /dev/XXX deletes inode 17
> >> >> 4) someone else mounts the fs, makes some changes that result in 17
> >> >>    being reallocated, user says "OOOOOPS", unmounts it
> >> >> 5) fuse4fs server finally restarts, and reconnects to the kernel
> >> >>
> >> >> Hey, inode 17 is now a different file!!
> >> >>
> >> >> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
> >> >> everything's (potentially) fine because fuse4fs supplied i_generation to
> >> >> the kernel, and fuse_stale_inode will mark it bad if that happens.
> >> >>
> >> >> Hm ok then, at least there's a way out. :)
> >> >>
> >> >
> >> > Right.
> >> >
> >> >> > The reason I made this wrong assumption is because fuse4fs *can*
> >> >> > already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
> >> >> > which is what my fuse passthrough library [1] does.
> >> >> >
> >> >> > My claim was that although fuse4fs could support safe restart, which
> >> >> > cannot read from recycled inode number with current FUSE protocol,
> >> >> > doing so with FUSE_HANDLE protocol would express a commitment
> >> >>
> >> >> Pardon my naïvete, but what is FUSE_HANDLE?
> >> >>
> >> >> $ git grep -w FUSE_HANDLE fs
> >> >> $
> >> >
> >> > Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
> >> > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >> >
> >> > Which means to communicate a variable sized "nodeid"
> >> > which can also be declared as an object id that survives server restart.
> >> >
> >> > Basically, the reason that I brought up LOOKUP_HANDLE is to
> >> > properly support NFS export of fuse filesystems.
> >> >
> >> > My incentive was to support a proper fuse server restart/remount/re-export
> >> > with the same fsid in /etc/exports, but this gives us a better starting point
> >> > for fuse server restart/re-connect.
> >>
> >> Sorry for resurrecting (again!) this discussion.  I've been thinking about
> >> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
> >> However, I feel there are other operations that will need to return this
> >> new handle.
> >>
> >> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
> >> Doesn't this mean that, if the user-space server supports the new
> >> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
> >> request?
> >
> > Yes, I think that's what it means.
>
> Awesome, thank you for confirming this.
>
> >> The same question applies for TMPFILE, LINK, etc.  Or is there
> >> something special about the LOOKUP operation that I'm missing?
> >>
> >
> > Any command returning fuse_entry_out.
> >
> > READDIRPLUS, MKNOD, MKDIR, SYMLINK
>
> Right, I had this list, but totally missed READDIRPLUS.
>
> > fuse_entry_out was extended once and fuse_reply_entry()
> > sends the size of the struct.
>
> So, if I'm understanding you correctly, you're suggesting to extend
> fuse_entry_out to add the new handle (a 'size' field + the actual handle).

Well it depends...

There are several ways to do it.
I would really like to get Miklos and Bernd's opinion on the preferred way.

So far, it looks like the client determines the size of the output args.

If we want the server to be able to write a different file handle size
per inode that's going to be a bigger challenge.

I think it's plenty enough if server and client negotiate a max file handle
size and then the client always reserves enough space in the output
args buffer.

One more thing to ask is what is "the actual handle".
If "the actual handle" is the variable sized struct file_handle then
the size is already available in the file handle header.
If it is not, then I think some sort of type or version of the file handle
encoding should be negotiated beyond the max handle size.
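
To make the negotiated-max idea concrete, here is a rough sketch in C. It is
not taken from any posted patch, and every name in it (the sketch_* functions,
struct sketch_handle_hdr, SKETCH_HANDLE_HARD_MAX) is hypothetical. It assumes
the handle is appended after the fixed-size entry reply with a small header
carrying its length and encoding type, and that both sides agreed on a max
handle size at init time. The 128-byte cap only mirrors the figure mentioned
earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cap; not a constant of the current FUSE protocol. */
#define SKETCH_HANDLE_HARD_MAX 128u

/* Hypothetical header placed in front of the handle bytes. */
struct sketch_handle_hdr {
    uint32_t len;   /* number of handle bytes that follow */
    uint32_t type;  /* encoding type/version agreed at init time */
};

/* Both sides propose a max; the smaller one wins, capped at the hard max. */
static uint32_t sketch_negotiate_max(uint32_t client_max, uint32_t server_max)
{
    uint32_t m = client_max < server_max ? client_max : server_max;
    return m < SKETCH_HANDLE_HARD_MAX ? m : SKETCH_HANDLE_HARD_MAX;
}

/* Space the client reserves for one reply: entry + header + max handle. */
static size_t sketch_outarg_size(size_t entry_size, uint32_t negotiated_max)
{
    assert(negotiated_max <= SKETCH_HANDLE_HARD_MAX);
    return entry_size + sizeof(struct sketch_handle_hdr) + negotiated_max;
}

/*
 * Check a reply of reply_len bytes: the handle header must be present,
 * the handle must fit the negotiated max, and the reply length must
 * match exactly. Returns 0 when the reply is well formed, -1 otherwise.
 */
static int sketch_check_reply(const void *buf, size_t reply_len,
                              size_t entry_size, uint32_t negotiated_max)
{
    struct sketch_handle_hdr hdr;

    if (reply_len < entry_size + sizeof(hdr))
        return -1;
    memcpy(&hdr, (const unsigned char *)buf + entry_size, sizeof(hdr));
    if (hdr.len > negotiated_max)
        return -1;
    if (reply_len != entry_size + sizeof(hdr) + hdr.len)
        return -1;
    return 0;
}
```

With this layout the server may still return handles of different lengths per
inode, as long as none exceeds the negotiated max, while the client never has
to size its buffer per request.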

> That's probably a good idea.  I was working towards having the
> LOOKUP_HANDLE to be similar to LOOKUP, but extending it so that it would
> include:
>
>  - An extra inarg: the parent directory handle.  (To be honest, I'm not
>    really sure this would be needed.)

Yes, I think you need extra inarg.
Why would it not be needed?
The problem is that you cannot know if the parent node id in the lookup
command is stale after server restart.

The thing is that the kernel fuse inode will need to store the file handle,
much the same as an NFS client stores the file handle provided by the
NFS server.

FYI, fanotify has an optimized way to store file handles in
struct fanotify_fid_event - small file handles are stored inline
and larger file handles can use an external buffer.

But fuse does not need to support file handles of arbitrary size.
For a first version we could definitely simplify things by limiting the size
of supported file handles, because server and client need to negotiate
the max file handle size anyway.
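
For illustration only, the inline-versus-external storage idea could look
roughly like the userspace C sketch below. It is not the fanotify code and not
a proposed fuse_inode layout; struct sketch_fh, the sketch_fh_* helpers and
both size limits are made-up names and values. Small handles live inside the
structure, larger ones (up to an assumed negotiated max) go to a separately
allocated buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_INLINE_LEN 16u   /* hypothetical inline capacity */
#define SKETCH_MAX_LEN    128u  /* hypothetical negotiated max */

struct sketch_fh {
    uint32_t len;                              /* handle length in bytes */
    union {
        unsigned char inl[SKETCH_INLINE_LEN];  /* used when len <= inline cap */
        unsigned char *ext;                    /* heap buffer otherwise */
    } u;
};

/* Store a copy of the handle; returns 0 on success, -1 on error. */
static int sketch_fh_set(struct sketch_fh *fh, const void *data, uint32_t len)
{
    if (len > SKETCH_MAX_LEN)
        return -1;                  /* larger than what was negotiated */
    if (len > SKETCH_INLINE_LEN) {
        unsigned char *p = malloc(len);
        if (!p)
            return -1;
        memcpy(p, data, len);
        fh->u.ext = p;
    } else if (len) {
        memcpy(fh->u.inl, data, len);
    }
    fh->len = len;
    return 0;
}

/* Return a pointer to the handle bytes, wherever they are stored. */
static const unsigned char *sketch_fh_buf(const struct sketch_fh *fh)
{
    assert(fh->len <= SKETCH_MAX_LEN);
    return fh->len > SKETCH_INLINE_LEN ? fh->u.ext : fh->u.inl;
}

/* Release the external buffer, if any, and reset the handle. */
static void sketch_fh_free(struct sketch_fh *fh)
{
    if (fh->len > SKETCH_INLINE_LEN)
        free(fh->u.ext);
    fh->len = 0;
}
```

If the negotiated max is kept small enough, the external-buffer branch can be
dropped entirely and the handle stored inline, which is the simplification
suggested above for a first version.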

>  - An extra outarg: for the actual handle.
>
> With your suggestion, only the extra inarg would be required.
>

Yes, either extra arg or just an extended size of fuse_entry_out
negotiated at init time.

TBH it seems cleaner to add 2nd outarg to all the commands,
but CREATE already has a 2nd arg and 2nd arg does not solve
READDIRPLUS.

> > However fuse_reply_create() sends it with fuse_open_out
> > appended
>
> This one should be fine...
>
> > and fuse_add_direntry_plus() does not seem to write
> > record size at all, so server and client will need to agree on the
> > size of fuse_entry_out and this would need to be backward compat.
> > If both server and client declare support for FUSE_LOOKUP_HANDLE
> > it should be fine (?).
>
> ... yeah, this could be a bit trickier.  But I'll need to go look into it.
>
> Thanks a lot for your comments, Amir.  I was trying to get an RFC out
> soon(ish) to get early feedback, hoping to prevent me following wrong
> paths.
>

Disclaimer, following my advice may well lead you down wrong paths..
Best to wait for confirmation from Miklos and Bernd if you want to have
more certainty...

Thanks,
Amir.