linux-kernel - Re: [RFC] Another take at restarting FUSE servers

Message-ID: <d57bcfc5-fc3d-4635-ab46-0b9038fb7039@ddn.com>
Date: Wed, 5 Nov 2025 23:48:21 +0100
From: Bernd Schubert <bschubert@....com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Amir Goldstein <amir73il@...il.com>, Luis Henriques <luis@...lia.com>,
 Bernd Schubert <bernd@...ernd.com>, Theodore Ts'o <tytso@....edu>,
 Miklos Szeredi <miklos@...redi.hu>, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, Kevin Chen <kchen@....com>
Subject: Re: [RFC] Another take at restarting FUSE servers



On 11/5/25 23:42, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 11:24:01PM +0100, Bernd Schubert wrote:
>>
>>
>> On 11/4/25 14:10, Amir Goldstein wrote:
>>> On Tue, Nov 4, 2025 at 12:40 PM Luis Henriques <luis@...lia.com> wrote:
>>>>
>>>> On Tue, Sep 16 2025, Amir Goldstein wrote:
>>>>
>>>>> On Tue, Sep 16, 2025 at 4:53 AM Darrick J. Wong <djwong@...nel.org> wrote:
>>>>>>
>>>>>> On Mon, Sep 15, 2025 at 10:41:31AM +0200, Amir Goldstein wrote:
>>>>>>> On Mon, Sep 15, 2025 at 10:27 AM Bernd Schubert <bernd@...ernd.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/15/25 09:07, Amir Goldstein wrote:
>>>>>>>>> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@...nel.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 9/12/25 13:41, Amir Goldstein wrote:
>>>>>>>>>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@...ernd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>>>>>>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>>>>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>>>>>>>>>>> could restart itself.  It's unclear if doing so will actually enable us
>>>>>>>>>>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>>>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand.  So maybe restarts
>>>>>>>>>>>>>>>>> aren't totally crazy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm trying to understand what the failure scenario is here.  Is this
>>>>>>>>>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed?  If so, what
>>>>>>>>>>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>>>>>>>>>>> modifications which were in transit, etc.?  Sure, fuse2fs could run
>>>>>>>>>>>>>>>> e2fsck -fy, but if there are dirty inodes on the system, that's going
>>>>>>>>>>>>>>>> potentially to be out of sync, right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>>>>>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>>>>>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>>>>>>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>>>>>>>>>>> that were pending at the time.  It might be the case that you have to
>>>>>>>>>>>>>>> do that in the reverse order; I only know enough about the design of fuse
>>>>>>>>>>>>>>> to suspect that to be true.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>>>>>>>>>>> the surviving inodes.  The ones that fail the GETATTR revalidation are
>>>>>>>>>>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>>>>>>>>>>> but probably GETATTR is a better option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, are you currently working on any of this?  Are you implementing this
>>>>>>>>>>>>>> new NOTIFY_RESTARTED request?  I guess it's time for me to have a closer
>>>>>>>>>>>>>> look at fuse2fs too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>>>>>>>>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>>>>>>>>>>> DDN side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>>>>>>>>>>> server restart we want kernel to recover inodes and their lookup count.
>>>>>>>>>>>>> Now inode recovery might be hard, because we currently only have a
>>>>>>>>>>>>> 64-bit node-id - which is used by most fuse applications as a memory
>>>>>>>>>>>>> pointer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>>>>>>>>>>> outstanding requests. And that ends up in most cases in sending requests
>>>>>>>>>>>>> with invalid node-IDs, that are cast and might provoke random memory
>>>>>>>>>>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>>>>>>>>>>> open_by_handle_at doesn't work well right now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>>>>>>>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>>>>>>>>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>>>>>>>>>>> The file handles could be stored into the fuse inode and also used for
>>>>>>>>>>>>> NFS export.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>>>>>>>>>>> Adding Amir to CC.
>>>>>>>>>>>>
>>>>>>>>>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>>>>>>>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reference Amir! I even had been in that thread.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>>>>>>>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>>>>>>>>>>> Any objections against that?
>>>>>>>>>>
>>>>>>>>>> What if you actually /can/ reuse a nodeid after a restart?  Consider
>>>>>>>>>> fuse4fs, where the nodeid is the on-disk inode number.  After a restart,
>>>>>>>>>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>>>>>>>>>> didn't delete it, obviously.
>>>>>>>>>
>>>>>>>>> FUSE_LOOKUP_HANDLE is a contract.
>>>>>>>>> If fuse4fs can reuse nodeid after restart then by all means, it should sign
>>>>>>>>> this contract, otherwise there is no way for client to know that the
>>>>>>>>> nodeids are persistent.
>>>>>>>>> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
>>>>>>>>> API trivial.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I suppose you could just ask for refreshed stat information and either
>>>>>>>>>> the server gives it to you and the fuse_inode lives; or the server
>>>>>>>>>> returns ENOENT and then we mark it bad.  But I'd have to see code
>>>>>>>>>> patches to form a real opinion.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
>>>>>>>>> where fuse_instance_id can be its start time or random number.
>>>>>>>>> for auto invalidate, or maybe the fuse_instance_id should be
>>>>>>>>> a native part of FUSE protocol so that client knows to only invalidate
>>>>>>>>> attr cache in case of fuse_instance_id change?
>>>>>>>>>
>>>>>>>>> In any case, instead of a storm of revalidate messages after
>>>>>>>>> server restart, do it lazily on demand.
>>>>>>>>
>>>>>>>> For a network file system, probably. For fuse4fs or other block
>>>>>>>> based file systems, not sure. Darrick has the example of fsck.
>>>>>>>> Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
>>>>>>>> fuse-server gets restarted, fsck'ed and some files get removed.
>>>>>>>> Now reading these inodes would still work - wouldn't it
>>>>>>>> be better to invalidate the cache before going into operation
>>>>>>>> again?
>>>>>>>
>>>>>>> Forgive me, I was making a wrong assumption that fuse4fs
>>>>>>> was using ext4 filehandle as nodeid, but of course it does not.
>>>>>>
>>>>>> Well now that you mention it, there /is/ a risk of shenanigans like
>>>>>> that.  Consider:
>>>>>>
>>>>>> 1) fuse4fs mount an ext4 filesystem
>>>>>> 2) crash the fuse4fs server
>>>>>> <fuse4fs server restart stalls...>
>>>>>> 3) e2fsck -fy /dev/XXX deletes inode 17
>>>>>> 4) someone else mounts the fs, makes some changes that result in 17
>>>>>>    being reallocated, user says "OOOOOPS", unmounts it
>>>>>> 5) fuse4fs server finally restarts, and reconnects to the kernel
>>>>>>
>>>>>> Hey, inode 17 is now a different file!!
>>>>>>
>>>>>> So maybe the nodeid has to be an actual file handle.  Oh wait, no,
>>>>>> everything's (potentially) fine because fuse4fs supplied i_generation to
>>>>>> the kernel, and fuse_stale_inode will mark it bad if that happens.
>>>>>>
>>>>>> Hm ok then, at least there's a way out. :)
>>>>>>
>>>>>
>>>>> Right.
>>>>>
>>>>>>> The reason I made this wrong assumption is because fuse4fs *can*
>>>>>>> already use ext4 (64bit) file handle as nodeid, with existing FUSE protocol
>>>>>>> which is what my fuse passthrough library [1] does.
>>>>>>>
>>>>>>> My claim was that although fuse4fs could support safe restart, which
>>>>>>> cannot read from recycled inode number with current FUSE protocol,
>>>>>>> doing so with FUSE_HANDLE protocol would express a commitment
>>>>>>
>>>>>> Pardon my naïvete, but what is FUSE_HANDLE?
>>>>>>
>>>>>> $ git grep -w FUSE_HANDLE fs
>>>>>> $
>>>>>
>>>>> Sorry, braino. I meant LOOKUP_HANDLE (or FUSE_LOOKUP_HANDLE):
>>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>>>
>>>>> Which means to communicate a variable sized "nodeid"
>>>>> which can also be declared as an object id that survives server restart.
>>>>>
>>>>> Basically, the reason that I brought up LOOKUP_HANDLE is to
>>>>> properly support NFS export of fuse filesystems.
>>>>>
>>>>> My incentive was to support a proper fuse server restart/remount/re-export
>>>>> with the same fsid in /etc/exports, but this gives us a better starting point
>>>>> for fuse server restart/re-connect.
>>>>
>>>> Sorry for resurrecting (again!) this discussion.  I've been thinking about
>>>> this, and trying to get some initial RFC for this LOOKUP_HANDLE operation.
>>>> However, I feel there are other operations that will need to return this
>>>> new handle.
>>>>
>>>> For example, the FUSE_CREATE (for atomic_open) also returns a nodeid.
>>>> Doesn't this mean that, if the user-space server supports the new
>>>> LOOKUP_HANDLE, it should also return a handle in reply to the CREATE
>>>> request?
>>>
>>> Yes, I think that's what it means.
>>>
>>>> The same question applies for TMPFILE, LINK, etc.  Or is there
>>>> something special about the LOOKUP operation that I'm missing?
>>>>
>>>
>>> Any command returning fuse_entry_out.
>>>
>>> READDIRPLUS, MKNOD, MKDIR, SYMLINK
>>
>> Btw, check out <libfuse>/doc/libfuse-operations.txt for these
>> things. Use it with double checking, though; the file was mostly created
>> by AI (just added a correction today). With that it is easy to see the
>> missing FUSE_TMPFILE.
>>
>>
>>>
>>> fuse_entry_out was extended once and fuse_reply_entry()
>>> sends the size of the struct.
>>
>> Sorry, I'm confused. Where does fuse_reply_entry() send the size?
>>
>>> However fuse_reply_create() sends it with fuse_open_out
>>> appended and fuse_add_direntry_plus() does not seem to write
>>> record size at all, so server and client will need to agree on the
>>> size of fuse_entry_out and this would need to be backward compat.
>>> If both server and client declare support for FUSE_LOOKUP_HANDLE
>>> it should be fine (?).
>>
>> If max_handle size becomes a value in fuse_init_out, server and
>> client would use it? I think appended fuse_open_out could just
>> follow the dynamic actual size of the handle - code that
>> serializes/deserializes the response has to look up the actual
>> handle size then. For example I wouldn't know what to put in
>> for any of the example/passthrough* file systems as handle size - 
>> would need to be 128B, but the actual size will be typically
>> much smaller.
> 
> name_to_handle_at ?
> 
> I guess the problem here is that technically speaking filesystems could
> have variable sized handles depending on the file.  Sometimes you encode
> just the ino/gen of the child file, but other times you might know the
> parent and put that in the handle too.

Yeah, I don't think it would be reliable for *all* file systems to use
name_to_handle_at on startup on some example file/directory. At least
not without knowing all the details of the underlying passthrough file
system.
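
For illustration only, a minimal sketch of what such a startup probe
could look like. probe_handle_size() is a hypothetical helper and not
part of libfuse. It relies on the documented name_to_handle_at(2)
behaviour that a call with handle_bytes == 0 fails with EOVERFLOW and
reports the required size. The value describes the probed file only,
which is why it cannot be taken as the handle size of the whole file
system.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: ask the kernel how many handle bytes the
 * underlying file system needs for 'path'.
 *
 * Returns the reported size in bytes, or -1 if no size could be
 * obtained (errno is left as set by name_to_handle_at()).
 */
int probe_handle_size(const char *path)
{
	/* handle_bytes == 0 asks only for the required size */
	struct file_handle probe = { .handle_bytes = 0 };
	int mount_id;

	if (name_to_handle_at(AT_FDCWD, path, &probe, &mount_id, 0) == 0)
		return (int)probe.handle_bytes;	/* succeeded with an empty handle */

	/* EOVERFLOW is the expected outcome, handle_bytes now holds the size */
	if (errno != EOVERFLOW)
		return -1;	/* e.g. ENOENT, EOPNOTSUPP */

	return (int)probe.handle_bytes;
}
```

Calling probe_handle_size(".") on the passthrough source directory
would give the size for that one directory. Another file on the same
file system may report a different size, so MAX_HANDLE_SZ (128) stays
the only safe upper bound.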


Thanks,
Bernd