[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2e1db15f-b2b1-487f-9f42-44dc7480b2e2@bsbernd.com>
Date: Mon, 15 Sep 2025 10:27:04 +0200
From: Bernd Schubert <bernd@...ernd.com>
To: Amir Goldstein <amir73il@...il.com>, "Darrick J. Wong" <djwong@...nel.org>
Cc: Luis Henriques <luis@...lia.com>, Theodore Ts'o <tytso@....edu>,
Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bschubert@....com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Kevin Chen <kchen@....com>
Subject: Re: [RFC] Another take at restarting FUSE servers
On 9/15/25 09:07, Amir Goldstein wrote:
> On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@...nel.org> wrote:
>>
>> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
>>>
>>>
>>> On 9/12/25 13:41, Amir Goldstein wrote:
>>>> On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@...ernd.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 8/1/25 12:15, Luis Henriques wrote:
>>>>>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
>>>>>>
>>>>>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
>>>>>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
>>>>>>>>>
>>>>>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
>>>>>>>>> could restart itself. It's unclear if doing so will actually enable us
>>>>>>>>> to clear the condition that caused the failure in the first place, but I
>>>>>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts
>>>>>>>>> aren't totally crazy.
>>>>>>>>
>>>>>>>> I'm trying to understand what the failure scenario is here. Is this
>>>>>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what
>>>>>>>> is supposed to happen with respect to open files, metadata and data
>>>>>>>> modifications which were in transit, etc.? Sure, fuse2fs could run
>>>>>>>> e2fsck -fy, but if there are dirty inode on the system, that's going
>>>>>>>> potentally to be out of sync, right?
>>>>>>>>
>>>>>>>> What are the recovery semantics that we hope to be able to provide?
>>>>>>>
>>>>>>> <echoing what we said on the ext4 call this morning>
>>>>>>>
>>>>>>> With iomap, most of the dirty state is in the kernel, so I think the new
>>>>>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
>>>>>>> would initiate GETATTR requests on all the cached inodes to validate
>>>>>>> that they still exist; and then resend all the unacknowledged requests
>>>>>>> that were pending at the time. It might be the case that you have to
>>>>>>> that in the reverse order; I only know enough about the design of fuse
>>>>>>> to suspect that to be true.
>>>>>>>
>>>>>>> Anyhow once those are complete, I think we can resume operations with
>>>>>>> the surviving inodes. The ones that fail the GETATTR revalidation are
>>>>>>> fuse_make_bad'd, which effectively revokes them.
>>>>>>
>>>>>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
>>>>>> but probably GETATTR is a better option.
>>>>>>
>>>>>> So, are you currently working on any of this? Are you implementing this
>>>>>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer
>>>>>> look at fuse2fs too.
>>>>>
>>>>> Sorry for joining the discussion late, I was totally occupied, day and
>>>>> night. Added Kevin to CC, who is going to work on recovery on our
>>>>> DDN side.
>>>>>
>>>>> Issue with GETATTR and LOOKUP is that they need a path, but on fuse
>>>>> server restart we want kernel to recover inodes and their lookup count.
>>>>> Now inode recovery might be hard, because we currently only have a
>>>>> 64-bit node-id - which is used my most fuse application as memory
>>>>> pointer.
>>>>>
>>>>> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just re-sends
>>>>> outstanding requests. And that ends up in most cases in sending requests
>>>>> with invalid node-IDs, that are casted and might provoke random memory
>>>>> access on restart. Kind of the same issue why fuse nfs export or
>>>>> open_by_handle_at doesn't work well right now.
>>>>>
>>>>> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
>>>>> would not return a 64-bit node ID, but a max 128 byte file handle.
>>>>> And then FUSE_REVALIDATE_FH on server restart.
>>>>> The file handles could be stored into the fuse inode and also used for
>>>>> NFS export.
>>>>>
>>>>> I *think* Amir had a similar idea, but I don't find the link quickly.
>>>>> Adding Amir to CC.
>>>>
>>>> Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
>>>
>>> Thanks for the reference Amir! I even had been in that thread.
>>>
>>>>
>>>>>
>>>>> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
>>>>> will iterate over all superblock inodes and mark them with fuse_make_bad.
>>>>> Any objections against that?
>>
>> What if you actually /can/ reuse a nodeid after a restart? Consider
>> fuse4fs, where the nodeid is the on-disk inode number. After a restart,
>> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
>> didn't delete it, obviously.
>
> FUSE_LOOKUP_HANDLE is a contract.
> If fuse4fs can reuse nodeid after restart then by all means, it should sign
> this contract, otherwise there is no way for client to know that the
> nodeids are persistent.
> If fuse4fs_handle := nodeid, that will make implementing the lookup_handle()
> API trivial.
>
>>
>> I suppose you could just ask for refreshed stat information and either
>> the server gives it to you and the fuse_inode lives; or the server
>> returns ENOENT and then we mark it bad. But I'd have to see code
>> patches to form a real opinion.
>>
>
> You could make fuse4fs_handle := <nodeid:fuse_instance_id>
> where fuse_instance_id can be its start time or random number.
> for auto invalidate, or maybe the fuse_instance_id should be
> a native part of FUSE protocol so that client knows to only invalidate
> attr cache in case of fuse_instance_id change?
>
> In any case, instead of a storm of revalidate messages after
> server restart, do it lazily on demand.
For a network file system, probably. For fuse4fs or other block
based file systems, not sure. Darrick has the example of fsck.
Let's assume fuse4fs runs with attribute and dentry timeouts > 0,
fuse-server gets restarted, fsck'ed and some files get removed.
Now reading these inodes would still work - wouldn't it
be better to invalidate the cache before going into operation
again?
Thanks,
Bernd
Powered by blists - more mailing lists