Message-ID: <CAOQ4uxhm3=P-kJn3Liu67bhhMODZOM7AUSLFJRiy_neuz6g80g@mail.gmail.com>
Date: Mon, 15 Sep 2025 09:07:34 +0200
From: Amir Goldstein <amir73il@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Bernd Schubert <bernd@...ernd.com>, Luis Henriques <luis@...lia.com>, "Theodore Ts'o" <tytso@....edu>,
Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bschubert@....com>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, Kevin Chen <kchen@....com>
Subject: Re: [RFC] Another take at restarting FUSE servers
On Fri, Sep 12, 2025 at 4:58 PM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Fri, Sep 12, 2025 at 02:29:03PM +0200, Bernd Schubert wrote:
> >
> >
> > On 9/12/25 13:41, Amir Goldstein wrote:
> > > On Fri, Sep 12, 2025 at 12:31 PM Bernd Schubert <bernd@...ernd.com> wrote:
> > >>
> > >>
> > >>
> > >> On 8/1/25 12:15, Luis Henriques wrote:
> > >>> On Thu, Jul 31 2025, Darrick J. Wong wrote:
> > >>>
> > >>>> On Thu, Jul 31, 2025 at 09:04:58AM -0400, Theodore Ts'o wrote:
> > >>>>> On Tue, Jul 29, 2025 at 04:38:54PM -0700, Darrick J. Wong wrote:
> > >>>>>>
> > >>>>>> Just speaking for fuse2fs here -- that would be kinda nifty if libfuse
> > >>>>>> could restart itself. It's unclear if doing so will actually enable us
> > >>>>>> to clear the condition that caused the failure in the first place, but I
> > >>>>>> suppose fuse2fs /does/ have e2fsck -fy at hand. So maybe restarts
> > >>>>>> aren't totally crazy.
> > >>>>>
> > >>>>> I'm trying to understand what the failure scenario is here. Is this
> > >>>>> if the userspace fuse server (i.e., fuse2fs) has crashed? If so, what
> > >>>>> is supposed to happen with respect to open files, metadata and data
> > >>>>> modifications which were in transit, etc.? Sure, fuse2fs could run
> > >>>>> e2fsck -fy, but if there are dirty inodes on the system, that's
> > >>>>> potentially going to be out of sync, right?
> > >>>>>
> > >>>>> What are the recovery semantics that we hope to be able to provide?
> > >>>>
> > >>>> <echoing what we said on the ext4 call this morning>
> > >>>>
> > >>>> With iomap, most of the dirty state is in the kernel, so I think the new
> > >>>> fuse2fs instance would poke the kernel with FUSE_NOTIFY_RESTARTED, which
> > >>>> would initiate GETATTR requests on all the cached inodes to validate
> > >>>> that they still exist; and then resend all the unacknowledged requests
> > >>>> that were pending at the time. It might be the case that you have
> > >>>> to do that in the reverse order; I only know enough about the design
> > >>>> of fuse to suspect that to be true.
> > >>>>
> > >>>> Anyhow once those are complete, I think we can resume operations with
> > >>>> the surviving inodes. The ones that fail the GETATTR revalidation are
> > >>>> fuse_make_bad'd, which effectively revokes them.
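For illustration, a minimal userspace sketch of how a restarted server
could poke the kernel with such a notification. The framing - a
fuse_out_header written to /dev/fuse with unique == 0 and the notify code
in 'error' - is how existing FUSE notifications work; the
FUSE_NOTIFY_RESTARTED value itself is hypothetical and not in
<linux/fuse.h> today:

#include <errno.h>
#include <unistd.h>
#include <linux/fuse.h>

#define FUSE_NOTIFY_RESTARTED 64        /* hypothetical notify code */

static int notify_restarted(int fuse_dev_fd)
{
        struct fuse_out_header oh = {
                .len    = sizeof(oh),
                .error  = FUSE_NOTIFY_RESTARTED, /* notify code goes in 'error' */
                .unique = 0,                     /* unique == 0 marks a notification */
        };

        /* No payload in this sketch; a real notification may carry one. */
        if (write(fuse_dev_fd, &oh, sizeof(oh)) != (ssize_t)sizeof(oh))
                return -errno;
        return 0;
}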
> > >>>
> > >>> Ah! Interesting, I have been playing a bit with sending LOOKUP requests,
> > >>> but probably GETATTR is a better option.
> > >>>
> > >>> So, are you currently working on any of this? Are you implementing this
> > >>> new NOTIFY_RESTARTED request? I guess it's time for me to have a closer
> > >>> look at fuse2fs too.
> > >>
> > >> Sorry for joining the discussion late, I was totally occupied, day and
> > >> night. Added Kevin to CC, who is going to work on recovery on our
> > >> DDN side.
> > >>
> > >> The issue with GETATTR and LOOKUP is that they need a path, but on
> > >> fuse server restart we want the kernel to recover inodes and their
> > >> lookup count. Inode recovery might be hard, because we currently only
> > >> have a 64-bit node-id - which is used by most fuse applications as a
> > >> memory pointer.
> > >>
> > >> As Luis wrote, my issue with FUSE_NOTIFY_RESEND is that it just
> > >> re-sends outstanding requests. In most cases that ends up sending
> > >> requests with invalid node-IDs, which are cast to pointers and might
> > >> provoke random memory access after a restart. It is essentially the
> > >> same reason why fuse NFS export and open_by_handle_at don't work well
> > >> right now.
> > >>
> > >> So IMHO, what we really want is something like FUSE_LOOKUP_FH, which
> > >> would not return a 64-bit node ID, but a file handle of at most 128
> > >> bytes. And then FUSE_REVALIDATE_FH on server restart.
> > >> The file handles could be stored in the fuse inode and also used for
> > >> NFS export.
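To make the idea concrete, a rough sketch of what the wire structures
could look like - purely hypothetical, nothing below exists in
<linux/fuse.h> today:

#include <stdint.h>

#define FUSE_MAX_HANDLE_SIZE    128

/* reply to FUSE_LOOKUP_FH: identify the inode by handle, not by nodeid */
struct fuse_lookup_fh_out {
        uint32_t handle_len;                    /* <= FUSE_MAX_HANDLE_SIZE */
        uint32_t padding;
        uint8_t  handle[FUSE_MAX_HANDLE_SIZE];  /* opaque to the kernel */
        /* followed by the usual entry/attr timeout fields */
};

/* sent by the kernel after a server restart to revalidate a cached inode */
struct fuse_revalidate_fh_in {
        uint32_t handle_len;
        uint32_t padding;
        uint8_t  handle[FUSE_MAX_HANDLE_SIZE];
};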
> > >>
> > >> I *think* Amir had a similar idea, but I can't find the link quickly.
> > >> Adding Amir to CC.
> > >
> > > Or maybe it was Miklos' idea. Hard to keep track of this rolling thread:
> > > https://lore.kernel.org/linux-fsdevel/CAJfpegvNZ6Z7uhuTdQ6quBaTOYNkAP8W_4yUY4L2JRAEKxEwOQ@mail.gmail.com/
> >
> > Thanks for the reference, Amir! I had even been in that thread.
> >
> > >
> > >>
> > >> Our short term plan is to add something like FUSE_NOTIFY_RESTART, which
> > >> will iterate over all superblock inodes and mark them with fuse_make_bad.
> > >> Any objections to that?
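Kernel-side, I imagine that would look roughly like this (sketch only:
fuse_make_bad() exists today, but the notification handler, its plumbing
and the exact locking rules are hand-waved):

/* hypothetical handler, would live somewhere in fs/fuse/ */
static void fuse_notify_restart(struct fuse_mount *fm)
{
        struct super_block *sb = fm->sb;
        struct inode *inode;

        spin_lock(&sb->s_inode_list_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                spin_lock(&inode->i_lock);
                if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                spin_unlock(&inode->i_lock);
                /* mark bad so access fails until the inode is looked up again */
                fuse_make_bad(inode);
        }
        spin_unlock(&sb->s_inode_list_lock);
}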
>
> What if you actually /can/ reuse a nodeid after a restart? Consider
> fuse4fs, where the nodeid is the on-disk inode number. After a restart,
> you can reconnect the fuse_inode to the ondisk inode, assuming recovery
> didn't delete it, obviously.
FUSE_LOOKUP_HANDLE is a contract.
If fuse4fs can reuse nodeids after a restart then by all means it should
sign this contract; otherwise there is no way for the client to know that
the nodeids are persistent.
If fuse4fs_handle := nodeid, that makes implementing the lookup_handle()
API trivial.
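i.e. something like this (sketch only; the lookup_handle() /
FUSE_LOOKUP_HANDLE operation is hypothetical, the point is just that the
handle is a memcpy of the nodeid):

#include <stdint.h>
#include <string.h>

/* encode: the whole "file handle" is the 64-bit on-disk inode number */
static size_t fuse4fs_encode_handle(uint64_t nodeid, uint8_t *buf)
{
        memcpy(buf, &nodeid, sizeof(nodeid));
        return sizeof(nodeid);
}

/* decode: map the handle back to the nodeid on lookup_handle() */
static int fuse4fs_decode_handle(const uint8_t *buf, size_t len,
                                 uint64_t *nodeid)
{
        if (len != sizeof(*nodeid))
                return -1;
        memcpy(nodeid, buf, sizeof(*nodeid));
        return 0;
}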
>
> I suppose you could just ask for refreshed stat information and either
> the server gives it to you and the fuse_inode lives; or the server
> returns ENOENT and then we mark it bad. But I'd have to see code
> patches to form a real opinion.
>
You could make fuse4fs_handle := <nodeid:fuse_instance_id>,
where fuse_instance_id can be the server's start time or a random number,
for automatic invalidation. Or maybe fuse_instance_id should be a native
part of the FUSE protocol, so that the client knows to invalidate the
attr cache only when fuse_instance_id changes?
In any case, instead of a storm of revalidate messages after a server
restart, do it lazily on demand.
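Something along these lines (hypothetical struct and helper, just to show
the shape of the handle and of the lazy check):

#include <stdbool.h>
#include <stdint.h>

struct fuse4fs_handle {
        uint64_t nodeid;                /* persistent on-disk inode number */
        uint64_t fuse_instance_id;      /* server start time or random */
};

/*
 * Called lazily when a cached inode is next used: drop the attr cache
 * only if the server instance behind the handle has changed.
 */
static bool fuse4fs_needs_revalidate(const struct fuse4fs_handle *h,
                                     uint64_t current_instance_id)
{
        return h->fuse_instance_id != current_instance_id;
}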
Thanks,
Amir.