[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJnrk1a4d__8RHu0EGN2Yfk3oOhqZLJ7fBCNQYdHoThPrvnOaQ@mail.gmail.com>
Date: Tue, 4 Nov 2025 11:22:26 -0800
From: Joanne Koong <joannelkoong@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/5] fuse: flush pending fuse events before aborting the connection
On Mon, Nov 3, 2025 at 2:13 PM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Mon, Nov 03, 2025 at 09:20:26AM -0800, Joanne Koong wrote:
> > On Tue, Oct 28, 2025 at 5:43 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@...nel.org>
> > >
> > > generic/488 fails with fuse2fs in the following fashion:
> > >
> > > generic/488 _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
> > > (see /var/tmp/fstests/generic/488.full for details)
> > >
> > > This test opens a large number of files, unlinks them (which really just
> > > renames them to fuse hidden files), closes the program, unmounts the
> > > filesystem, and runs fsck to check that there aren't any inconsistencies
> > > in the filesystem.
> > >
> > > Unfortunately, the 488.full file shows that there are a lot of hidden
> > > files left over in the filesystem, with incorrect link counts. Tracing
> > > fuse_request_* shows that there are a large number of FUSE_RELEASE
> > > commands that are queued up on behalf of the unlinked files at the time
> > > that fuse_conn_destroy calls fuse_abort_conn. Had the connection not
> > > aborted, the fuse server would have responded to the RELEASE commands by
> > > removing the hidden files; instead they stick around.
> > >
> > > For upper-level fuse servers that don't use fuseblk mode this isn't a
> > > problem because libfuse responds to the connection going down by pruning
> > > its inode cache and calling the fuse server's ->release for any open
> > > files before calling the server's ->destroy function.
> > >
> > > For fuseblk servers this is a problem, however, because the kernel sends
> > > FUSE_DESTROY to the fuse server, and the fuse server has to close the
> > > block device before returning. This means that the kernel must flush
> > > all pending FUSE_RELEASE requests before issuing FUSE_DESTROY.
> > >
> > > Create a function to push all the background requests to the queue and
> > > then wait for the number of pending events to hit zero, and call this
> > > before sending FUSE_DESTROY. That way, all the pending events are
> > > processed by the fuse server and we don't end up with a corrupt
> > > filesystem.
> > >
> > > Note that we use a wait_event_timeout() loop to cause the process to
> > > schedule at least once per second to avoid a "task blocked" warning:
> > >
> > > INFO: task umount:1279 blocked for more than 20 seconds.
> > > Not tainted 6.17.0-rc7-xfsx #rc7
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag.
> > > task:umount state:D stack:11984 pid:1279 tgid:1279 ppid:10690
> > >
> > > Earlier in the threads about this patch there was a (self-inflicted)
> > > dispute as to whether it was necessary to call touch_softlockup_watchdog
> > > in the loop body. Because the process goes to sleep, it's not necessary
> > > to touch the softlockup watchdog because we're not preventing another
> > > process from being scheduled on a CPU.
> > >
> > > Signed-off-by: "Darrick J. Wong" <djwong@...nel.org>
> > > ---
> > > fs/fuse/fuse_i.h | 5 +++++
> > > fs/fuse/dev.c | 35 +++++++++++++++++++++++++++++++++++
> > > fs/fuse/inode.c | 11 ++++++++++-
> > > 3 files changed, 50 insertions(+), 1 deletion(-)
> > >
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index c2f2a48156d6c5..aaa8574fd72775 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -1274,6 +1274,11 @@ void fuse_request_end(struct fuse_req *req);
> > > void fuse_abort_conn(struct fuse_conn *fc);
> > > void fuse_wait_aborted(struct fuse_conn *fc);
> > >
> > > +/**
> > > + * Flush all pending requests and wait for them.
> > > + */
> > > +void fuse_flush_requests_and_wait(struct fuse_conn *fc);
> > > +
> > > /* Check if any requests timed out */
> > > void fuse_check_timeout(struct work_struct *work);
> > >
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 132f38619d7072..ecc0a5304c59d1 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -24,6 +24,7 @@
> > > #include <linux/splice.h>
> > > #include <linux/sched.h>
> > > #include <linux/seq_file.h>
> > > +#include <linux/nmi.h>
> > >
> > > #include "fuse_trace.h"
> > >
> > > @@ -2430,6 +2431,40 @@ static void end_polls(struct fuse_conn *fc)
> > > }
> > > }
> > >
> > > +/*
> > > + * Flush all pending requests and wait for them. Only call this function when
> > > + * it is no longer possible for other threads to add requests.
> > > + */
> > > +void fuse_flush_requests_and_wait(struct fuse_conn *fc)
> > > +{
> > > + spin_lock(&fc->lock);
> >
> > Do we need to grab the fc lock? fc->connected is protected under the
> > bg_lock, afaict from fuse_abort_conn().
>
> Oh, heh. Yeah, it does indeed take both fc->lock and fc->bg_lock.
> Will fix that, thanks. :)
>
> FWIW I don't think it's a big deal if we see a stale connected==1 value
> because the events will all get cancelled and the wait loop won't run
> anyway, but I agree with being consistent about lock ordering. :)
>
> > > + if (!fc->connected) {
> > > + spin_unlock(&fc->lock);
> > > + return;
> > > + }
> > > +
> > > + /* Push all the background requests to the queue. */
> > > + spin_lock(&fc->bg_lock);
> > > + fc->blocked = 0;
> > > + fc->max_background = UINT_MAX;
> > > + flush_bg_queue(fc);
> > > + spin_unlock(&fc->bg_lock);
> > > + spin_unlock(&fc->lock);
> > > +
> > > + /*
> > > + * Wait for all pending fuse requests to complete or abort. The fuse
> > > + * server could take a significant amount of time to complete a
> > > + * request, so run this in a loop with a short timeout so that we don't
> > > + * trip the soft lockup detector.
> > > + */
> > > + smp_mb();
> > > + while (wait_event_timeout(fc->blocked_waitq,
> > > + !fc->connected || atomic_read(&fc->num_waiting) == 0,
> > > + HZ) == 0) {
> > > + /* empty */
> > > + }
> >
> > I'm wondering if it's necessary to wait here for all the pending
> > requests to complete or abort?
>
> I'm not 100% sure what the fuse client shutdown sequence is supposed to
> be. If someone kills a program with a large number of open unlinked
> files and immediately calls umount(), then the fuse client could be in
> the process of sending FUSE_RELEASE requests to the server.
>
> [background info, feel free to speedread this paragraph]
> For a non-fuseblk server, unmount aborts all pending requests and
> disconnects the fuse device. This means that the fuse server won't see
> all the FUSE_REQUESTs before libfuse calls ->destroy having observed the
> fusedev shutdown. The end result is that (on fuse2fs anyway) you end up
> with a lot of .fuseXXXXX files that nobody cleans up.
>
> If you make ->destroy release all the remaining open files, now you run
> into a second problem, which is that if there are a lot of open unlinked
> files, freeing the inodes can collectively take enough time that the
> FUSE_DESTROY request times out.
>
> On a fuseblk server with libfuse running in multithreaded mode, there
> can be several threads reading fuse requests from the fusedev. The
> kernel actually sends its own FUSE_DESTROY request, but there's no
> coordination between the fuse workers, which means that the fuse server
> can process FUSE_DESTROY at the same time it's processing FUSE_RELEASE.
> If ->destroy closes the filesystem before the FUSE_RELEASE requests are
> processed, you end up with the same .fuseXXXXX file cleanup problem.
imo it is the responsibility of the server to coordinate this and make
sure it has handled all the requests it has received before it starts
executing the destruction logic. imo the only responsibility of the
kernel is to actually send the background requests before it sends the
FUSE_DESTROY. I think non-fuseblk servers should also receive the
FUSE_DESTROY request.
>
> Here, if you make a fuseblk server's ->destroy release all the remaining
> open files, you have an even worse problem, because that could race with
> an existing libfuse worker that's processing a FUSE_RELEASE for the same
> open file.
>
> In short, the client has a FUSE_RELEASE request that pairs with the
> FUSE_OPEN request. During regular operations, an OPEN always ends with
> a RELEASE. I don't understand why unmount is special in that it aborts
> release requests without even sending them to the server; that sounds
> like a bug to me. Worse yet, I looked on Debian codesearch, and nearly
> all of the fuse servers I found do not appear to handle this correctly.
> My guess is that it's uncommon to close 100,000 unlinked open files on a
> fuse filesystem and immediately unmount it. Network filesystems can get
> away with not caring.
>
> For fuse+iomap, I want unmount to send FUSE_SYNCFS after all open files
> have been RELEASEd so that client can know that (a) the filesystem (at
> least as far as the kernel cares) is quiesced, and (b) the server
> persisted all dirty metadata to disk. Only then would I send the
> FUSE_DESTROY.
Hmm, is FUSE_FLUSH not enough? As I recently learned (from Amir),
every close() triggers a FUSE_FLUSH. For dirty metadata related to
writeback, every release triggers a synchronous write_inode_now().
>
> > We are already guaranteeing that the
> > background requests get sent before we issue the FUSE_DESTROY, so it
> > seems to me like this is already enough and we could skip the wait
> > because the server should make sure it completes the prior requests
> > it's received before it executes the destruction logic.
>
> That's just the thing -- fuse_conn_destroy calls fuse_abort_conn which
> aborts all the pending background requests so the server never sees
> them.
The FUSE_DESTROY request gets sent before fuse_abort_conn() is called,
so to me, it seems like if we flush all the background requests and
then send the FUSE_DESTROY, that suffices.
With the "while (wait_event_timeout(fc->blocked_waitq, !fc->connected
|| atomic_read(&fc->num_waiting) == 0...)" logic, I think this also
now means if a server is tripped up somewhere (eg if a remote network
connection is lost or it runs into a deadlock when servicing a
request) where it's unable to fulfill any one of its previous
requests, unmounting would hang.
Thanks,
Joanne
>
> --D
>
> > Thanks,
> > Joanne
> >
> > > +}
> > > +
> > > /*
> > > * Abort all requests.
> > > *
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index d1babf56f25470..d048d634ef46f5 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -2094,8 +2094,17 @@ void fuse_conn_destroy(struct fuse_mount *fm)
> > > {
> > > struct fuse_conn *fc = fm->fc;
> > >
> > > - if (fc->destroy)
> > > + if (fc->destroy) {
> > > + /*
> > > + * Flush all pending requests (most of which will be
> > > + * FUSE_RELEASE) before sending FUSE_DESTROY, because the fuse
> > > + * server must close the filesystem before replying to the
> > > + * destroy message, because unmount is about to release its
> > > + * O_EXCL hold on the block device.
> > > + */
> > > + fuse_flush_requests_and_wait(fc);
> > > fuse_send_destroy(fm);
> > > + }
> > >
> > > fuse_abort_conn(fc);
> > > fuse_wait_aborted(fc);
> > >
Powered by blists - more mailing lists