Message-ID: <20251108004303.GX196362@frogsfrogsfrogs>
Date: Fri, 7 Nov 2025 16:43:03 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/5] fuse: flush pending fuse events before aborting the
 connection

On Fri, Nov 07, 2025 at 11:18:33AM -0800, Joanne Koong wrote:
> On Thu, Nov 6, 2025 at 8:26 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > [I read this email backwards, like I do]
> >
> > On Thu, Nov 06, 2025 at 10:37:41AM -0800, Joanne Koong wrote:
> > > On Wed, Nov 5, 2025 at 4:17 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > >
> > > > On Tue, Nov 04, 2025 at 11:22:26AM -0800, Joanne Koong wrote:
> > > >
> > > > <snipping here because this thread has gotten very long>
> > > >
> > > > > > > > +       while (wait_event_timeout(fc->blocked_waitq,
> > > > > > > > +                       !fc->connected || atomic_read(&fc->num_waiting) == 0,
> > > > > > > > +                       HZ) == 0) {
> > > > > > > > +               /* empty */
> > > > > > > > +       }
> > > > > > >
> > > > > > > I'm wondering if it's necessary to wait here for all the pending
> > > > > > > requests to complete or abort?
> > > > > >
> > > > > > I'm not 100% sure what the fuse client shutdown sequence is supposed to
> > > > > > be.  If someone kills a program with a large number of open unlinked
> > > > > > files and immediately calls umount(), then the fuse client could be in
> > > > > > the process of sending FUSE_RELEASE requests to the server.
> > > > > >
> > > > > > [background info, feel free to speedread this paragraph]
> > > > > > For a non-fuseblk server, unmount aborts all pending requests and
> > > > > > disconnects the fuse device.  This means that the fuse server won't see
> > > > > > all the FUSE_RELEASE requests before libfuse, having observed the
> > > > > > fusedev shutdown, calls ->destroy.  The end result is that (on fuse2fs
> > > > > > anyway) you end up
> > > > > > with a lot of .fuseXXXXX files that nobody cleans up.
> > > > > >
> > > > > > If you make ->destroy release all the remaining open files, now you run
> > > > > > into a second problem, which is that if there are a lot of open unlinked
> > > > > > files, freeing the inodes can collectively take enough time that the
> > > > > > FUSE_DESTROY request times out.
> > > > > >
> > > > > > On a fuseblk server with libfuse running in multithreaded mode, there
> > > > > > can be several threads reading fuse requests from the fusedev.  The
> > > > > > kernel actually sends its own FUSE_DESTROY request, but there's no
> > > > > > coordination between the fuse workers, which means that the fuse server
> > > > > > can process FUSE_DESTROY at the same time it's processing FUSE_RELEASE.
> > > > > > If ->destroy closes the filesystem before the FUSE_RELEASE requests are
> > > > > > processed, you end up with the same .fuseXXXXX file cleanup problem.
> > > > >
> > > > > imo it is the responsibility of the server to coordinate this and make
> > > > > sure it has handled all the requests it has received before it starts
> > > > > executing the destruction logic.
> > > >
> > > > I think we're all saying that some sort of fuse request reordering
> > > > barrier is needed here, but there's at least three opinions about where
> > > > that barrier should be implemented.  Clearly I think the barrier should
> > > > be in the kernel, but let me think more about where it could go if it
> > > > were somewhere else.
> > > >
> > > > First, Joanne's suggestion for putting it in the fuse server itself:
> > > >
> > > > I don't see how it's generally possible for the fuse server to know that
> > > > it's processed all the requests that the kernel might have sent it.
> > > > AFAICT each libfuse thread does roughly this:
> > > >
> > > > 1. read() a request from the fusedev fd
> > > > 2. decode the request data and maybe do some allocations or transform it
> > > > 3. call fuse server with request
> > > > 4. fuse server does ... something with the request
> > > > 5. fuse server finishes, hops back to libfuse / calls fuse_reply_XXX
> > > >
> > > > Let's say thread 1 is at step 4 with a FUSE_DESTROY.  How does it find
> > > > out if there are other fuse worker threads that are somewhere in steps
> > > > 2 or 3?  AFAICT the library doesn't keep track of the number of threads
> > > > that are waiting in fuse_session_receive_buf_internal, so fuse servers
> > > > can't ask the library about that either.
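> > > >
> > > > For reference, here's the rough shape of that loop in terms of the
> > > > public libfuse 3 session API (a simplified sketch; the real
> > > > fuse_do_work has more bookkeeping than this):
> > > >
> > > >     #include <fuse_lowlevel.h>
> > > >     #include <stdlib.h>
> > > >
> > > >     static void *worker(void *arg)
> > > >     {
> > > >             struct fuse_session *se = arg;
> > > >             struct fuse_buf buf = { .mem = NULL };
> > > >
> > > >             while (!fuse_session_exited(se)) {
> > > >                     /* step 1: read one request from the fusedev fd */
> > > >                     int res = fuse_session_receive_buf(se, &buf);
> > > >
> > > >                     if (res == -EINTR)
> > > >                             continue;
> > > >                     if (res <= 0)
> > > >                             break;
> > > >
> > > >                     /* steps 2-5: decode, call the server, send reply */
> > > >                     fuse_session_process_buf(se, &buf);
> > > >             }
> > > >             free(buf.mem);
> > > >             return NULL;
> > > >     }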
> > > >
> > > > Taking a narrower view, it might be possible for the fuse server to
> > > > figure this out by maintaining an open resource count.  It would
> > > > increment this counter when a FUSE_{OPEN,CREATE} request succeeds and
> > > > decrement it when FUSE_RELEASE comes in.  Assuming that FUSE_RELEASE is
> > > > the only kind of request that can be pending when a FUSE_DESTROY comes
> > > > in, then destroy just has to wait for the counter to hit zero.
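> > > >
> > > > A minimal sketch of that bookkeeping (all names hypothetical,
> > > > untested):
> > > >
> > > >     #include <pthread.h>
> > > >
> > > >     static pthread_mutex_t open_lock = PTHREAD_MUTEX_INITIALIZER;
> > > >     static pthread_cond_t  open_cond = PTHREAD_COND_INITIALIZER;
> > > >     static unsigned long   nr_open;  /* files opened but not released */
> > > >
> > > >     /* call after replying successfully to FUSE_OPEN/FUSE_CREATE */
> > > >     static void track_open(void)
> > > >     {
> > > >             pthread_mutex_lock(&open_lock);
> > > >             nr_open++;
> > > >             pthread_mutex_unlock(&open_lock);
> > > >     }
> > > >
> > > >     /* call at the end of the FUSE_RELEASE handler */
> > > >     static void track_release(void)
> > > >     {
> > > >             pthread_mutex_lock(&open_lock);
> > > >             if (--nr_open == 0)
> > > >                     pthread_cond_broadcast(&open_cond);
> > > >             pthread_mutex_unlock(&open_lock);
> > > >     }
> > > >
> > > >     /* call from the FUSE_DESTROY handler before tearing down */
> > > >     static void wait_for_releases(void)
> > > >     {
> > > >             pthread_mutex_lock(&open_lock);
> > > >             while (nr_open > 0)
> > > >                     pthread_cond_wait(&open_cond, &open_lock);
> > > >             pthread_mutex_unlock(&open_lock);
> > > >     }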
> > >
> > > I was thinking this logic could be in libfuse's fuse_loop_mt.c: if
> > > there are X worker threads all running fuse_do_work(), then when one
> > > of those threads gets a FUSE_DESTROY, it can set some se->destroyed
> > > field.  At this point the other threads will have already called
> > > fuse_session_receive_buf_internal() on all the flushed background
> > > requests, so after they process those and return from
> > > fuse_session_process_buf_internal(), they check whether
> > > se->destroyed was set, and if it is they exit the thread, while the
> > > thread that got the FUSE_DESTROY sleeps until all the other threads
> > > have completed and then executes the destroy logic.  That to me
> > > seems like the cleanest approach.
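> > >
> > > A sketch of that flow (all names hypothetical, untested):
> > >
> > >     #include <pthread.h>
> > >     #include <stdbool.h>
> > >
> > >     /* hypothetical shared state sketching the proposal */
> > >     struct destroy_gate {
> > >             pthread_mutex_t lock;
> > >             pthread_cond_t  all_done;
> > >             bool            destroyed;   /* FUSE_DESTROY seen */
> > >             int             nr_workers;  /* threads still running */
> > >     };
> > >
> > >     /* each worker calls this after processing a buffer; returns
> > >      * true if the worker should exit */
> > >     static bool check_destroyed(struct destroy_gate *g)
> > >     {
> > >             bool exit_now;
> > >
> > >             pthread_mutex_lock(&g->lock);
> > >             exit_now = g->destroyed;
> > >             if (exit_now && --g->nr_workers == 1)
> > >                     pthread_cond_signal(&g->all_done);
> > >             pthread_mutex_unlock(&g->lock);
> > >             return exit_now;
> > >     }
> > >
> > >     /* the thread that received FUSE_DESTROY */
> > >     static void wait_then_destroy(struct destroy_gate *g)
> > >     {
> > >             pthread_mutex_lock(&g->lock);
> > >             g->destroyed = true;
> > >             while (g->nr_workers > 1)   /* everyone but us */
> > >                     pthread_cond_wait(&g->all_done, &g->lock);
> > >             pthread_mutex_unlock(&g->lock);
> > >             /* ...now run the destroy logic and reply... */
> > >     }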
> >
> > Hrm.  Well now (scrolling to the bottom and back) that I know that the
> > FUSE_DESTROY won't get put on the queue ahead of the FUSE_RELEASEs, I
> > think that /could/ work.
> >
> > One tricky thing with having worker threads check a flag and exit is
> > that they can be sleeping in the kernel (from _fuse_session_receive_buf)
> > when the "just go away" flag gets set.  If the thread never wakes up,
> > then it'll never exit.  In theory you could have the FUSE_DESTROY thread
> > call pthread_cancel on all the other worker threads to eliminate them
> > once they emerge from PTHREAD_CANCEL_DISABLE state, but I still have
> > nightmares from adventures in pthread_cancel at Sun in 2002. :P
> >
> > Maybe an easier approach would be to have fuse_do_work increment a
> > counter when it receives a buffer and decrement it when it finishes with
> > that buffer.  The FUSE_DESTROY thread merely has to wait for that
> > counter to reach 1, at which point it's the only thread with a request
> > to process, so it can call do_destroy.  That at least would avoid adding
> > a new user of pthread_cancel() into the mt loop code.
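> >
> > Something like this, where fuse_do_work would call inflight_inc()
> > right after receiving a buffer and inflight_dec() after processing it
> > (struct and helper names are hypothetical, untested):
> >
> >     #include <pthread.h>
> >
> >     struct mt_inflight {
> >             pthread_mutex_t lock;
> >             pthread_cond_t  quiesce;
> >             unsigned int    count;
> >     };
> >
> >     static void inflight_inc(struct mt_inflight *mt)
> >     {
> >             pthread_mutex_lock(&mt->lock);
> >             mt->count++;
> >             pthread_mutex_unlock(&mt->lock);
> >     }
> >
> >     static void inflight_dec(struct mt_inflight *mt)
> >     {
> >             pthread_mutex_lock(&mt->lock);
> >             mt->count--;
> >             pthread_cond_broadcast(&mt->quiesce);
> >             pthread_mutex_unlock(&mt->lock);
> >     }
> >
> >     /* FUSE_DESTROY handler: our own request is still counted, so
> >      * wait until we're the only one left in flight */
> >     static void wait_until_last(struct mt_inflight *mt)
> >     {
> >             pthread_mutex_lock(&mt->lock);
> >             while (mt->count > 1)
> >                     pthread_cond_wait(&mt->quiesce, &mt->lock);
> >             pthread_mutex_unlock(&mt->lock);
> >     }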
> >
> > > >
> > > > Is the above assumption correct?
> > > >
> > > > I don't see any fuse servers that actually *do* this, though.  I
> > > > perceive that there are a lot of fuse servers out there that aren't
> > > > packaged in Debian, so is this actually a common thing for
> > > > proprietary fuse servers that I wouldn't know about?
> > > >
> > > > Downthread, Bernd suggested doing this in libfuse instead of making the
> > > > fuse servers do it.  He asks:
> > > >
> > > > "There is something I don't understand though, how can FUSE_DESTROY
> > > > happen before FUSE_RELEASE is completed?
> > > >
> > > > "->release / fuse_release
> > > >    fuse_release_common
> > > >       fuse_file_release
> > > >          fuse_file_put
> > > >             fuse_simple_background
> > > >             <userspace>
> > > >             <userspace-reply>
> > > >                fuse_release_end
> > > >                   iput()"
> > > >
> > > > The answer to this is: fuse_file_release is always asynchronous now, so
> > > > the FUSE_RELEASE is queued to the background and the kernel moves on
> > > > with its life.
> > > >
> > > > It's likely much more effective to put the reordering barrier in the
> > > > library (ignoring all the vendored libfuse copies out there), provided
> > > > the above assumption holds.  I think it wouldn't be hard to have _do_open
> > > > (fuse_lowlevel.c) increment a counter in fuse_session, decrement it in
> > > > _do_release, and then _do_destroy would wait for it to hit zero.
> > > >
> > > > For a single-threaded fuse server I think this might not even be an
> > > > issue because the events are (AFAICT) processed in order.  However,
> > > > you'd have to be careful about how you did that for a multithreaded fuse
> > > > server.  You wouldn't want to spin in _do_destroy because that takes out
> > > > a thread that could be doing work.  Is there a way to park a request?
> > >
> > > If the background requests are flushed before the destroy request,
> > > then this doesn't take out a thread because all the background requests
> > > will already have been or are being serviced.
> >
> > <nod>
> >
> > I'm still concerned about a few things with the libfuse approach though.
> > The kernel is the initiator, so it knows the data dependencies between
> > requests.  Consequently, it's in the best position to know that if
> > request R2 depends on R1, then it shouldn't issue R2 until it has
> > received an acknowledgement for R1.  The fuse server is the target; it
> > shouldn't be second-guessing what the initiator wants.
> >
> > The second concern is that if a request timeout is in effect, then all
> > the time that libfuse spends waiting for other requests to drain is
> > charged to that request.  IOWs, if the timeout is 60s and libfuse holds
> > the FUSE_DESTROY for 40s, the fuse server only has 20s to reply to the
> > request whereas the sysadmin might have assumed that the server would
> > have a full 60s to flush the filesystem and exit.
> >
> > If you're worried about no-timeout fuse servers hanging the unmount
> > process, what about killing the fuse server?  Or telling the kernel to
> > abort the connection?  Either should suffice to kill the wait_event
> > loop.
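> >
> > (For reference, aborting from userspace is just a write to the
> > fusectl abort file; the connection number below is made up:)
> >
> >     #include <fcntl.h>
> >     #include <unistd.h>
> >
> >     /* the "42" is the fuse device minor under fusectl; a real
> >      * program would look up the right connection directory */
> >     static void abort_fuse_conn(void)
> >     {
> >             int fd = open("/sys/fs/fuse/connections/42/abort", O_WRONLY);
> >
> >             if (fd >= 0) {
> >                     write(fd, "1", 1);      /* any write aborts */
> >                     close(fd);
> >             }
> >     }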
> >
> > The third thing is that the iomap patchset will change the unmount
> > request sequence:
> >
> > 1. <some number of FUSE_RELEASEs>
> > 2. FUSE_SYNCFS to tell the fuse server to write all its dirty data
> > 3. FUSE_DESTROY to close the filesystem
> 
> imo it's the responsibility of FUSE_DESTROY to write out any lingering
> dirty data before tearing down state and there shouldn't need to be an
> extra FUSE_SYNCFS issued before the destroy.
> 
> >
> > If we put the ordering barrier in libfuse, then we'll have to modify
> > libfuse to flush all of (1) before processing (2), and then wait for (2)
> > to finish before processing (3).  But libfuse doesn't know that a
> > particular FUSE_SYNCFS will be succeeded by a FUSE_DESTROY.  I could
> > drop the SYNCFS and let DESTROY handle all the flushing.  But again, the
> > kernel already knows the ordering that it requires, so it should enforce
> > that ordering directly.
> >
> > (Sorry, I feel like I'm belaboring the point excessively, I'll stop)
> >
> > The libfuse approach /does/ have the small advantage that it can start
> > working on the FUSE_DESTROY as soon as the other workers quiesce because
> > it doesn't have to wait for the kernel to see the last FUSE_RELEASE
> > reply and generate the DESTROY request.
> 
> imo the main advantage is that handling this in userspace offers
> servers a lot more flexibility for controlling unmount/destroy
> behavior.  If we have this logic in the kernel, we're guaranteeing that
> all previous requests must be completed (no matter how long they take)
> before the server sees FUSE_DESTROY.  For example, if some servers in
> the future want to prioritize fast unmounts over having lingering
> unlinked files, the userspace approach allows for that.  Or if the
> server wants to write out data while RELEASE is happening, it can do
> that too instead of having to wait for the RELEASE to finish.
> 
> I'm not sure we even really need to wait for RELEASE to finish before
> replying to the DESTROY request.  For unlinking files and writing data
> out to storage, for example, can't you still do all of that even after
> replying to the DESTROY request?  The connection will be aborted, but
> that seems fine; it just means the client can't communicate with the
> server.  This would let the unmount finish quickly while the server
> logic keeps running even after the unmount / connection abort, and the
> server could then finish all the unlinking and writing to storage.
> Then when all of that is finished, the server could officially clean
> everything up and exit.

No, because fuseblk and fuse-iomap servers MUST close the block devices
before the umount(2) call returns so that an immediate mount(8) doesn't
fail because the block device is still open O_EXCL.  This is a
fundamental behavior of all local filesystems because O_EXCL is the
*only* mechanism that prevents two separate mounts of the same device,
and mount(2) fails immediately if it can't open a block device O_EXCL.
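
To make that concrete, here's a quick illustration of the exclusion,
assuming /dev/vdb is an unused block device and we're running as root:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* first exclusive open claims the bdev, roughly the same
             * claim that a mount holds for its lifetime */
            int a = open("/dev/vdb", O_RDWR | O_EXCL);

            /* second exclusive open fails with EBUSY until the first
             * one is dropped, i.e. until umount has really let go */
            int b = open("/dev/vdb", O_RDWR | O_EXCL);

            if (b < 0 && errno == EBUSY)
                    printf("device is still claimed\n");
            if (a >= 0)
                    close(a);
            if (b >= 0)
                    close(b);
            return 0;
    }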

fstests and many other scripts rely on the behavior that the block
device is always free when umount returns.  I know it's tempting to make
regular unmount return ASAP even if the filesystem isn't quite done, but
that's what umount -l is for.

--D

> > > > For a fuseblk filesystem this abort is very dangerous because the kernel
> > > > releases its O_EXCL hold on the block device in kill_block_super before
> > > > the fuse server has a chance to finish up and close the block device.
> > > > The fuseblk server itself could not have opened the block device O_EXCL
> > > > so that means there's a period where another process (or even another
> > > > fuseblk mount) could open the bdev O_EXCL and both try to write to the
> > > > block device.
> > > >
> > > > (I actually have been wondering who uses the fuse request timeouts?  In
> > > > my testing even 30min wasn't sufficient to avoid aborts for some of the
> > > > truncate/inactivation fstests.)
> > >
> > > Meta uses fuse request timeouts. We saw a few cases of deadlocks in
> > > some buggy fuse server implementations, so we now enforce default
> > > timeouts. The timeout is set to a pretty large number though. Our main
> > > use of it is to free/cleanup system resources if the server is
> > > deadlocked.
> >
> > If you can share, how long of a timeout?  I've noticed that some clouds
> > set their iscsi timeouts to 12h or more(!) and that's for a single SCSI
> > command.
> 
> Right now the default is 30 minutes.
> 
> Thanks,
> Joanne
> 
> >
> > > If it takes 30 minutes to do all the cleanup, then I think it's worse
> > > to have unmounting take that long than to just do a quicker unmount
> >
> > If you don't handle unlinked lists in an O(n) (or O(1)) way, then
> > unprivileged userspace programs can manipulate the filesystem so that it
> > actually /can/ take hours to unmount.  XFS, ext4, and now fuse4fs have
> > learned that the hard way. ;)
> >
> > > and have lingering unlinked files on the server. As a user, if I were
> > > unmounting something and it took that long, I would probably just kill
> > > the whole thing anyways.
> >
> > That very much depends on what you're going to do with that filesystem.
> > If you're disposing of a container then, meh, fire away.  Some people
> > "use" FS_IOC_SHUTDOWN to "terminate" containers quickly.
> >
> > > >
> > > > Aside: The reason why I abandoned making fuse2fs a fuseblk server is
> > > > because I realized this exact trap -- the fuse server MUST have
> > > > exclusive write access to the device at all times, or else it can race
> > > > with other programs (e.g. tune2fs) and corrupt the filesystem.  In
> > > > fuseblk mode the kernel owns the exclusive access but doesn't
> > > > install that file in the server's fd table.  At best the fuse server can
> > > > pretend that it has exclusive write access, but the kernel can make that
> > > > go away without telling the fuse server, which opens a world of hurt.
> > > >
> > > > > imo the only responsibility of the
> > > > > kernel is to actually send the background requests before it sends the
> > > > > FUSE_DESTROY. I think non-fuseblk servers should also receive the
> > > > > FUSE_DESTROY request.
> > > >
> > > > They do receive it because fuse_session_destroy calls ->destroy if no
> > > > event has been received from the kernel after the fusedev shuts down.
> > > >
> > > > > >
> > > > > > Here, if you make a fuseblk server's ->destroy release all the remaining
> > > > > > open files, you have an even worse problem, because that could race with
> > > > > > an existing libfuse worker that's processing a FUSE_RELEASE for the same
> > > > > > open file.
> > > > > >
> > > > > > In short, the client has a FUSE_RELEASE request that pairs with the
> > > > > > FUSE_OPEN request.  During regular operations, an OPEN always ends with
> > > > > > a RELEASE.  I don't understand why unmount is special in that it aborts
> > > > > > release requests without even sending them to the server; that sounds
> > > > > > like a bug to me.  Worse yet, I looked on Debian codesearch, and nearly
> > > > > > all of the fuse servers I found do not appear to handle this correctly.
> > > > > > My guess is that it's uncommon to close 100,000 unlinked open files on a
> > > > > > fuse filesystem and immediately unmount it.  Network filesystems can get
> > > > > > away with not caring.
> > > > > >
> > > > > > For fuse+iomap, I want unmount to send FUSE_SYNCFS after all open files
> > > > > > have been RELEASEd so that the client can know that (a) the filesystem (at
> > > > > > least as far as the kernel cares) is quiesced, and (b) the server
> > > > > > persisted all dirty metadata to disk.  Only then would I send the
> > > > > > FUSE_DESTROY.
> > > > >
> > > > > Hmm, is FUSE_FLUSH not enough? As I recently learned (from Amir),
> > > > > every close() triggers a FUSE_FLUSH. For dirty metadata related to
> > > > > writeback, every release triggers a synchronous write_inode_now().
> > > >
> > > > It's not sufficient, because there might be other cached dirty metadata
> > > > that needs to be flushed out to disk.  A fuse server could respond to a
> > > > FUSE_FLUSH by pushing out that inode's dirty metadata to disk but go no
> > > > farther.  Plumbing in FUSE_SYNCFS for iomap helps a lot in that regard
> > > > because that's a signal that we need to push dirty ext4 bitmaps and
> > > > group descriptors and whatnot out to storage; without it we end up doing
> > > > all that at destroy time.
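> > > >
> > > > Schematically, on the server side (handler name, flush helper, and
> > > > bdev_fd are all hypothetical here):
> > > >
> > > >     #include <errno.h>
> > > >     #include <unistd.h>
> > > >
> > > >     extern int flush_dirty_metadata(void);  /* hypothetical */
> > > >     extern int bdev_fd;                     /* hypothetical */
> > > >
> > > >     /* FUSE_SYNCFS: push cached bitmaps/group descriptors/sb out */
> > > >     static int handle_syncfs(void)
> > > >     {
> > > >             int err = flush_dirty_metadata();
> > > >
> > > >             if (err)
> > > >                     return err;
> > > >             /* make it durable rather than just dirty pagecache */
> > > >             return fsync(bdev_fd) ? -errno : 0;
> > > >     }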
> > > >
> > > > > > > We are already guaranteeing that the
> > > > > > > background requests get sent before we issue the FUSE_DESTROY, so it
> > > > > > > seems to me like this is already enough and we could skip the wait
> > > > > > > because the server should make sure it completes the prior requests
> > > > > > > it's received before it executes the destruction logic.
> > > > > >
> > > > > > That's just the thing -- fuse_conn_destroy calls fuse_abort_conn which
> > > > > > aborts all the pending background requests so the server never sees
> > > > > > them.
> > > > >
> > > > > The FUSE_DESTROY request gets sent before fuse_abort_conn() is called,
> > > > > so to me, it seems like if we flush all the background requests and
> > > > > then send the FUSE_DESTROY, that suffices.
> > > >
> > > > I think it's worse than that -- fuse_send_destroy sets fuse_args::force
> > > > and sends the request synchronously, which (afaict) means it jumps ahead
> > > > of the backgrounded requests.
> > >
> > > Hmm, where are you seeing that? afaict, args->force forces the request
> > > to be sent to userspace even if interrupted and it skips the
> > > fuse_block_alloc() check.
> >
> > Oh!  You're right, the FUSE_DESTROY request is list_add_tail'd to
> > fiq->pending, just like every other req, because they all go through
> > fuse_dev_queue_req.  Sorry about misreading that, but thank /you/ for
> > pointing it out! :)
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > >
> 
