Message-ID: <20251110185429.GZ196362@frogsfrogsfrogs>
Date: Mon, 10 Nov 2025 10:54:29 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Bernd Schubert <bernd@...ernd.com>
Cc: Joanne Koong <joannelkoong@...il.com>, miklos@...redi.hu,
	neal@...pa.dev, linux-ext4@...r.kernel.org,
	linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 1/5] fuse: flush pending fuse events before aborting the
 connection

On Mon, Nov 10, 2025 at 07:22:57PM +0100, Bernd Schubert wrote:
> 
> 
> On 11/10/25 18:56, Darrick J. Wong wrote:
> > On Fri, Nov 07, 2025 at 04:02:54PM -0800, Darrick J. Wong wrote:
> >> On Fri, Nov 07, 2025 at 11:03:24PM +0100, Bernd Schubert wrote:
> >>>
> >>>
> >>> On 11/7/25 05:26, Darrick J. Wong wrote:
> >>>> [I read this email backwards, like I do]
> >>>>
> >>>> On Thu, Nov 06, 2025 at 10:37:41AM -0800, Joanne Koong wrote:
> >>>>> On Wed, Nov 5, 2025 at 4:17 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >>>>>>
> >>>>>> On Tue, Nov 04, 2025 at 11:22:26AM -0800, Joanne Koong wrote:
> >>>>>>
> >>>>>> <snipping here because this thread has gotten very long>
> >>>>>>
> >>>>>>>>>> +       while (wait_event_timeout(fc->blocked_waitq,
> >>>>>>>>>> +                       !fc->connected || atomic_read(&fc->num_waiting) == 0,
> >>>>>>>>>> +                       HZ) == 0) {
> >>>>>>>>>> +               /* empty */
> >>>>>>>>>> +       }
> >>>>>>>>>
> >>>>>>>>> I'm wondering if it's necessary to wait here for all the pending
> >>>>>>>>> requests to complete or abort?
> >>>>>>>>
> >>>>>>>> I'm not 100% sure what the fuse client shutdown sequence is supposed to
> >>>>>>>> be.  If someone kills a program with a large number of open unlinked
> >>>>>>>> files and immediately calls umount(), then the fuse client could be in
> >>>>>>>> the process of sending FUSE_RELEASE requests to the server.
> >>>>>>>>
> >>>>>>>> [background info, feel free to speedread this paragraph]
> >>>>>>>> For a non-fuseblk server, unmount aborts all pending requests and
> >>>>>>>> disconnects the fuse device.  This means that the fuse server won't see
> >>>>>>>> all of the pending requests before libfuse, having observed the fusedev
> >>>>>>>> shutdown, calls ->destroy.  The end result is that (on fuse2fs anyway) you end up
> >>>>>>>> with a lot of .fuseXXXXX files that nobody cleans up.
> >>>>>>>>
> >>>>>>>> If you make ->destroy release all the remaining open files, now you run
> >>>>>>>> into a second problem, which is that if there are a lot of open unlinked
> >>>>>>>> files, freeing the inodes can collectively take enough time that the
> >>>>>>>> FUSE_DESTROY request times out.
> >>>>>>>>
> >>>>>>>> On a fuseblk server with libfuse running in multithreaded mode, there
> >>>>>>>> can be several threads reading fuse requests from the fusedev.  The
> >>>>>>>> kernel actually sends its own FUSE_DESTROY request, but there's no
> >>>>>>>> coordination between the fuse workers, which means that the fuse server
> >>>>>>>> can process FUSE_DESTROY at the same time it's processing FUSE_RELEASE.
> >>>>>>>> If ->destroy closes the filesystem before the FUSE_RELEASE requests are
> >>>>>>>> processed, you end up with the same .fuseXXXXX file cleanup problem.
> >>>>>>>
> >>>>>>> imo it is the responsibility of the server to coordinate this and make
> >>>>>>> sure it has handled all the requests it has received before it starts
> >>>>>>> executing the destruction logic.
> >>>>>>
> >>>>>> I think we're all saying that some sort of fuse request reordering
> >>>>>> barrier is needed here, but there's at least three opinions about where
> >>>>>> that barrier should be implemented.  Clearly I think the barrier should
> >>>>>> be in the kernel, but let me think more about where it could go if it
> >>>>>> were somewhere else.
> >>>>>>
> >>>>>> First, Joanne's suggestion for putting it in the fuse server itself:
> >>>>>>
> >>>>>> I don't see how it's generally possible for the fuse server to know that
> >>>>>> it's processed all the requests that the kernel might have sent it.
> >>>>>> AFAICT each libfuse thread does roughly this:
> >>>>>>
> >>>>>> 1. read() a request from the fusedev fd
> >>>>>> 2. decode the request data and maybe do some allocations or transform it
> >>>>>> 3. call fuse server with request
> >>>>>> 4. fuse server does ... something with the request
> >>>>>> 5. fuse server finishes, hops back to libfuse / calls fuse_reply_XXX
> >>>>>>
> >>>>>> Let's say thread 1 is at step 4 with a FUSE_DESTROY.  How does it find
> >>>>>> out if there are other fuse worker threads that are somewhere in steps
> >>>>>> 2 or 3?  AFAICT the library doesn't keep track of the number of threads
> >>>>>> that are waiting in fuse_session_receive_buf_internal, so fuse servers
> >>>>>> can't ask the library about that either.
> >>>>>>
> >>>>>> Taking a narrower view, it might be possible for the fuse server to
> >>>>>> figure this out by maintaining an open resource count.  It would
> >>>>>> increment this counter when a FUSE_{OPEN,CREATE} request succeeds and
> >>>>>> decrement it when FUSE_RELEASE comes in.  Assuming that FUSE_RELEASE is
> >>>>>> the only kind of request that can be pending when a FUSE_DESTROY comes
> >>>>>> in, then destroy just has to wait for the counter to hit zero.
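> >>>>>>
> >>>>>> Roughly, in pseudo-C (hypothetical names, nothing here is actual
> >>>>>> libfuse code, and the spin-wait is just for illustration):
> >>>>>>
> >>>>>> 	static _Atomic uint64_t open_count;
> >>>>>>
> >>>>>> 	/* op_open/op_create handler, after replying success */
> >>>>>> 	atomic_fetch_add(&open_count, 1);
> >>>>>>
> >>>>>> 	/* op_release handler, after tearing down the file */
> >>>>>> 	atomic_fetch_sub(&open_count, 1);
> >>>>>>
> >>>>>> 	/* op_destroy handler: wait out the pending FUSE_RELEASEs */
> >>>>>> 	while (atomic_load(&open_count) > 0)
> >>>>>> 		sched_yield();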
> >>>>>
> >>>>> I was thinking this logic could be in libfuse's fuse_loop_mt.c. Where
> >>>>> if there are X worker threads that are all running fuse_do_work( )
> >>>>> then if you get a FUSE_DESTROY on one of those threads that thread can
> >>>>> set some se->destroyed field. At this point the other threads will
> >>>>> have already called fuse_session_receive_buf_internal() on all the
> >>>>> flushed background requests, so after they process it and return from
> >>>>> fuse_session_process_buf_internal(), then they check if se->destroyed
> >>>>> was set, and if it is they exit the thread, while in the thread that
> >>>>> got the FUSE_DESTROY it sleeps until all the threads have completed
> >>>>> and then it executes the destroy logic. That to me seems like the
> >>>>> cleanest approach.
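> >>>>>
> >>>>> Roughly (untested sketch; 'destroyed' would be a new fuse_session
> >>>>> field that doesn't exist today):
> >>>>>
> >>>>> 	/* in fuse_do_work(), after each buffer is handled */
> >>>>> 	fuse_session_process_buf_internal(se, &fbuf, w);
> >>>>> 	if (se->destroyed)
> >>>>> 		break;	/* worker exits; the FUSE_DESTROY thread
> >>>>> 			 * joins the others, then runs the destroy
> >>>>> 			 * logic */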
> >>>>
> >>>> Hrm.  Well now (scrolling to the bottom and back) that I know that the
> >>>> FUSE_DESTROY won't get put on the queue ahead of the FUSE_RELEASEs, I
> >>>> think that /could/ work.
> >>>>
> >>>> One tricky thing with having worker threads check a flag and exit is
> >>>> that they can be sleeping in the kernel (from _fuse_session_receive_buf)
> >>>> when the "just go away" flag gets set.  If the thread never wakes up,
> >>>> then it'll never exit.  In theory you could have the FUSE_DESTROY thread
> >>>> call pthread_cancel on all the other worker threads to eliminate them
> >>>> once they emerge from PTHREAD_CANCEL_DISABLE state, but I still have
> >>>> nightmares from adventures in pthread_cancel at Sun in 2002. :P
> >>>>
> >>>> Maybe an easier approach would be to have fuse_do_work increment a
> >>>> counter when it receives a buffer and decrement it when it finishes with
> >>>> that buffer.  The FUSE_DESTROY thread merely has to wait for that
> >>>> counter to reach 1, at which point it's the only thread with a request
> >>>> to process, so it can call do_destroy.  That at least would avoid adding
> >>>> a new user of pthread_cancel() into the mt loop code.
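> >>>>
> >>>> Untested sketch (the counter and where it lives are hypothetical):
> >>>>
> >>>> 	/* fuse_do_work(): count buffers from receipt to completion */
> >>>> 	res = fuse_session_receive_buf_internal(se, &fbuf, w);
> >>>> 	atomic_fetch_add(&mt->bufs_in_flight, 1);
> >>>> 	fuse_session_process_buf_internal(se, &fbuf, w);
> >>>> 	atomic_fetch_sub(&mt->bufs_in_flight, 1);
> >>>>
> >>>> 	/* FUSE_DESTROY path: wait until ours is the only request */
> >>>> 	while (atomic_load(&mt->bufs_in_flight) > 1)
> >>>> 		sched_yield();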
> >>>
> >>> I will read through the rest (too tired right now) during the weekend.
> >>> I was also thinking about a counter. And let's please also do this
> >>> right for io-uring too, i.e. all CQEs need to have been handled.
> >>> Without io-uring it would probably be a counter decremented in
> >>> fuse_free_req(); with io-uring it is a bit more complex.
> >>
> >> Oh right, the uring backend.
> >>
> >> Assuming that it's really true that the only requests pending during an
> >> unmount are going to be FUSE_RELEASE (nobody's actually said that's
> >> true) then it's *much* easier to count the number of open files in
> >> fuse_session and make _do_destroy in the lowlevel library wait until the
> >> open file count reaches zero.
> > 
> > FWIW I tried this out over the weekend with the patch below and the
> > wait_event() turned off in the kernel.  It seems to work (though I only
> > tried it cursorily with io-uring) so if the kernel fuse developers would
> > rather not have a wait_event() in the unmount path then I suppose this
> > is a way to move ahead with this topic.
> > 
> > --D
> > 
> > From: Darrick J. Wong <djwong@...nel.org>
> > Subject: [PATCH] libfuse: wait in do_destroy until all open files are closed
> > 
> > Joanne suggests that libfuse should defer a FUSE_DESTROY request until
> > all FUSE_RELEASEs have completed.  Let's see if that works by tracking
> > the count of open files.
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@...nel.org>
> > ---
> >  lib/fuse_i.h        |    4 ++++
> >  lib/fuse_lowlevel.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 55 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/fuse_i.h b/lib/fuse_i.h
> > index 0ce2c0134ed879..dfe9d9f067498e 100644
> > --- a/lib/fuse_i.h
> > +++ b/lib/fuse_i.h
> > @@ -117,6 +117,10 @@ struct fuse_session {
> >  	 */
> >  	uint32_t conn_want;
> >  	uint64_t conn_want_ext;
> > +
> > +	/* destroy has to wait for all the open files to go away */
> > +	pthread_cond_t zero_open_files;
> > +	uint64_t open_files;
> >  };
> >  
> >  struct fuse_chan {
> > diff --git a/lib/fuse_lowlevel.c b/lib/fuse_lowlevel.c
> > index 12724ed66bdcc8..f12c6db0eb0e60 100644
> > --- a/lib/fuse_lowlevel.c
> > +++ b/lib/fuse_lowlevel.c
> > @@ -52,6 +52,30 @@
> >  #define PARAM(inarg) (((char *)(inarg)) + sizeof(*(inarg)))
> >  #define OFFSET_MAX 0x7fffffffffffffffLL
> >  
> > +static inline void inc_open_files(struct fuse_session *se)
> > +{
> > +	pthread_mutex_lock(&se->lock);
> > +	se->open_files++;
> > +	pthread_mutex_unlock(&se->lock);
> > +}
> > +
> > +static inline void dec_open_files(struct fuse_session *se)
> > +{
> > +	pthread_mutex_lock(&se->lock);
> > +	se->open_files--;
> > +	if (!se->open_files)
> > +		pthread_cond_broadcast(&se->zero_open_files);
> > +	pthread_mutex_unlock(&se->lock);
> > +}
> 
> I think open files only decreases when destroy is received? Doesn't that
> give us the chance to use an atomic (C11) and then to broadcast only
> when FUSE_DESTROY is received? I.e. I would use an atomic for
> 'open_files', set something like 'se->destroy_received' and then trigger
> the broadcast only when that is set.

I'm sorry, but I'm not familiar enough with C11 atomics to know if that
would work.  I suppose it /has/ been 3 years since the kernel went from
C89 to C11 but some of us are still old dinosaurs who cut their teeth on
C back when the paint was still fresh on C90. ;)

I think it would be ok to define open_files as _Atomic and then do
something like:

	if (atomic_fetch_sub(&se->open_files, 1) == 1)
		pthread_cond_broadcast(&se->zero_open_files);

without needing to take se->lock.  I'm not sure how you'd handle races
between a thread setting destroy_received and dec_open_files testing
the flag, though.  Maybe it'd be easier to bias open_files upward by one
in the init method and downward by one in op_destroy so we'd never send
the broadcast until unmount time?
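
Untested stand-alone sketch of that bias trick (every name here is made
up for illustration, none of it is current libfuse code):

	#include <stdatomic.h>
	#include <pthread.h>
	#include <stdint.h>

	/* start at 1: op_init's bias reference */
	static _Atomic uint64_t open_files = 1;
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t zero_open_files = PTHREAD_COND_INITIALIZER;

	/* FUSE_OPEN/FUSE_CREATE succeeded */
	static void get_open_file(void)
	{
		atomic_fetch_add(&open_files, 1);
	}

	/* FUSE_RELEASE finished; also called once by op_destroy to
	 * drop the bias reference */
	static void put_open_file(void)
	{
		if (atomic_fetch_sub(&open_files, 1) == 1) {
			pthread_mutex_lock(&lock);
			pthread_cond_broadcast(&zero_open_files);
			pthread_mutex_unlock(&lock);
		}
	}

	/* op_destroy: after dropping the bias ref, wait for zero */
	static void wait_for_zero_open_files(void)
	{
		pthread_mutex_lock(&lock);
		while (atomic_load(&open_files) > 0)
			pthread_cond_wait(&zero_open_files, &lock);
		pthread_mutex_unlock(&lock);
	}

That way the counter can't hit zero (and thus nobody broadcasts) until
op_destroy drops its own reference, so no destroy_received flag is
needed and the race goes away.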

--D

> (A further optimization later, with io-uring, would be to make that
> counter per-queue.)
> 
> 
> Thanks,
> Bernd
> 
