linux-kernel - Re: [PATCH 1/3] nfsd: use __fput_sync() to avoid delayed closing of files.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <170233229855.12910.12943965536699322442@noble.neil.brown.name>
Date:   Tue, 12 Dec 2023 09:04:58 +1100
From:   "NeilBrown" <neilb@...e.de>
To:     "Chuck Lever" <chuck.lever@...cle.com>
Cc:     "Al Viro" <viro@...iv.linux.org.uk>,
        "Christian Brauner" <brauner@...nel.org>,
        "Jens Axboe" <axboe@...nel.dk>, "Oleg Nesterov" <oleg@...hat.com>,
        "Jeff Layton" <jlayton@...nel.org>, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-nfs@...r.kernel.org
Subject: Re: [PATCH 1/3] nfsd: use __fput_sync() to avoid delayed closing of files.

On Tue, 12 Dec 2023, Chuck Lever wrote:
> On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> > On Sat, 09 Dec 2023, Chuck Lever wrote:
> > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > > Calling fput() directly or though filp_close() from a kernel thread like
> > > > nfsd causes the final __fput() (if necessary) to be called from a
> > > > workqueue.  This means that nfsd is not forced to wait for any work to
> > > > complete.  If the ->release of ->destroy_inode function is slow for any
> > > > reason, this can result in nfsd closing files more quickly than the
> > > > workqueue can complete the close and the queue of pending closes can
> > > > grow without bounces (30 million has been seen at one customer site,
> > > > though this was in part due to a slowness in xfs which has since been
> > > > fixed).
> > > > 
> > > > nfsd does not need this.
> > > 
> > > That is technically true, but IIUC, there is only one case where a
> > > synchronous close matters for the backlog problem, and that's when
> > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > > call sites (except rename) are error paths, so there aren't negative
> > > consequences for the lack of synchronous wait there...
> > 
> > What you say is technically true but it isn't the way I see it.
> > 
> > Firstly I should clarify that __fput_sync() is *not* a flushing close as
> > you describe it below.
> > All it does, apart for some trivial book-keeping, is to call ->release
> > and possibly ->destroy_inode immediately rather than shunting them off
> > to another thread.
> > Apparently ->release sometimes does something that can deadlock with
> > some kernel threads or if some awkward locks are held, so the whole
> > final __fput is delay by default.  But this does not apply to nfsd.
> > Standard fput() is really the wrong interface for nfsd to use.  
> > It should use __fput_sync() (which shouldn't have such a scary name).
> > 
> > The comment above flush_delayed_fput() seems to suggest that unmounting
> > is a core issue.  Maybe the fact that __fput() can call
> > dissolve_on_fput() is a reason why it is sometimes safer to leave the
> > work to later.  But I don't see that applying to nfsd.
> > 
> > Of course a ->release function *could* do synchronous writes just like
> > the XFS ->destroy_inode function used to do synchronous reads.
> 
> I had assumed ->release for NFS re-export would flush due to close-
> to-open semantics. There seem to be numerous corner cases that
> might result in pile-ups which would change the situation in your
> problem statement but might not result in an overall improvement.

That's the ->flush call in filp_close().

> 
> 
> > I don't think we should ever try to hide that by putting it in
> > a workqueue.  It's probably a bug and it is best if bugs are visible.
> 
> 
> I'm not objecting, per se, to this change. I would simply like to
> see a little more due diligence before moving forward until it is
> clear how frequently ->release or ->destroy_inode will do I/O (or
> "is slow for any reason" as you say above).
> 
> 
> > Note that the XFS ->release function does call filemap_flush() in some
> > cases, but that is an async flush, so __fput_sync doesn't wait for the
> > flush to complete.
> 
> When Jeff was working on the file cache a year ago, I did some
> performance analysis that shows even an async flush is costly when
> there is a lot of dirty data in the file being closed. The VFS walks
> through the whole file and starts I/O on every dirty page. This is
> quite CPU intensive, and can take on the order of a millisecond
> before the async flush request returns to its caller.
> 
> IME async flushes are not free.

True, they aren't free.  But some thread has to pay that price.
I think nfsd should.

You might argue that nfsd should wait to pay the price until after it
has sent a reply to the client.  My patches already effectively do that
for garbage-collected files.  Doing it for all files would probably be
easy. But is it really worth the (small) complexity?  I don't know.

> 
> 
> > The way I see this patch is that fput() is the wrong interface for nfsd
> > to use, __fput_sync is the right interface.  So we should change.  1
> > patch.
> 
> The practical matter is I see this as a change with a greater than
> zero risk, and we need to mitigate that risk. Or rather, as a
> maintainer of NFSD, /I/ need to see that the risk is as minimal as
> is practical.
> 
> 
> > The details about exhausting memory explain a particular symptom that
> > motivated the examination which revealed that nfsd was using the wrong
> > interface.
> > 
> > If we have nfsd sometimes using fput() and sometimes __fput_sync, then
> > we need to have clear rules for when to use which.  It is much easier to
> > have a simple rule: always use __fput_sync().
> 
> I don't agree that we should just flop all these over and hope for
> the best. In particular:
> 
>  - the changes in fs/nfsd/filecache.c appear to revert a bug
>    fix, so I need to see data that shows that change doesn't
>    cause a re-regression

The bug fix you refer to is
  "nfsd: don't fsync nfsd_files on last close"
The patch doesn't change when fsync (or ->flush) is called, so
it doesn't revert this bugfix.

> 
>  - the changes in fs/lockd/ can result in long waits while a
>    global mutex is held (global as in all namespaces and all
>    locked files on the server), so I need to see data that
>    demonstrates there won't be a regression

It's probably impossible to provide any such data.
The patch certainly moves work inside that mutex and so would increase
the hold time, if only slightly.  Is that lock hot enough to notice?
Conventional wisdom is that locking is only a tiny fraction of NFS
traffic.  It might be possible to construct a workload that saturates
lockd, but I doubt it would be relevant to the real world.

Maybe we should just break up that lock so that the problem becomes moot.

> 
>  - the other changes don't appear to have motivation in terms
>    of performance or behavior, and carry similar (if lesser)
>    risks as the other two changes. My preferred solution to
>    potential auditor confusion about the use of __fput_sync()
>    in some places and fput() in others is to document, and
>    leave call sites alone if there's no technical reason to
>    change them at this time.

Sounds to me like a good way to grow technical debt, but I'll do it like
that if you prefer.

> 
> There is enough of a risk of regression that I need to see a clear
> rationale for each hunk /and/ I need to see data that there is
> no regression. I know that won't be perfect coverage, but it's
> better than not having any data at all.
> 
> 
> > I'm certainly happy to revise function documentation and provide
> > wrapper functions if needed.
> > 
> > It might be good to have
> > 
> >   void filp_close_sync(struct file *f)
> >   {
> >        get_file(f);
> >        filp_close(f);
> >        __fput_sync(f);
> >   }
> > 
> > but as that would only be called once, it was hard to motivate.
> > Having it in linux/fs.h would be nice.
> > 
> > Similarly we could wrap __fput_sync() in a more friendly name, but
> > that would be better if we actually renamed the function.
> > 
> >   void fput_now(struct file *f)
> >   {
> >       __fput_sync(f);
> >   }
> > 
> > ??
> 
> Since this is an issue strictly for nfsd, the place for this
> utility function is in fs/nfsd/vfs.c, IMO, along with a documenting
> comment that provides a rationale for why nfsd does not want plain
> fput() in specific cases.
> 
> When other subsystems need a similar capability, then let's
> consider a common helper.

fs/smb/server/ probably would benefit from the same helper today.

Thanks,
NeilBrown


> 
> 
> -- 
> Chuck Lever
>