lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <31f7822e84583235d84b8c7be24360c46c7450f7.camel@hammerspace.com>
Date:   Mon, 21 Mar 2022 22:50:11 +0000
From:   Trond Myklebust <trondmy@...merspace.com>
To:     "jack@...e.cz" <jack@...e.cz>,
        "amir73il@...il.com" <amir73il@...il.com>
CC:     "bfields@...ldses.org" <bfields@...ldses.org>,
        "khazhy@...gle.com" <khazhy@...gle.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "jlayton@...nel.org" <jlayton@...nel.org>,
        "chuck.lever@...cle.com" <chuck.lever@...cle.com>
Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify

On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote:
> On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust
> > <trondmy@...merspace.com> wrote:
> > > 
> > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > > result
> > > > in recursing back into nfsd, resulting in deadlock. See below
> > > > stack.
> > > > 
> > > > nfsd            D    0 1591536      2 0x80004080
> > > > Call Trace:
> > > >  __schedule+0x497/0x630
> > > >  schedule+0x67/0x90
> > > >  schedule_preempt_disabled+0xe/0x10
> > > >  __mutex_lock+0x347/0x4b0
> > > >  fsnotify_destroy_mark+0x22/0xa0
> > > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > >  do_shrink_slab+0x1f2/0x330
> > > >  shrink_slab+0x244/0x2f0
> > > >  shrink_node+0xd7/0x490
> > > >  do_try_to_free_pages+0x12f/0x3b0
> > > >  try_to_free_pages+0x43f/0x540
> > > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > > >  __alloc_pages_nodemask+0x274/0x2c0
> > > >  alloc_slab_page+0x32/0x2e0
> > > >  new_slab+0xa6/0x8b0
> > > >  ___slab_alloc+0x34b/0x520
> > > >  kmem_cache_alloc+0x1c4/0x250
> > > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > > >  fsnotify_add_mark+0x48/0x70
> > > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > > >  svc_process+0xf9/0x110 [sunrpc]
> > > >  nfsd+0x10e/0x180 [nfsd]
> > > >  kthread+0x130/0x140
> > > >  ret_from_fork+0x35/0x40
> > > > 
> > > > Signed-off-by: Khazhismel Kumykov <khazhy@...gle.com>
> > > > ---
> > > >  fs/nfsd/filecache.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > > 
> > > > Marking this RFC since I haven't actually had a chance to test
> > > > this,
> > > > we
> > > > we're seeing this deadlock for some customers.
> > > > 
> > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > index fdf89fcf1a0c..a14760f9b486 100644
> > > > --- a/fs/nfsd/filecache.c
> > > > +++ b/fs/nfsd/filecache.c
> > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct
> > > > nfsd_file
> > > > *nf)
> > > >         struct fsnotify_mark    *mark;
> > > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > > >         struct inode *inode = nf->nf_inode;
> > > > +       unsigned int pflags;
> > > > 
> > > >         do {
> > > >                 mutex_lock(&nfsd_file_fsnotify_group-
> > > > >mark_mutex);
> > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct
> > > > nfsd_file
> > > > *nf)
> > > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > > >                 refcount_set(&new->nfm_ref, 1);
> > > > 
> > > > +               /* fsnotify allocates, avoid recursion back
> > > > into nfsd
> > > > */
> > > > +               pflags = memalloc_nofs_save();
> > > >                 err = fsnotify_add_inode_mark(&new->nfm_mark,
> > > > inode,
> > > > 0);
> > > > +               memalloc_nofs_restore(pflags);
> > > > 
> > > >                 /*
> > > >                  * If the add was successful, then return the
> > > > object.
> > > 
> > > Isn't that stack trace showing a slab direct reclaim, and not a
> > > filesystem writeback situation?
> > > 
> > > Does memalloc_nofs_save()/restore() really fix this problem? It
> > > seems
> > > to me that it cannot, particularly since knfsd is not a
> > > filesystem, and
> > > so does not ever handle writeback of dirty pages.
> > > 
> > 
> > Maybe NOFS throttles direct reclaims to the point that the problem
> > is
> > harder to hit?
> > 
> > This report came in at good timing for me.
> > 
> > It demonstrates an issue I did not predict for "volatile"' fanotify
> > marks [1].
> > As far as I can tell, nfsd filecache is currently the only fsnotify
> > backend that
> > frees fsnotify marks in memory shrinker. "volatile" fanotify marks
> > would also
> > be evictable in that way, so they would expose fanotify to this
> > deadlock.
> > 
> > For the short term, maybe nfsd filecache can avoid the problem by
> > checking
> > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort
> > the
> > shrinker. I wonder if there is a place for a helper
> > mutex_is_locked_by_me()?
> > 
> > Jan,
> > 
> > A relatively simple fix would be to allocate
> > fsnotify_mark_connector in
> > fsnotify_add_mark() and free it, if a connector already exists for
> > the object.
> > I don't think there is a good reason to optimize away this
> > allocation
> > for the case of a non-first group to set a mark on an object?
> 
> Indeed, nasty. Volatile marks will add group->mark_mutex into a set
> of
> locks grabbed during inode slab reclaim. So any allocation under
> group->mark_mutex has to be GFP_NOFS now. This is not just about
> connector
> allocations but also mark allocations for fanotify. Moving
> allocations from
> under mark_mutex is also possible solution but passing preallocated
> memory
> around is kind of ugly as well. So the cleanest solution I currently
> see is
> to come up with helpers like "fsnotify_lock_group() &
> fsnotify_unlock_group()" which will lock/unlock mark_mutex and also
> do
> memalloc_nofs_save / restore magic. 

As has already been reported, the problem was fixed in Linux 5.5 by the
garbage collector rewrite, and so this is no longer an issue.

In addition, please note that memalloc_nofs_save/restore and the use of
GFP_NOFS was never a solution, because it does not prevent the kind of
direct reclaim that was happening here. You'd have to enforce
GFP_NOWAIT allocations, afaics.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@...merspace.com


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ