[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3gpfceywinbzsbgslwsywzv4qqubab6gcftlzag6drhl5vhmb6@iupru3v7wsey>
Date: Wed, 17 Sep 2025 16:42:08 +0200
From: Jan Kara <jack@...e.cz>
To: Amir Goldstein <amir73il@...il.com>
Cc: Jan Kara <jack@...e.cz>, Jakub Acs <acsjakub@...zon.de>,
linux-unionfs@...r.kernel.org, Miklos Szeredi <miklos@...redi.hu>, linux-kernel@...r.kernel.org,
stable@...r.kernel.org, Christian Brauner <brauner@...nel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH] ovl: check before dereferencing s_root field
On Wed 17-09-25 13:07:45, Amir Goldstein wrote:
> On Wed, Sep 17, 2025 at 11:25 AM Jan Kara <jack@...e.cz> wrote:
> > On Tue 16-09-25 15:29:35, Amir Goldstein wrote:
> > > On Tue, Sep 16, 2025 at 1:30 PM Jan Kara <jack@...e.cz> wrote:
> > > >
> > > > On Mon 15-09-25 17:29:40, Amir Goldstein wrote:
> > > > > On Mon, Sep 15, 2025 at 4:07 PM Jan Kara <jack@...e.cz> wrote:
> > > > > > > > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
> > > > > > > > index 83f80fdb1567..424c73188e06 100644
> > > > > > > > --- a/fs/overlayfs/export.c
> > > > > > > > +++ b/fs/overlayfs/export.c
> > > > > > > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode)
> > > > > > > > if (!ovl_inode_lower(inode))
> > > > > > > > return 0;
> > > > > > > >
> > > > > > > > + if (!inode->i_sb->s_root)
> > > > > > > > + return -ENOENT;
> > > > > > >
> > > > > > > For a filesystem method to have to check that its own root is still alive sounds
> > > > > > > like the wrong way to me.
> > > > > > > That's one of the things that should be taken for granted by fs code.
> > > > > > >
> > > > > > > I don't think this is an overlayfs specific issue, because other fs would be
> > > > > > > happy if encode_fh() would be called with NULL sb->s_root.
> > > > > >
> > > > > > Actually, I don't see where that would blow up? Generally references to
> > > > > > sb->s_root in filesystems outside of mount / remount code are pretty rare.
> > > > > > Also most of the code should be unreachable by the time we set sb->s_root
> > > > > > to NULL because there are no open files at that moment, no exports etc. But
> > > > > > as this report shows, there are occasional surprises (I remember similar
> > > > > > issue with ext4 sysfs files handlers using s_root without checking couple
> > > > > > years back).
> > > > > >
> > > > >
> > > > > I am not sure that I understand what you are arguing for.
> > > > > I did a very naive grep s_root fs/*/export.c and quickly found:
> > > >
> > > > You're better with grep than me ;). I was grepping for '->s_root' as well
> > > > but all the hits I had looked into were related to mounting and similar and
> > > > eventually I got bored. Restricting the grep to export ops indeed shows
> > > > ceph, gfs2 and overlayfs are vulnerable to this kind of problem.
>
> As far as I can tell, ceph uses s_root only in decode_fh methods.
True. But ceph also uses d_find_alias() in ceph_encode_snapfh() which could
race with shrink_dcache_for_umount()->do_one_tree() and trigger:
WARN(1, "BUG: Dentry %p{i=%lx,n=%pd} "
" still in use (%d) [unmount of %s %s]\n",
> ovl and gfs2 only want to know for an inode if it is the root inode,
> they do not strictly need to dereference s_root for that purpose.
> (see patch below)
>
> > So there are not many cases where this can happen but enough that I'd say
> > that handling some events specially to avoid encoding fh on fs while it is
> > unmounted is fragile and prone to breaking again sooner or later.
> >
> > > How about skipping fsnotify_inoderemove() in case sb is in shutdown?
> >
> > Also how would you like to handle that in a race-free manner? We'd need to
> > hold s_umount for that which we cannot really afford in that context. But
> > maybe you have some better idea...
> >
>
> I was only thinking about this code path:
>
> generic_shutdown_super()
> shrink_dcache_for_umount()
> ...
> __dentry_kill()
> dentry_unlink_inode()
>
> This is supposed to be the last dput of all remaining dentries
> and I don't think a deferred unlink should be expected in that case.
I see.
> But I realize now that you mean delayed unlink from another context
> which races with shutdown.
Yes, I've meant that.
> > > > > > > Can we change the order of generic_shutdown_super() so that
> > > > > > > fsnotify_sb_delete(sb) is called before setting s_root to NULL?
> > > > > > >
> > > > > > > Or is there a better solution for this race?
> > > > > >
> > > > > > Regarding calling fsnotify_sb_delete() before setting s_root to NULL:
> > > > > > In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after
> > > > > > evict_inodes")) we've moved the call after evict_inodes() because otherwise
> > > > > > we were just wasting cycles scanning many inodes without watches. So moving
> > > > > > it earlier wouldn't be great...
> > > > >
> > > > > Yes, I noticed that and I figured there were subtleties.
> > > >
> > > > Right. After thinking more about it I think calling fsnotify_sb_delete()
> > > > earlier is the only practical choice we have (not clearing sb->s_root isn't
> > > > much of an option - we need to prune all dentries to quiesce the filesystem
> > > > and leaving s_root alive would create odd corner cases). But you don't want
> > > > to be iterating millions of inodes just to clear couple of marks so we'll
> > > > have to figure out something more clever there.
> > >
> > > I think we only need to suppress the fsnotify_inoderemove() call.
> > > It sounds doable and very local to fs/super.c.
> > >
> > > Regarding show_mark_fhandle() WDYT about my suggestion to
> > > guard it with super_trylock_shared()?
> >
> > Yes, super_trylock_shared() for that callsite looks like a fine solution
> > for that call site. Occasional random failures in encoding fh because the
> > trylock fails are unlikely to have any bad consequences there. But I think
> > we need to figure out other possibly racing call-sites as well first.
> >
>
> Might something naive as this be enough?
It looks like it should be good for the problems with gfs2 & overlayfs but
it doesn't solve the problem with ceph and as Jakub writes there's a question
whether we won't hit more problems later.
I'm sorry for poking holes into your solutions. The more I look into this
the more problems I find :-|
As I'm thinking about it I'm slowly leaning towards implementing a list of
connectors per sb (so that we can quickly reclaim on umount). It seems
stupid just for these corner cases but longer term we can also implement
what Dave once suggested [1] so that fsnotify doesn't need to pin inodes in
memory at all which should more that make up for the additional memory for
inode connector members.
Honza
[1] https://lore.kernel.org/linux-fsdevel/ZwXDzKGj6Bp28kYe@dread.disaster.area/
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 60046ae23d514..8c9d0d6bb0045 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
>
> if (root_inode) {
> res = d_alloc_anon(root_inode->i_sb);
> - if (res)
> + if (res) {
> + root_inode->i_opflags |= IOP_ROOT;
> d_instantiate(res, root_inode);
> - else
> + } else {
> iput(root_inode);
> + }
> }
> return res;
> }
> diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
> index 3334c394ce9cb..809a09c6a89e0 100644
> --- a/fs/gfs2/export.c
> +++ b/fs/gfs2/export.c
> @@ -46,7 +46,7 @@ static int gfs2_encode_fh(struct inode *inode, __u32
> *p, int *len,
> fh[3] = cpu_to_be32(ip->i_no_addr & 0xFFFFFFFF);
> *len = GFS2_SMALL_FH_SIZE;
>
> - if (!parent || inode == d_inode(sb->s_root))
> + if (!parent || is_root_inode(inode))
> return *len;
>
> ip = GFS2_I(parent);
> diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
> index 83f80fdb15674..7827c63354ad5 100644
> --- a/fs/overlayfs/export.c
> +++ b/fs/overlayfs/export.c
> @@ -199,7 +199,7 @@ static int ovl_check_encode_origin(struct inode *inode)
> * Root is never indexed, so if there's an upper layer, encode upper for
> * root.
> */
> - if (inode == d_inode(inode->i_sb->s_root))
> + if (is_root_inode(inode))
> return 0;
>
> /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ec867f112fd5f..ed84379aa06ca 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -665,6 +665,7 @@ is_uncached_acl(struct posix_acl *acl)
> #define IOP_DEFAULT_READLINK 0x0010
> #define IOP_MGTIME 0x0020
> #define IOP_CACHED_LINK 0x0040
> +#define IOP_ROOT 0x0080
> /*
> * Keep mostly read-only and often accessed (especially for
> @@ -2713,6 +2714,11 @@ static inline bool is_mgtime(const struct inode *inode)
> return inode->i_opflags & IOP_MGTIME;
> }
>
> +static inline bool is_root_inode(const struct inode *inode)
> +{
> + return inode->i_opflags & IOP_ROOT;
> +}
> +
> extern struct dentry *mount_bdev(struct file_system_type *fs_type,
> int flags, const char *dev_name, void *data,
> int (*fill_super)(struct super_block *, void *, int));
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists