[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxgSQPQ6Vx4MLECPPxn35m8--1iL7_rUFEobBuROfEzq_A@mail.gmail.com>
Date: Wed, 17 Sep 2025 13:07:45 +0200
From: Amir Goldstein <amir73il@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Jakub Acs <acsjakub@...zon.de>, linux-unionfs@...r.kernel.org,
Miklos Szeredi <miklos@...redi.hu>, linux-kernel@...r.kernel.org, stable@...r.kernel.org,
Christian Brauner <brauner@...nel.org>, linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH] ovl: check before dereferencing s_root field
On Wed, Sep 17, 2025 at 11:25 AM Jan Kara <jack@...e.cz> wrote:
>
> On Tue 16-09-25 15:29:35, Amir Goldstein wrote:
> > On Tue, Sep 16, 2025 at 1:30 PM Jan Kara <jack@...e.cz> wrote:
> > >
> > > On Mon 15-09-25 17:29:40, Amir Goldstein wrote:
> > > > On Mon, Sep 15, 2025 at 4:07 PM Jan Kara <jack@...e.cz> wrote:
> > > > > > > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
> > > > > > > index 83f80fdb1567..424c73188e06 100644
> > > > > > > --- a/fs/overlayfs/export.c
> > > > > > > +++ b/fs/overlayfs/export.c
> > > > > > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode)
> > > > > > > if (!ovl_inode_lower(inode))
> > > > > > > return 0;
> > > > > > >
> > > > > > > + if (!inode->i_sb->s_root)
> > > > > > > + return -ENOENT;
> > > > > >
> > > > > > For a filesystem method to have to check that its own root is still alive sounds
> > > > > > like the wrong way to me.
> > > > > > That's one of the things that should be taken for granted by fs code.
> > > > > >
> > > > > > I don't think this is an overlayfs specific issue, because other fs would be
> > > > > > happy if encode_fh() would be called with NULL sb->s_root.
> > > > >
> > > > > Actually, I don't see where that would blow up? Generally references to
> > > > > sb->s_root in filesystems outside of mount / remount code are pretty rare.
> > > > > Also most of the code should be unreachable by the time we set sb->s_root
> > > > > to NULL because there are no open files at that moment, no exports etc. But
> > > > > as this report shows, there are occasional surprises (I remember similar
> > > > > issue with ext4 sysfs files handlers using s_root without checking couple
> > > > > years back).
> > > > >
> > > >
> > > > I am not sure that I understand what you are arguing for.
> > > > I did a very naive grep s_root fs/*/export.c and quickly found:
> > >
> > > You're better with grep than me ;). I was grepping for '->s_root' as well
> > > but all the hits I had looked into were related to mounting and similar and
> > > eventually I got bored. Restricting the grep to export ops indeed shows
> > > ceph, gfs2 and overlayfs are vulnerable to this kind of problem.
As far as I can tell, ceph uses s_root only in decode_fh methods.
ovl and gfs2 only want to know for an inode if it is the root inode,
they do not strictly need to dereference s_root for that purpose.
(see patch below)
> > >
> > > > static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len,
> > > > struct inode *parent)
> > > > {
> > > > ...
> > > > if (!parent || inode == d_inode(sb->s_root))
> > > > return *len;
> > > >
> > > > So it's not an overlayfs specific issue, just so happens that zysbot
> > > > likes to test overlayfs.
> > > >
> > > > Are you suggesting that we fix all of those one by one?
> > >
> > > No. I agree we need to figure out a way to make sure export ops are not
> > > called on a filesystem being unmounted. Standard open_by_handle() or NFS
> > > export cannot race with generic_shutdown_super() (they hold the fs mounted)
> > > so fsnotify is a special case here.
> > >
> > > I actually wonder if fanotify event (e.g. from inode deletion postponed to
> > > some workqueue or whatever) cannot race with umount as well and cause the
> > > same problem...
> > >
> >
> > Oy. I was thinking that all event happen when holding some mnt ref
> > but yeh fsnotify_inoderemove() does look like it could be a problem
> > from sb shutdown context.
>
> Well, but there's also fun like fs/kernfs/file.c: kernfs_notify() which
> queues work which calls fsnotify for some inodes and, frankly, proper
> exclusion with umount seems non-existent there (but I can be missing
> something).
Ouch!
>
> Also we have fsnotify_sb_error() which can happen practically anytime
> before the fs gets fully shutdown in ->kill_sb() and may try to encode fh
> of an inode.
>
Bigger ouch because silencing this event is not an option.
> So there are not many cases where this can happen but enough that I'd say
> that handling some events specially to avoid encoding fh on fs while it is
> unmounted is fragile and prone to breaking again sooner or later.
>
> > How about skipping fsnotify_inoderemove() in case sb is in shutdown?
>
> Also how would you like to handle that in a race-free manner? We'd need to
> hold s_umount for that which we cannot really afford in that context. But
> maybe you have some better idea...
>
I was only thinking about this code path:
generic_shutdown_super()
shrink_dcache_for_umount()
...
__dentry_kill()
dentry_unlink_inode()
This is supposed to be the last dput of all remaining dentries
and I don't think a deferred unlink should be expected in that case.
But I realize now that you mean delayed unlink from another context
which races with shutdown.
> > > > > > Can we change the order of generic_shutdown_super() so that
> > > > > > fsnotify_sb_delete(sb) is called before setting s_root to NULL?
> > > > > >
> > > > > > Or is there a better solution for this race?
> > > > >
> > > > > Regarding calling fsnotify_sb_delete() before setting s_root to NULL:
> > > > > In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after
> > > > > evict_inodes")) we've moved the call after evict_inodes() because otherwise
> > > > > we were just wasting cycles scanning many inodes without watches. So moving
> > > > > it earlier wouldn't be great...
> > > >
> > > > Yes, I noticed that and I figured there were subtleties.
> > >
> > > Right. After thinking more about it I think calling fsnotify_sb_delete()
> > > earlier is the only practical choice we have (not clearing sb->s_root isn't
> > > much of an option - we need to prune all dentries to quiesce the filesystem
> > > and leaving s_root alive would create odd corner cases). But you don't want
> > > to be iterating millions of inodes just to clear couple of marks so we'll
> > > have to figure out something more clever there.
> >
> > I think we only need to suppress the fsnotify_inoderemove() call.
> > It sounds doable and very local to fs/super.c.
> >
> > Regarding show_mark_fhandle() WDYT about my suggestion to
> > guard it with super_trylock_shared()?
>
> Yes, super_trylock_shared() for that callsite looks like a fine solution
> for that call site. Occasional random failures in encoding fh because the
> trylock fails are unlikely to have any bad consequences there. But I think
> we need to figure out other possibly racing call-sites as well first.
>
Might something naive as this be enough?
Thanks,
Amir.
diff --git a/fs/dcache.c b/fs/dcache.c
index 60046ae23d514..8c9d0d6bb0045 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
if (root_inode) {
res = d_alloc_anon(root_inode->i_sb);
- if (res)
+ if (res) {
+ root_inode->i_opflags |= IOP_ROOT;
d_instantiate(res, root_inode);
- else
+ } else {
iput(root_inode);
+ }
}
return res;
}
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index 3334c394ce9cb..809a09c6a89e0 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -46,7 +46,7 @@ static int gfs2_encode_fh(struct inode *inode, __u32
*p, int *len,
fh[3] = cpu_to_be32(ip->i_no_addr & 0xFFFFFFFF);
*len = GFS2_SMALL_FH_SIZE;
- if (!parent || inode == d_inode(sb->s_root))
+ if (!parent || is_root_inode(inode))
return *len;
ip = GFS2_I(parent);
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index 83f80fdb15674..7827c63354ad5 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -199,7 +199,7 @@ static int ovl_check_encode_origin(struct inode *inode)
* Root is never indexed, so if there's an upper layer, encode upper for
* root.
*/
- if (inode == d_inode(inode->i_sb->s_root))
+ if (is_root_inode(inode))
return 0;
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec867f112fd5f..ed84379aa06ca 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -665,6 +665,7 @@ is_uncached_acl(struct posix_acl *acl)
#define IOP_DEFAULT_READLINK 0x0010
#define IOP_MGTIME 0x0020
#define IOP_CACHED_LINK 0x0040
+#define IOP_ROOT 0x0080
/*
* Keep mostly read-only and often accessed (especially for
@@ -2713,6 +2714,11 @@ static inline bool is_mgtime(const struct inode *inode)
return inode->i_opflags & IOP_MGTIME;
}
+static inline bool is_root_inode(const struct inode *inode)
+{
+ return inode->i_opflags & IOP_ROOT;
+}
+
extern struct dentry *mount_bdev(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data,
int (*fill_super)(struct super_block *, void *, int));
Powered by blists - more mailing lists