linux-kernel - Re: [PATCH] ovl: check before dereferencing s

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxgSQPQ6Vx4MLECPPxn35m8--1iL7_rUFEobBuROfEzq_A@mail.gmail.com>
Date: Wed, 17 Sep 2025 13:07:45 +0200
From: Amir Goldstein <amir73il@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Jakub Acs <acsjakub@...zon.de>, linux-unionfs@...r.kernel.org, 
	Miklos Szeredi <miklos@...redi.hu>, linux-kernel@...r.kernel.org, stable@...r.kernel.org, 
	Christian Brauner <brauner@...nel.org>, linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH] ovl: check before dereferencing s_root field

On Wed, Sep 17, 2025 at 11:25 AM Jan Kara <jack@...e.cz> wrote:
>
> On Tue 16-09-25 15:29:35, Amir Goldstein wrote:
> > On Tue, Sep 16, 2025 at 1:30 PM Jan Kara <jack@...e.cz> wrote:
> > >
> > > On Mon 15-09-25 17:29:40, Amir Goldstein wrote:
> > > > On Mon, Sep 15, 2025 at 4:07 PM Jan Kara <jack@...e.cz> wrote:
> > > > > > > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
> > > > > > > index 83f80fdb1567..424c73188e06 100644
> > > > > > > --- a/fs/overlayfs/export.c
> > > > > > > +++ b/fs/overlayfs/export.c
> > > > > > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode)
> > > > > > >         if (!ovl_inode_lower(inode))
> > > > > > >                 return 0;
> > > > > > >
> > > > > > > +       if (!inode->i_sb->s_root)
> > > > > > > +               return -ENOENT;
> > > > > >
> > > > > > For a filesystem method to have to check that its own root is still alive sounds
> > > > > > like the wrong way to me.
> > > > > > That's one of the things that should be taken for granted by fs code.
> > > > > >
> > > > > > I don't think this is an overlayfs specific issue, because other fs would be
> > > > > > happy if encode_fh() would be called with NULL sb->s_root.
> > > > >
> > > > > Actually, I don't see where that would blow up? Generally references to
> > > > > sb->s_root in filesystems outside of mount / remount code are pretty rare.
> > > > > Also most of the code should be unreachable by the time we set sb->s_root
> > > > > to NULL because there are no open files at that moment, no exports etc. But
> > > > > as this report shows, there are occasional surprises (I remember similar
> > > > > issue with ext4 sysfs files handlers using s_root without checking couple
> > > > > years back).
> > > > >
> > > >
> > > > I am not sure that I understand what you are arguing for.
> > > > I did a very naive grep s_root fs/*/export.c and quickly found:
> > >
> > > You're better with grep than me ;). I was grepping for '->s_root' as well
> > > but all the hits I had looked into were related to mounting and similar and
> > > eventually I got bored. Restricting the grep to export ops indeed shows
> > > ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

As far as I can tell, ceph uses s_root only in decode_fh methods.

ovl and gfs2 only want to know for an inode if it is the root inode,
they do not strictly need to dereference s_root for that purpose.
(see patch below)

> > >
> > > > static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len,
> > > >                           struct inode *parent)
> > > > {
> > > > ...
> > > >         if (!parent || inode == d_inode(sb->s_root))
> > > >                 return *len;
> > > >
> > > > So it's not an overlayfs specific issue, just so happens that zysbot
> > > > likes to test overlayfs.
> > > >
> > > > Are you suggesting that we fix all of those one by one?
> > >
> > > No. I agree we need to figure out a way to make sure export ops are not
> > > called on a filesystem being unmounted. Standard open_by_handle() or NFS
> > > export cannot race with generic_shutdown_super() (they hold the fs mounted)
> > > so fsnotify is a special case here.
> > >
> > > I actually wonder if fanotify event (e.g. from inode deletion postponed to
> > > some workqueue or whatever) cannot race with umount as well and cause the
> > > same problem...
> > >
> >
> > Oy. I was thinking that all event happen when holding some mnt ref
> > but yeh fsnotify_inoderemove() does look like it could be a problem
> > from sb shutdown context.
>
> Well, but there's also fun like fs/kernfs/file.c: kernfs_notify() which
> queues work which calls fsnotify for some inodes and, frankly, proper
> exclusion with umount seems non-existent there (but I can be missing
> something).

Ouch!

>
> Also we have fsnotify_sb_error() which can happen practically anytime
> before the fs gets fully shutdown in ->kill_sb() and may try to encode fh
> of an inode.
>

Bigger ouch because silencing this event is not an option.

> So there are not many cases where this can happen but enough that I'd say
> that handling some events specially to avoid encoding fh on fs while it is
> unmounted is fragile and prone to breaking again sooner or later.
>
> > How about skipping fsnotify_inoderemove() in case sb is in shutdown?
>
> Also how would you like to handle that in a race-free manner? We'd need to
> hold s_umount for that which we cannot really afford in that context. But
> maybe you have some better idea...
>

I was only thinking about this code path:

generic_shutdown_super()
  shrink_dcache_for_umount()
    ...
      __dentry_kill()
        dentry_unlink_inode()

This is supposed to be the last dput of all remaining dentries
and I don't think a deferred unlink should be expected in that case.

But I realize now that you mean delayed unlink from another context
which races with shutdown.

> > > > > > Can we change the order of generic_shutdown_super() so that
> > > > > > fsnotify_sb_delete(sb) is called before setting s_root to NULL?
> > > > > >
> > > > > > Or is there a better solution for this race?
> > > > >
> > > > > Regarding calling fsnotify_sb_delete() before setting s_root to NULL:
> > > > > In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after
> > > > > evict_inodes")) we've moved the call after evict_inodes() because otherwise
> > > > > we were just wasting cycles scanning many inodes without watches. So moving
> > > > > it earlier wouldn't be great...
> > > >
> > > > Yes, I noticed that and I figured there were subtleties.
> > >
> > > Right. After thinking more about it I think calling fsnotify_sb_delete()
> > > earlier is the only practical choice we have (not clearing sb->s_root isn't
> > > much of an option - we need to prune all dentries to quiesce the filesystem
> > > and leaving s_root alive would create odd corner cases). But you don't want
> > > to be iterating millions of inodes just to clear couple of marks so we'll
> > > have to figure out something more clever there.
> >
> > I think we only need to suppress the fsnotify_inoderemove() call.
> > It sounds doable and very local to fs/super.c.
> >
> > Regarding show_mark_fhandle() WDYT about my suggestion to
> > guard it with super_trylock_shared()?
>
> Yes, super_trylock_shared() for that callsite looks like a fine solution
> for that call site. Occasional random failures in encoding fh because the
> trylock fails are unlikely to have any bad consequences there. But I think
> we need to figure out other possibly racing call-sites as well first.
>

Might something naive as this be enough?

Thanks,
Amir.

diff --git a/fs/dcache.c b/fs/dcache.c
index 60046ae23d514..8c9d0d6bb0045 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)

        if (root_inode) {
                res = d_alloc_anon(root_inode->i_sb);
-               if (res)
+               if (res) {
+                       root_inode->i_opflags |= IOP_ROOT;
                        d_instantiate(res, root_inode);
-               else
+               } else {
                        iput(root_inode);
+               }
        }
        return res;
 }
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index 3334c394ce9cb..809a09c6a89e0 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -46,7 +46,7 @@ static int gfs2_encode_fh(struct inode *inode, __u32
*p, int *len,
        fh[3] = cpu_to_be32(ip->i_no_addr & 0xFFFFFFFF);
        *len = GFS2_SMALL_FH_SIZE;

-       if (!parent || inode == d_inode(sb->s_root))
+       if (!parent || is_root_inode(inode))
                return *len;

        ip = GFS2_I(parent);
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
index 83f80fdb15674..7827c63354ad5 100644
--- a/fs/overlayfs/export.c
+++ b/fs/overlayfs/export.c
@@ -199,7 +199,7 @@ static int ovl_check_encode_origin(struct inode *inode)
         * Root is never indexed, so if there's an upper layer, encode upper for
         * root.
         */
-       if (inode == d_inode(inode->i_sb->s_root))
+       if (is_root_inode(inode))
                return 0;

        /*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec867f112fd5f..ed84379aa06ca 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -665,6 +665,7 @@ is_uncached_acl(struct posix_acl *acl)
 #define IOP_DEFAULT_READLINK   0x0010
 #define IOP_MGTIME     0x0020
 #define IOP_CACHED_LINK        0x0040
+#define IOP_ROOT       0x0080
  /*
  * Keep mostly read-only and often accessed (especially for
@@ -2713,6 +2714,11 @@ static inline bool is_mgtime(const struct inode *inode)
        return inode->i_opflags & IOP_MGTIME;
 }

+static inline bool is_root_inode(const struct inode *inode)
+{
+       return inode->i_opflags & IOP_ROOT;
+}
+
 extern struct dentry *mount_bdev(struct file_system_type *fs_type,
        int flags, const char *dev_name, void *data,
        int (*fill_super)(struct super_block *, void *, int));