linux-kernel - Re: [PATCH] ovl: check before dereferencing s

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxg6w1JDE9ERChC80kkGsTSTx4rAj5b_ro7tNKmpQ29osA@mail.gmail.com>
Date: Wed, 17 Sep 2025 19:51:25 +0200
From: Amir Goldstein <amir73il@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Jakub Acs <acsjakub@...zon.de>, linux-unionfs@...r.kernel.org, 
	Miklos Szeredi <miklos@...redi.hu>, linux-kernel@...r.kernel.org, stable@...r.kernel.org, 
	Christian Brauner <brauner@...nel.org>, linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH] ovl: check before dereferencing s_root field

On Wed, Sep 17, 2025 at 4:42 PM Jan Kara <jack@...e.cz> wrote:
>
> On Wed 17-09-25 13:07:45, Amir Goldstein wrote:
> > On Wed, Sep 17, 2025 at 11:25 AM Jan Kara <jack@...e.cz> wrote:
> > > On Tue 16-09-25 15:29:35, Amir Goldstein wrote:
> > > > On Tue, Sep 16, 2025 at 1:30 PM Jan Kara <jack@...e.cz> wrote:
> > > > >
> > > > > On Mon 15-09-25 17:29:40, Amir Goldstein wrote:
> > > > > > On Mon, Sep 15, 2025 at 4:07 PM Jan Kara <jack@...e.cz> wrote:
> > > > > > > > > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c
> > > > > > > > > index 83f80fdb1567..424c73188e06 100644
> > > > > > > > > --- a/fs/overlayfs/export.c
> > > > > > > > > +++ b/fs/overlayfs/export.c
> > > > > > > > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode)
> > > > > > > > >         if (!ovl_inode_lower(inode))
> > > > > > > > >                 return 0;
> > > > > > > > >
> > > > > > > > > +       if (!inode->i_sb->s_root)
> > > > > > > > > +               return -ENOENT;
> > > > > > > >
> > > > > > > > For a filesystem method to have to check that its own root is still alive sounds
> > > > > > > > like the wrong way to me.
> > > > > > > > That's one of the things that should be taken for granted by fs code.
> > > > > > > >
> > > > > > > > I don't think this is an overlayfs specific issue, because other fs would be
> > > > > > > > happy if encode_fh() would be called with NULL sb->s_root.
> > > > > > >
> > > > > > > Actually, I don't see where that would blow up? Generally references to
> > > > > > > sb->s_root in filesystems outside of mount / remount code are pretty rare.
> > > > > > > Also most of the code should be unreachable by the time we set sb->s_root
> > > > > > > to NULL because there are no open files at that moment, no exports etc. But
> > > > > > > as this report shows, there are occasional surprises (I remember similar
> > > > > > > issue with ext4 sysfs files handlers using s_root without checking couple
> > > > > > > years back).
> > > > > > >
> > > > > >
> > > > > > I am not sure that I understand what you are arguing for.
> > > > > > I did a very naive grep s_root fs/*/export.c and quickly found:
> > > > >
> > > > > You're better with grep than me ;). I was grepping for '->s_root' as well
> > > > > but all the hits I had looked into were related to mounting and similar and
> > > > > eventually I got bored. Restricting the grep to export ops indeed shows
> > > > > ceph, gfs2 and overlayfs are vulnerable to this kind of problem.
> >
> > As far as I can tell, ceph uses s_root only in decode_fh methods.
>
> True. But ceph also uses d_find_alias() in ceph_encode_snapfh() which could
> race with shrink_dcache_for_umount()->do_one_tree() and trigger:
>
>         WARN(1, "BUG: Dentry %p{i=%lx,n=%pd} "
>                         " still in use (%d) [unmount of %s %s]\n",
>
> > ovl and gfs2 only want to know for an inode if it is the root inode,
> > they do not strictly need to dereference s_root for that purpose.
> > (see patch below)
> >
> > > So there are not many cases where this can happen but enough that I'd say
> > > that handling some events specially to avoid encoding fh on fs while it is
> > > unmounted is fragile and prone to breaking again sooner or later.
> > >
> > > > How about skipping fsnotify_inoderemove() in case sb is in shutdown?
> > >
> > > Also how would you like to handle that in a race-free manner? We'd need to
> > > hold s_umount for that which we cannot really afford in that context. But
> > > maybe you have some better idea...
> > >
> >
> > I was only thinking about this code path:
> >
> > generic_shutdown_super()
> >   shrink_dcache_for_umount()
> >     ...
> >       __dentry_kill()
> >         dentry_unlink_inode()
> >
> > This is supposed to be the last dput of all remaining dentries
> > and I don't think a deferred unlink should be expected in that case.
>
> I see.
>
> > But I realize now that you mean delayed unlink from another context
> > which races with shutdown.
>
> Yes, I've meant that.
>
> > > > > > > > Can we change the order of generic_shutdown_super() so that
> > > > > > > > fsnotify_sb_delete(sb) is called before setting s_root to NULL?
> > > > > > > >
> > > > > > > > Or is there a better solution for this race?
> > > > > > >
> > > > > > > Regarding calling fsnotify_sb_delete() before setting s_root to NULL:
> > > > > > > In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after
> > > > > > > evict_inodes")) we've moved the call after evict_inodes() because otherwise
> > > > > > > we were just wasting cycles scanning many inodes without watches. So moving
> > > > > > > it earlier wouldn't be great...
> > > > > >
> > > > > > Yes, I noticed that and I figured there were subtleties.
> > > > >
> > > > > Right. After thinking more about it I think calling fsnotify_sb_delete()
> > > > > earlier is the only practical choice we have (not clearing sb->s_root isn't
> > > > > much of an option - we need to prune all dentries to quiesce the filesystem
> > > > > and leaving s_root alive would create odd corner cases). But you don't want
> > > > > to be iterating millions of inodes just to clear couple of marks so we'll
> > > > > have to figure out something more clever there.
> > > >
> > > > I think we only need to suppress the fsnotify_inoderemove() call.
> > > > It sounds doable and very local to fs/super.c.
> > > >
> > > > Regarding show_mark_fhandle() WDYT about my suggestion to
> > > > guard it with super_trylock_shared()?
> > >
> > > Yes, super_trylock_shared() for that callsite looks like a fine solution
> > > for that call site. Occasional random failures in encoding fh because the
> > > trylock fails are unlikely to have any bad consequences there. But I think
> > > we need to figure out other possibly racing call-sites as well first.
> > >
> >
> > Might something naive as this be enough?
>
> It looks like it should be good for the problems with gfs2 & overlayfs but
> it doesn't solve the problem with ceph and as Jakub writes there's a question
> whether we won't hit more problems later.
>
> I'm sorry for poking holes into your solutions. The more I look into this
> the more problems I find :-|
>

On the contrary, Thank you for shooting down my bad ideas ;)

> As I'm thinking about it I'm slowly leaning towards implementing a list of
> connectors per sb (so that we can quickly reclaim on umount). It seems
> stupid just for these corner cases but longer term we can also implement
> what Dave once suggested [1] so that fsnotify doesn't need to pin inodes in
> memory at all which should more that make up for the additional memory for
> inode connector members.
>
>                                                                 Honza
>
> [1] https://lore.kernel.org/linux-fsdevel/ZwXDzKGj6Bp28kYe@dread.disaster.area/
>

Interesting.
I'll wait for you to think this over.
If you think that it might take some time, maybe we should
apply the super_trylock_shared() band aid to show_mark_fhandle()
in the meantime. Whatever you think is right.

Thanks,
Amir.