linux-kernel - Re: [syzbot] [kernfs?] possible deadlock in kernfs_fop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240405-speerwerfen-quetschen-d3de254cf830@brauner>
Date: Fri, 5 Apr 2024 12:47:45 +0200
From: Christian Brauner <brauner@...nel.org>
To: Amir Goldstein <amir73il@...il.com>
Cc: Al Viro <viro@...iv.linux.org.uk>, 
	syzbot <syzbot+9a5b0ced8b1bfb238b56@...kaller.appspotmail.com>, gregkh@...uxfoundation.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, syzkaller-bugs@...glegroups.com, tj@...nel.org, 
	valesini@...dex-team.ru, Christoph Hellwig <hch@....de>, Jan Kara <jack@...e.cz>, 
	Miklos Szeredi <miklos@...redi.hu>
Subject: Re: [syzbot] [kernfs?] possible deadlock in kernfs_fop_llseek

On Thu, Apr 04, 2024 at 12:33:40PM +0300, Amir Goldstein wrote:
> On Thu, Apr 4, 2024 at 11:21 AM Al Viro <viro@...iv.linux.org.uk> wrote:
> >
> > On Thu, Apr 04, 2024 at 09:11:22AM +0100, Al Viro wrote:
> > > On Thu, Apr 04, 2024 at 09:54:35AM +0300, Amir Goldstein wrote:
> > > >
> > > > In the lockdep dependency chain, overlayfs inode lock is taken
> > > > before kernfs internal of->mutex, where kernfs (sysfs) is the lower
> > > > layer of overlayfs, which is sane.
> > > >
> > > > With /sys/power/resume (and probably other files), sysfs also
> > > > behaves as a stacking filesystem, calling vfs helpers, such as
> > > > lookup_bdev() -> kern_path(), which is a behavior of a stacked
> > > > filesystem, without all the precautions that comes with behaving
> > > > as a stacked filesystem.
> > >
> > > No.  This is far worse than anything stacked filesystems do - it's
> > > an arbitrary pathname resolution while holding a lock.
> > > It's not local.  Just about anything (including automounts, etc.)
> > > can be happening there and it pushes the lock in question outside
> > > of *ALL* pathwalk-related locks.  Pathname doesn't have to
> > > resolve to anything on overlayfs - it can just go through
> > > a symlink on it, or walk into it and traverse a bunch of ..
> > > afterwards, etc.
> > >
> > > Don't confuse that with stacking - it's not even close.
> > > You can't use that anywhere near overlayfs layers.
> > >
> > > Maybe isolate it into a separate filesystem, to be automounted
> > > on /sys/power.  And make anyone playing with overlayfs with
> > > sysfs as a layer mount the damn thing on top of power/ in your
> > > overlayfs.  But using that thing as a part of layer is
> > > a non-starter.
> 
> I don't follow what you are saying.
> Which code is in non-starter violation?
> kernfs for calling lookup_bdev() with internal of->mutex held?
> Overlayfs for allowing sysfs as a lower layer and calling
> vfs_llseek(lower_sysfs_file,...) during copy up while ovl inode is held
> for legit reasons (e.g. from ovl_rename())?
> 
> >
> > Incidentally, why do you need to lock overlayfs inode to call
> > vfs_llseek() on the underlying file?  It might (or might not)
> > need to lock the underlying file (for things like ->i_size,
> > etc.), but that will be done by ->llseek() instance and it
> > would deal with the inode in the layer, not overlayfs one.
> 
> We do not (anymore) lock ovl inode in ovl_llseek(), see:
> b1f9d3858f72 ovl: use ovl_inode_lock in ovl_llseek()
> but ovl inode is held in operations (e.g. ovl_rename)
> which trigger copy up and call vfs_llseek() on the lower file.
> 
> >
> > Similar question applies to ovl_write_iter() - why do you
> > need to hold the overlayfs inode locked during the call of
> > backing_file_write_iter()?
> >
> 
> Not sure. This question I need to defer to Miklos.
> I see in several places the pattern:
>         inode_lock(inode);
>         /* Update mode */
>         ovl_copyattr(inode);
>         ret = file_remove_privs(file);
> ...
>         /* Update size */
>         ovl_file_modified(file);
> ...
>         inode_unlock(inode);
> 
> so it could be related to atomic remove privs and update mtime,
> but possibly we could convert all of those inode_lock() to
> ovl_inode_lock() (i.e. internal lock below vfs inode lock).
> 
> [...]
> > Consider the scenario when unlink() is called on that sucker
> > during the write() that triggers that pathwalk.  We have
> >
> > unlink: blocked on overlayfs inode of file, while holding the parent directory.
> > write: holding the overlayfs inode of file and trying to resolve a pathname
> > that contains .../power/suspend_stats/../../...; blocked on attempt to lock
> > parent so we could do a lookup in it.
> 
> This specifically cannot happen because sysfs is not allowed as an
> upper layer only as a lower layer, so overlayfs itself will not be writing to
> /sys/power/resume.

I don't understand that part. If overlayfs uses /sys/power/ as a lower
layer it can open and write to /sys/power/resume, no?

Honestly, why don't you just block /sys/power from appearing in any
layer in overlayfs? This seems like such a niche use-case that it's so
unlikely that this will be used that I would just try and kill it.

If you do it like Al suggested and switch it to an automount you get
that for free. But I guess you can also just block it without that.

(Frankly, I find it weird that sysfs is allowed as a layer in any case. I
completely forgot about this. Imho, both procfs and sysfs should not be
usable as a lower layer - procfs is, I know - and then only select parts
should be like /sys/fs/cgroup or sm where I can see the container people
somehow using this to mess with the cgroup tree or something.)

> 
> >
> > No llseek involved anywhere, kernfs of->mutex held, but not contended,
> > deadlock purely on ->i_rwsem of overlayfs inodes.
> >
> > Holding overlayfs inode locked during the call of lookup_bdev() is really
> > no-go.
> 
> Yes. I see that, but how can this be resolved?
> 
> If the specific file /sys/power/resume will not have FMODE_LSEEK,
> then ovl_copy_up_file() will not try to skip_hole on it at all and the
> llseek lock dependency will be averted.
> 
> The question is whether opt-out of FMODE_LSEEK for /sys/power/resume
> will break anything or if we should mark seq files in another way so that
> overlayfs would not try to seek_hole on any of them categorically.
> 
> Thanks,
> Amir.