Message-ID: <20140413053956.GM18016@ZenIV.linux.org.uk>
Date: Sun, 13 Apr 2014 06:39:56 +0100
From: Al Viro <viro@...IV.linux.org.uk>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
"Serge E. Hallyn" <serge@...lyn.com>,
Linux-Fsdevel <linux-fsdevel@...r.kernel.org>,
Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andy Lutomirski <luto@...capital.net>,
Rob Landley <rob@...dley.net>,
Miklos Szeredi <miklos@...redi.hu>,
Christoph Hellwig <hch@...radead.org>,
Karel Zak <kzak@...hat.com>,
"J. Bruce Fields" <bfields@...ldses.org>,
Fengguang Wu <fengguang.wu@...el.com>
Subject: Re: [RFC][PATCH] vfs: In mntput run deactivate_super on a shallow stack.

On Sat, Apr 12, 2014 at 03:15:39PM -0700, Eric W. Biederman wrote:
> Can you explain which scenario you are thinking about with respect to a
> failed modprobe?

Completely made up example:

static struct file_system_type foofs = {
	.mount = mount_foo,
	.kill_sb = kill_foo,
};

static struct vfsmount *mnt;

static __init int foo_init(void)
{
	int err;

	err = init_some();
	if (err < 0)
		return err;

	mnt = kern_mount(&foofs);
	if (IS_ERR(mnt)) {
		uninit_some();
		return PTR_ERR(mnt);
	}

	err = init_some_more();
	if (err < 0) {
		/* failure exit: drop the internal mount, then undo init_some() */
		kern_unmount(mnt);
		uninit_some();
		return err;
	}

	printk(KERN_INFO "loaded foo\n");
	return 0;
}

Now, think what happens if init_some_more() in the above fails. With the
current mntput() semantics, everything works. After making mntput() (from
kern_unmount()) delayed until the return to userland, we end up with an attempt
to call kill_foo() after the memory where its code sits gets freed. For that
matter, by that point we are not even guaranteed to reach it, since it
comes as mnt->mnt_sb->s_type->kill_sb() and s_type points to freed memory.

I'm not saying that we have something that would closely resemble this
example, but it's not hard to vary it in a lot of ways, keeping the same
problem. Basically, you need to audit all paths leading from failure
exits in some module_init() to mntput() and figure out if delaying the
effect of that mntput() would be safe there (== doesn't get delayed past
the point where we destroy something needed for that fs shutdown).
It's not *that* horrible, since not too many modules out there are
declaring any fs types, but it needs to be done. In theory, you could
also fall prey to something like this:

	type = get_fs_type("proc");
	ns = kmalloc(...);
	/* fill *ns */
	mnt = kern_mount_data(type, ns);
	...
	if (error) {
		kern_unmount(mnt);
		kfree(ns);
		put_filesystem(type);
	}

possibly with get_fs_type() replaced with some other way to get that
pointer to fs type (defined elsewhere). E.g. for procfs it could
be, say, task_active_pid_ns(current)->proc_mnt->mnt_sb->s_type, etc.
Again, it's not impossible to audit (there's not a lot of places where
struct file_system_type * is ever stored, there are few instances of
struct file_system_type, all statically allocated, etc.), but it's
a non-trivial amount of work. And I honestly don't know if we have
any such places right now. Moreover, unless you feel like repeating
that kind of audit every merge window, we'll need some way of dealing
with such situations. Something like flush_pending_mntput(fs_type), for
example, documented as a barrier to be used in such places might do, but
if you can think of something more fool-proof...
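
To make the ordering problem and the barrier idea concrete, here is a
userspace toy model. It is a sketch only, not kernel code: every toy_*
name is hypothetical, struct toy_fs_type merely stands in for struct
file_system_type, and the "loaded" flag stands in for the module's memory
still being there. The assumption is that a delayed mntput() amounts to
queueing the fs shutdown and running it later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct file_system_type; not the kernel structure. */
struct toy_fs_type {
	const char *name;
	bool loaded;		/* false once the "module" has been torn down */
	int kills_run;		/* shutdowns that ran while still loaded */
};

/* One deferred "final mntput" waiting to run. */
struct pending_put {
	struct toy_fs_type *type;
};

#define MAX_PENDING 16
static struct pending_put pending[MAX_PENDING];
static size_t nr_pending;

/* Queue the shutdown instead of doing it now (models the delayed mntput()). */
static void toy_mntput_deferred(struct toy_fs_type *type)
{
	assert(nr_pending < MAX_PENDING);
	pending[nr_pending++].type = type;
}

/* Run one shutdown; returns false if the type is already gone. */
static bool toy_kill_sb(struct toy_fs_type *type)
{
	if (!type->loaded)
		return false;	/* in the kernel this would be a use-after-free */
	type->kills_run++;
	return true;
}

/* Hypothetical barrier: run every pending put for @type right now. */
static void toy_flush_pending_mntput(struct toy_fs_type *type)
{
	size_t i, kept = 0;

	for (i = 0; i < nr_pending; i++) {
		if (pending[i].type == type)
			toy_kill_sb(type);
		else
			pending[kept++] = pending[i];
	}
	nr_pending = kept;
}

/* Models "return to userland": run what is left, count unsafe shutdowns. */
static int toy_run_pending(void)
{
	size_t i;
	int unsafe = 0;

	for (i = 0; i < nr_pending; i++)
		if (!toy_kill_sb(pending[i].type))
			unsafe++;
	nr_pending = 0;
	return unsafe;
}

/* Failure exit of a module init; returns the number of unsafe shutdowns. */
static int toy_failed_init(struct toy_fs_type *type, bool use_barrier)
{
	toy_mntput_deferred(type);	/* kern_unmount() in the example */
	if (use_barrier)
		toy_flush_pending_mntput(type);
	type->loaded = false;		/* module memory goes away */
	return toy_run_pending();	/* deferred work finally runs */
}
```

With a freshly initialized type (loaded == true), toy_failed_init(&t, false)
reports one unsafe shutdown, since the queued work runs only after the type
is gone. toy_failed_init(&t, true) reports none, because the barrier forces
the shutdown to happen before the teardown. The model says nothing about how
such a barrier would find the pending work in the real kernel.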