linux-kernel - Re: linux-next: cgroup_mount() falls asleep forever

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140925032502.GB6903@mtj.dyndns.org>
Date:	Wed, 24 Sep 2014 23:25:02 -0400
From:	Tejun Heo <tj@...nel.org>
To:	Al Viro <viro@...IV.linux.org.uk>
Cc:	Andrey Wagin <avagin@...il.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Li Zefan <lizefan@...wei.com>
Subject: Re: linux-next: cgroup_mount() falls asleep forever

Hello, Al.

On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote:
> > Yeah, it's an ugly thing to work around vfs interface not very
> > conducive for filesystems which conditionally create or reuse
> > superblocks during mount.  There was a thread explaining what's going
> > on.  Looking up...
> > 
> >   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635
> 
> Umm...  I still don't get it.  Could you describe the screnario in which
> that percpu_ref_tryget_live() would be called and managed to fail?

That was for the initial fix and Li later added the pinning to fix
something else.  Let's wait for Li to chime in.  He knows this part
better.

> It smells to me like most of the problems here are simply due to having too
> many locks and not being able to decide where should they live relative to
> ->s_umount.  That cgroup_mutex thing feels like something way to coarse...
> You have it grabbed/dropped in

cgroup_mutex is the outer-most lock as far as cgroup is concerned and
not expected to nest under anything which is used by individual
controllers.  Most of what cgroup core does is low-freq managerial
things which don't benefit from finer grained locking and the mount
path is one of few surfaces where it interacts with outside in terms
of locking, so it's better to keep that path special and everything
else simpler.

> And that's a single system-wide mutex.  Plus there's a slew of workqueues
> and really unpleasant abuse of restart_syscall() tossed in for fun - what
> happens if some joker triggers that ->mount() _not_ from mount(2)?

For cgroup, mount is the userland-visible init interface.  It gotta be
called from userland.  It originally had internal retry loop but
syscall restart is simpler.  Reviving that loop isn't difficult if it
ever becomes necessary.

> Then there's a global rwsem nesting inside that sucker.  And there's another

The rwsem nests inside cgroup_mutex and exists to allow multiple
reader accesses to a particular data structure.

> mutex in fs/kernfs - also a global one.  Are the locking rules documented
> anywhere?  Lifetime rules, as well...

kernfs one shouldn't interact with anything outside kernfs.  Its
dependency is terminated within kernfs.

> Frankly, my first inclination here would be to try using sget() instead of
> scanning the list of roots.  How painful would it be to split the allocation
> and setup of roots, always allocate a new root and have sget() wait
> for fs shutdown in progress and decide whether it wants to reuse the existing
> one.  You can easily tell reuse existing vs. set up a new one from each other -
> just look at the root associated with the superblock you've got and check
> if it's the one you've allocated.  Freeing the damn thing if we'd reused
> an existing one and doing setup otherwise.
> 
> I realize that it won't do in such form; your for_each_subsys() loop in there
> really depends on holding cgroup_mutex all the way through.  But do we really
> need it there?  Would just skipping the ones that doomed in rebind_subsystems()
> suffice?

At this point, cgroup core locking is heavily focused on simplicity -
cgroup_mutex for the whole thing and css_set_rwsem for css_set reader
accesses.  It works out pretty well for the rest of the code but the
mount path does get tricky.  We can definitely relax / separate out
locking on subsys iteration for mount path but if possible I'd prefer
to pay isolated complexity there instead of spilling it to other
places.

Anyways, let's wait for Li.  At least nobody reported breakage before
the recent commit, so we can revert the offending commit for the short
term.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/