linux-kernel - Re: linux-next: cgroup_mount() falls asleep forever

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5423ED2B.8010009@huawei.com>
Date:	Thu, 25 Sep 2014 18:23:39 +0800
From:	Zefan Li <lizefan@...wei.com>
To:	Tejun Heo <tj@...nel.org>
CC:	Al Viro <viro@...IV.linux.org.uk>, Andrey Wagin <avagin@...il.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: linux-next: cgroup_mount() falls asleep forever

On 2014/9/25 11:25, Tejun Heo wrote:
> Hello, Al.
> 
> On Thu, Sep 25, 2014 at 03:47:19AM +0100, Al Viro wrote:
>>> Yeah, it's an ugly thing to work around vfs interface not very
>>> conducive for filesystems which conditionally create or reuse
>>> superblocks during mount.  There was a thread explaining what's going
>>> on.  Looking up...
>>>
>>>   http://thread.gmane.org/gmane.linux.kernel.containers/27623/focus=10635
>>
>> Umm...  I still don't get it.  Could you describe the screnario in which
>> that percpu_ref_tryget_live() would be called and managed to fail?
> 

See this:

https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/commit/kernel/cgroup.c?id=3a32bd72d77058d768dbb38183ad517f720dd1bc

Without the above commit, A scenario like this can happen:

1. Thread A has dropped sb refcnt to 0 but hasn't called percpu_ref_kill(). 
2. At this time Thread B calls cgroup_mount() and percpu_ref_tryget() succeeds.
3. After a while Thread C calls cgroup_mount(), but percpu_ref_tryget() keeps
returning failure, because the ref has been killed by percpu_ref_kill().

So I had to use kernfs_pin_sb() to prevent thread B from getting the percpu refcnt.

Any better idea to fix this?

> That was for the initial fix and Li later added the pinning to fix
> something else.  Let's wait for Li to chime in.  He knows this part
> better.
> 
>> It smells to me like most of the problems here are simply due to having too
>> many locks and not being able to decide where should they live relative to
>> ->s_umount.  That cgroup_mutex thing feels like something way to coarse...
>> You have it grabbed/dropped in
> 
> cgroup_mutex is the outer-most lock as far as cgroup is concerned and
> not expected to nest under anything which is used by individual
> controllers.  Most of what cgroup core does is low-freq managerial
> things which don't benefit from finer grained locking and the mount
> path is one of few surfaces where it interacts with outside in terms
> of locking, so it's better to keep that path special and everything
> else simpler.
> 
>> And that's a single system-wide mutex.  Plus there's a slew of workqueues
>> and really unpleasant abuse of restart_syscall() tossed in for fun - what
>> happens if some joker triggers that ->mount() _not_ from mount(2)?
> 
> For cgroup, mount is the userland-visible init interface.  It gotta be
> called from userland.  It originally had internal retry loop but
> syscall restart is simpler.  Reviving that loop isn't difficult if it
> ever becomes necessary.
> 
>> Then there's a global rwsem nesting inside that sucker.  And there's another
> 
> The rwsem nests inside cgroup_mutex and exists to allow multiple
> reader accesses to a particular data structure.
> 
>> mutex in fs/kernfs - also a global one.  Are the locking rules documented
>> anywhere?  Lifetime rules, as well...
> 
> kernfs one shouldn't interact with anything outside kernfs.  Its
> dependency is terminated within kernfs.
> 
>> Frankly, my first inclination here would be to try using sget() instead of
>> scanning the list of roots.  How painful would it be to split the allocation
>> and setup of roots, always allocate a new root and have sget() wait
>> for fs shutdown in progress and decide whether it wants to reuse the existing
>> one.  You can easily tell reuse existing vs. set up a new one from each other -
>> just look at the root associated with the superblock you've got and check
>> if it's the one you've allocated.  Freeing the damn thing if we'd reused
>> an existing one and doing setup otherwise.
>>
>> I realize that it won't do in such form; your for_each_subsys() loop in there
>> really depends on holding cgroup_mutex all the way through.  But do we really
>> need it there?  Would just skipping the ones that doomed in rebind_subsystems()
>> suffice?
> 
> At this point, cgroup core locking is heavily focused on simplicity -
> cgroup_mutex for the whole thing and css_set_rwsem for css_set reader
> accesses.  It works out pretty well for the rest of the code but the
> mount path does get tricky.  We can definitely relax / separate out
> locking on subsys iteration for mount path but if possible I'd prefer
> to pay isolated complexity there instead of spilling it to other
> places.
> 
> Anyways, let's wait for Li.  At least nobody reported breakage before
> the recent commit, so we can revert the offending commit for the short
> term.
> 

That patch is wrong. We have to use both pinned_sb and new_sb, so
please revert it. :(

I think we need to put some test cases in tools/testing/selftests/, to
prevent this fragile thing from breaking again.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/