linux-kernel - Re: memcg creates an unkillable task in 3.2-rc2


Message-ID: <87r4eh70yg.fsf@xmission.com>
Date:	Mon, 29 Jul 2013 10:03:35 -0700
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	Tejun Heo <tj@...nel.org>
Cc:	Michal Hocko <mhocko@...e.cz>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	cgroups@...r.kernel.org, containers@...ts.linux-foundation.org,
	linux-kernel@...r.kernel.org, kent.overstreet@...il.com,
	Li Zefan <lizefan@...wei.com>,
	Glauber Costa <glommer@...il.com>,
	Johannes Weiner <hannes@...xchg.org>
Subject: Re: memcg creates an unkillable task in 3.2-rc2

Tejun Heo <tj@...nel.org> writes:

> Hello,
>
> On Mon, Jul 29, 2013 at 11:51:09AM +0200, Michal Hocko wrote:
>> Isn't this a bug in freezer then? I am not familiar with the freezer
>> much but memcg oom handling seems correct to me. The task is sleeping
>> KILLABLE and fatal_signal_pending in mem_cgroup_handle_oom will tell us
>> to bypass the charge and let the task go away.
>
> Is the problem a frozen task not being killed even when SIGKILL is
> received?  If so, it is a known problem and a side-effect of
> cgroup_freezer (ab)using and making the existing power management
> freezer visible to userland without really thinking about the
> implications.  :(

Something like that.  I need to look at it in a little more detail.
The idiom someone adopted to atomically kill all of the tasks in a
cgroup is to freeze all of the tasks, send them SIGKILL, and then
unfreeze all of the tasks.
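
A minimal sketch of that freeze / SIGKILL / thaw idiom, assuming a
cgroup v1 freezer hierarchy whose per-group directory exposes the
freezer.state and cgroup.procs files.  The function names, the timeout,
and the injectable kill callable are illustrative choices, not taken
from the reproducer discussed in this thread.  Because freezing can
fail, the sketch reports whether FROZEN was actually reached instead of
assuming it.

```python
import os
import signal
import time


def read_pids(cgroup_dir):
    """Return the PIDs listed in the cgroup's cgroup.procs file."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def set_freezer_state(cgroup_dir, state):
    """Request a freezer state by writing FROZEN or THAWED."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def wait_for_state(cgroup_dir, wanted, timeout=5.0, poll=0.05):
    """Poll freezer.state until it reads `wanted`; False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        with open(os.path.join(cgroup_dir, "freezer.state")) as f:
            if f.read().strip() == wanted:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def freeze_kill_thaw(cgroup_dir, kill=os.kill, timeout=5.0):
    """Freeze the group, SIGKILL every listed PID, then thaw.

    Returns (frozen, pids): whether FROZEN was observed before the
    signals were sent, and the PIDs that were signalled.
    """
    set_freezer_state(cgroup_dir, "FROZEN")
    frozen = wait_for_state(cgroup_dir, "FROZEN", timeout)
    pids = read_pids(cgroup_dir)
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the task exited before the signal was sent
    set_freezer_state(cgroup_dir, "THAWED")
    return frozen, pids
```

A caller would pass the group's directory under wherever the freezer
hierarchy happens to be mounted, and should treat frozen == False as
the failure case described here rather than as a successful kill.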

The freezing actually fails in this case so I don't know what is
happening.

So this is not a simple matter of a frozen task not dying when SIGKILL
is received.  For the most part not dying when SIGKILL is received seems
like correct behavior for a frozen task.  Certainly it is correct
behavior for any other signal.

The issue is that the tasks don't freeze or that when thawed the SIGKILL
is still ignored.  It seems a wake up is being missed in there somewhere.

> So, yeah, if you use cgroup_freezer now, the tasks will get stuck in
> states which aren't well defined when visible from userland and will
> just stay there until unfrozen no matter what.  Yet another reason
> I'll be screaming like a banshee at anyone who says that cgroup is
> built to delegate subtree access rights to !root users.

Yes.  From the looks of it the cgroup implementation is rather badly
borked right now, and definitely not up to the standards of the other
core pieces of the kernel.  That is one of the reasons I was rather
appalled when systemd started using them.  Until the code actually
works reliably and the races are removed, most people's systems would
be much better off with cgroups compiled out.

A single unified hierarchy is a really nasty idea for the same set of
reasons.  You have to recompile to disable a controller to see if that
controller's bugs are what is causing problems on your production
system.  A recompile or even just a reboot is a very heavy hammer to
ask people to use when they are triaging a problem.

That said, semantically, having more than single-process controls for
all of user space is very desirable.  But until we have code that is
safe to use, giving it any additional exposure seems like a bad idea.

> It's on the to-do list but a very long term one.  Right now, if you
> combine userland OOM handling with freezer and whatnot, it'd be pretty
> easy to get into trouble.

Thanks for the heads up.  Right now this is looking like a regression,
but it might just be that my test machine has the right combination
of racy pixies to trigger a long-standing bug.

I am also seeing what looks like a leak somewhere in the cgroup code.
After some runs of the same reproducer I get into a state where, after
everything is cleaned up (all of the control groups have been removed
and the cgroup filesystem is unmounted), I can mount a cgroup
filesystem with that same combination of subsystems, but I can't mount
a cgroup filesystem with any of those subsystems in any other
combination.  So I am guessing that the superblock from the original
mount is still lingering for some reason.
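
One way to look for that kind of leftover state is to read
/proc/cgroups after unmounting and see which subsystems still report a
nonzero hierarchy ID.  The sketch below assumes the usual four-column
layout of that file (subsys_name, hierarchy, num_cgroups, enabled) and
assumes that, on cgroup v1 kernels of this era, a hierarchy ID of 0
means the subsystem is not attached to any mounted hierarchy.  Whether
a lingering superblock actually shows up this way is an assumption, not
something established in this thread, and the helper names are
illustrative.

```python
def parse_proc_cgroups(text):
    """Map subsystem name -> (hierarchy_id, num_cgroups, enabled)."""
    table = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the "#subsys_name ..." header
        name, hierarchy, num_cgroups, enabled = line.split()
        table[name] = (int(hierarchy), int(num_cgroups), enabled == "1")
    return table


def still_bound(text, subsystems):
    """Return the given subsystems that still report a nonzero hierarchy ID."""
    table = parse_proc_cgroups(text)
    return sorted(s for s in subsystems if s in table and table[s][0] != 0)
```

Feeding it the contents of /proc/cgroups once everything has been
removed and unmounted, together with the subsystems the reproducer
used, would list any that still appear attached to a hierarchy.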

Eric