linux-kernel - Re: cgroup information proc file format

Date:	Wed, 5 Oct 2011 11:47:21 +0400
From:	Glauber Costa <glommer@...allels.com>
To:	Serge Hallyn <serge.hallyn@...onical.com>
CC:	Daniel Lezcano <daniel.lezcano@...e.fr>,
	<linux-kernel@...r.kernel.org>,
	Balbir Singh <bsingharora@...il.com>,
	Paul Menage <paul@...lmenage.org>
Subject: Re: cgroup information proc file format

On 10/04/2011 06:05 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@...allels.com):
> ...
>
>>> Can't we just introduce the
>>> /sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
>>> if cgroups are enabled and the task's memory cgroup != '/', return
>>> the data from that file?
>>
>> First: If we're doing that, why do we need that file in the first place?
>
> We might not :)  But we might, if we want to offer containers a choice of
> whether /proc/meminfo is the host's or the container's.

Hi,

Please allow me to clarify some points so we are on the same page (thus
avoiding fragmentation =p )

Are you quoting /proc/meminfo as an example only, or are you concerned 
specifically with this file? I myself am talking about proc files in 
general.

We have to keep in mind that there is a myriad of them: they convey
different kinds of information, belong to different subsystems, and have
different expected behavior.

That is important because for some of them, what you state about only 
allowing a group of processes to see the resources they have makes 
sense. For others, maybe not.

>> The file is useful if we're bind mounting, but if we're
>> automatically displaying it according to any criteria, not that
>> interesting. Well, it would allow the root container to view it, so
>> maybe it is in fact interesting...
>>
>> As for cgroup != '/', I am not sure if it works. Well, for
>> containers, it works beautifully. But what we have in the kernel now
>> is a mechanism for resource control (cgroups) and a mechanism for
>> isolation (namespaces). Displaying data falls in the isolation
>> realm. There are users using just the resource control part (think
>> of systemd). I doubt they'd like to suddenly, after years expecting
>> system-wide info, read per-cgroup data when querying a /proc file.
>
> That's where the /sys/fs/cgroup/memory/memory.use_cgroup_as_proc file
> I mentioned below would come in.  The host could choose to give
> that application the host /proc/meminfo view.
I am sorry, I think I missed you mentioning this file.

Correct me if I am wrong, but it seems to me now that we agree that 
there should be a mechanism determining whether or not to automatically 
show cgroup-restrained values in proc files.

This is a key point for me. What the mechanism is matters less, as long
as it is a one-time shot.

>
> Still, if the applications you are thinking of are having their
> resources restricted, what harm would come of reporting their actual
> allotted resources in place of an artificially inflated number?
Think of /proc/stat, the file I am working on now, as an example.

Historically, this file shows, among other things, user ticks for all 
processes in the system. In a container system, we want this to 
represent only the set of processes inside a container.

But why on earth can we assume that nobody, in any use case, would be
harmed by having just their own processes' ticks displayed? I don't
think we can.

Note that people are now using cgroups for other things (think systemd).

They can serve as process grouping, simple restriction, etc.
So the less we assume, the better.
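
To make the /proc/stat point concrete, here is a minimal Python sketch of
what a typical consumer does with that file: it reads the user ticks from
the aggregate "cpu" line. The sample text and the function name are made
up for illustration; only the field order (user first, then nice, system,
idle, ...) follows the documented /proc/stat layout.

```python
def cpu_user_ticks(stat_text):
    """Return the 'user' tick count from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly 'cpu';
        # per-CPU lines are labelled 'cpu0', 'cpu1', and so on.
        if fields and fields[0] == "cpu":
            return int(fields[1])
    raise ValueError("no aggregate 'cpu' line found")


# Made-up sample in /proc/stat format (illustrative numbers only).
SAMPLE_STAT = """cpu  4705 150 1120 16250 520 0 30 0 0 0
cpu0 2400 80 560 8100 260 0 15 0 0 0
cpu1 2305 70 560 8150 260 0 15 0 0 0
"""

if __name__ == "__main__":
    print(cpu_user_ticks(SAMPLE_STAT))
```

A tool written this way expects system-wide ticks. If the same read
started returning only one cgroup's ticks, the code would keep working
but its result would mean something different.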

>
>> So, it is because I'm all for automatic that I am proposing this. I
>> think we need a mechanism to tie a cgroup to a namespace (or many,
>> one of each kind).
>>
>> I myself can settle down for:
>>    * If namespace != '/' =>  show cgroup information instead of
>>      system-wide. (What do you think?)
>
> I don't like it  :)
>
> The namespaces are about name->object relations, not just about
> isolation.  In contrast, the cgroups are precisely about resource
> limitations.
Right.

>> The only reason I proposed anything more complicated than that, is
>> that I was fearing there were weirdos out there for whom "every
>> process in a cgroup is in the same namespace" wouldn't hold, and
>
> Absolutely.
>
>> they'd want to opt this out. But I honestly think this is a very
>> sick usecase.
>
> :)
>
> Don't get me wrong, I don't think it would hurt to always give them
> the cgroup data.  I just think the check is not 'correct'.
>
>>> We might also want to have a /sys/fs/cgroup/memory/memory.show_proc_data
>>> (etc) file which defaults to 1 (show the cgroup's file data in place of
>>> /proc/meminfo), which can be set to 0 on the host so that the container,
>>> if it wants, can see the host's data.

A container can't want anything. I am more concerned here with the other 
types of use cases.

BTW, a file in each cgroup:

/sys/fs/cgroup/memory/memory.restrict_proc_data (or any other name)
/sys/fs/cgroup/cpu/cpu.restrict_proc_data (or any other name)
etc...

works for me as well.
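
As a rough illustration only (this is not kernel code, and no such file
exists today), the decision a per-cgroup knob would drive can be sketched
in Python. The restrict_proc_data name is the one floated above; the
Cgroup type, the proc_view function and the example paths are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    path: str                 # "/" denotes the root cgroup
    restrict_proc_data: bool  # hypothetical per-cgroup knob from this thread


def proc_view(cgroup):
    """Pick which data a /proc read by a task in this cgroup would see."""
    # For the root cgroup the two views coincide, so report system-wide.
    if cgroup.path == "/":
        return "system-wide"
    # Otherwise the knob decides; plain resource-control users
    # (the systemd case) would leave it off.
    return "cgroup" if cgroup.restrict_proc_data else "system-wide"


if __name__ == "__main__":
    print(proc_view(Cgroup("/containers/c1", True)))
    print(proc_view(Cgroup("/system/sshd", False)))
```

The sketch keeps the choice explicit per cgroup rather than inferring it
from cgroup != '/' or from the namespace, which is the assumption being
argued against here.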

>>>
>>>> This idea is almost setup-free (with the exception of dumping pids
>>>> into the cgroup files, but if the files are default for all cgroups,
>>>> a 3-line loop can do it in a very future-proof way). But in reality,
>>>> what appeals to me about it, is that it is a mechanism for coupling
>>>> those two
>>>> entities that in our case, should be the same. It provides stronger
>>>> guarantees that we will never be able to see any data outside the
>>>> ones we are entitled to, even if we get the bind mounts set up wrongly.
>>>>
>>>> (disclaimer: wild idea ahead)
>>>> If we, for instance, code in such a way that if a certain proc-file
>>>> is per-namespace, the task could get no data at all unless a
>>>> cgroup-binding is set, providing stronger isolation guarantees.
>>>
>>> Are there good reasons to worry about guaranteeing this particular
>>> isolation?  My impression was that this stuff is useful for the
>>> application - the better it can calculate the resources available
>>> to it, the better it can get along with others and avoid getting killed
>>> later.  But I didn't think our goal was to try and hide the host
>>> info from the container - we just want to give it most meaningful
>>> info.
>>
>> First of all, note that I am not overly concerned about that.
>> But it may prove useful.
>> If I am in a container side by side with yours, I'd prefer you wouldn't
>> be able to guess anything about me, including my workload type,
>> memory usage, etc, and this could be used by clever exploiters.
>>
>> Besides, /proc holds all sorts of stuff. Networking routing tables
>> and connection status, for example. Those are not just statistics,
>> and should maybe be totally hidden.
>
> I think that should be done separate from this whole discussion - using
> user namespaces.  Any task in a non-initial user namespace will only
> get the world access rights to a procfile.  So if the file isn't world
> readable, then a container won't be able to read it.

Yeah. Well, this was never part of the main discussion anyway =)
I agree with you here.

>>> (That's probably also why this stuff has been languishing - it's
>>> rather low in priority because unlike other things it won't harm
>>> the host)
>>
>> Agreed about that. But hey, at some point it has to be done...
>
> :)
>
> -serge
