linux-kernel - Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to children

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <502E0BC3.8090204@parallels.com>
Date:	Fri, 17 Aug 2012 13:15:47 +0400
From:	Glauber Costa <glommer@...allels.com>
To:	Michal Hocko <mhocko@...e.cz>
CC:	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
	<cgroups@...r.kernel.org>, <devel@...nvz.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	<kamezawa.hiroyu@...fujitsu.com>, Christoph Lameter <cl@...ux.com>,
	David Rientjes <rientjes@...gle.com>,
	Pekka Enberg <penberg@...nel.org>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Suleiman Souhlal <suleiman@...gle.com>
Subject: Re: [PATCH v2 09/11] memcg: propagate kmem limiting information to
 children

On 08/17/2012 01:00 PM, Michal Hocko wrote:
> On Thu 09-08-12 17:01:17, Glauber Costa wrote:
>> The current memcg slab cache management fails to present satisfatory
>> hierarchical behavior in the following scenario:
>>
>> -> /cgroups/memory/A/B/C
>>
>> * kmem limit set at A,
>> * A and B have no tasks,
>> * span a new task in in C.
>>
>> Because kmem_accounted is a boolean that was not set for C, no
>> accounting would be done. This is, however, not what we expect.
>>
>> The basic idea, is that when a cgroup is limited, we walk the tree
>> upwards 
> 
> Isn't it rather downwards? We start at A and then mark all children so
> we go down the tree. Moreover the walk is not atomic wrt. parallel
> charges nor to a new child creation. First one seems to be acceptable
> as the charges go to the root. The second one requires cgroup_lock.
> 

Yes, it is downwards. I've already noticed that yesterday and updated
in my tree.

As for the lock, can't we take set_limit lock in cgroup creation just
around the place that updates that field in the child? It is a lot more
fine grained - everything except the dead bkl is -  and what we're
actually protecting is the limit.

If you prefer, I can use cgroup lock just fine. But then I won't sleep
at night and probably pee my pants, which is something I don't do for at
least two decades now.

> It also seems that you are missing memcg_kmem_account_parent in
> mem_cgroup_create (use_hierarchy path) if memcg_kmem_is_accounted(parent).
> 

You mean when we create a cgroup ontop of an already limited parent?
Humm, you are very right.

> Some further "wording" comments below. Other than that the patch looks
> correct.
> 
>> (something Kame and I already thought about doing for other
>> purposes), and make sure that we store the information about the parent
>> being limited in kmem_accounted (that is turned into a bitmap: two
>> booleans would not be space efficient). 
> 
> Two booleans even don't serve the purpose because you want to test this
> atomically, right?
> 

Well, yes, we have that extra problem as well.
>> The code for that is taken from sched/core.c. My reasons for not
>> putting it into a common place is to dodge the type issues that would
>> arise from a common implementation between memcg and the scheduler -
>> but I think that it should ultimately happen, so if you want me to do
>> it now, let me know.
> 
> Is this really relevant for the patch?
> 

Not at all. Besides not being relevant, it is also not true, since I now
use the memcg iterator. I would prefer the tree walk instead of having
to cope with the order imposed by the memcg iterator, but we add
less code this way...

Again, already modified that in my yesterday's update.

>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3216292..3d30b79 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -295,7 +295,8 @@ struct mem_cgroup {
>>  	 * Should the accounting and control be hierarchical, per subtree?
>>  	 */
>>  	bool use_hierarchy;
>> -	bool kmem_accounted;
>> +
>> +	unsigned long kmem_accounted; /* See KMEM_ACCOUNTED_*, below */
>>  
>>  	bool		oom_lock;
>>  	atomic_t	under_oom;
>> @@ -348,6 +349,38 @@ struct mem_cgroup {
>>  #endif
>>  };
>>  
>> +enum {
>> +	KMEM_ACCOUNTED_THIS, /* accounted by this cgroup itself */
>> +	KMEM_ACCOUNTED_PARENT, /* accounted by any of its parents. */
> 
> How it can be accounted by its parent, the charge doesn't go downwards.
> Shouldn't it rather be /* a parent is accounted */
> 
indeed.

>> +};
>> +
>> +#ifdef CONFIG_MEMCG_KMEM
>> +static bool memcg_kmem_account(struct mem_cgroup *memcg)
> 
> memcg_kmem_set_account? It matches _clear_ counterpart and it makes
> obvious that the value is changed actually.
> 

Ok.

> [...]
>> +static bool memcg_kmem_is_accounted(struct mem_cgroup *memcg)
>> +{
>> +	return test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted);
>> +}
>> +
>> +static void memcg_kmem_account_parent(struct mem_cgroup *memcg)
> 
> same here _set_parent
> 

Ok, agreed.

> [...]
>> @@ -614,7 +647,7 @@ EXPORT_SYMBOL(__memcg_kmem_free_page);
>>  
>>  static void disarm_kmem_keys(struct mem_cgroup *memcg)
>>  {
>> -	if (memcg->kmem_accounted)
>> +	if (test_bit(KMEM_ACCOUNTED_THIS, &memcg->kmem_accounted))
> 
> memcg_kmem_is_accounted. I do not see any reason to open code this.
> 

ok.

>>  #ifdef CONFIG_MEMCG_KMEM
>> -	/*
>> -	 * Once enabled, can't be disabled. We could in theory disable it if we
>> -	 * haven't yet created any caches, or if we can shrink them all to
>> -	 * death. But it is not worth the trouble.
>> -	 */
>> +	struct mem_cgroup *iter;
>> +
>>  	mutex_lock(&set_limit_mutex);
>> -	if (!memcg->kmem_accounted && val != RESOURCE_MAX) {
>> +	if ((val != RESOURCE_MAX) && memcg_kmem_account(memcg)) {
>> +
>> +		/*
>> +		 * Once enabled, can't be disabled. We could in theory disable
>> +		 * it if we haven't yet created any caches, or if we can shrink
>> +		 * them all to death. But it is not worth the trouble
>> +		 */
>>  		static_key_slow_inc(&memcg_kmem_enabled_key);
>> -		memcg->kmem_accounted = true;
>> +
>> +		if (!memcg->use_hierarchy)
>> +			goto out;
>> +
>> +		for_each_mem_cgroup_tree(iter, memcg) {
> 
> for_each_mem_cgroup_tree does respect use_hierarchy so the above
> shortcut is not necessary. Dunno but IMHO we should get rid of explicit
> tests as much as possible. This doesn't look like a hot path anyway.
> 

I can't remember any reason for doing so other than gaining some time.
I will remove it.

>> +			if (iter == memcg)
>> +				continue;
>> +			memcg_kmem_account_parent(iter);
>> +		}
>> +	} else if ((val == RESOURCE_MAX) && memcg_kmem_clear_account(memcg)) {
> 
> Above you said "Once enabled, can't be disabled." and now you can
> disable it? Say you are a leaf group with non accounted parents. This
> will clear the flag and so no further accounting is done. Shouldn't
> unlimited mean that we will never reach the limit? Or am I missing
> something?
>

You are missing something, and maybe I should be more clear about that.
The static branches can't be disabled (it is only safe to disable them
from disarm_static_branches(), when all references are gone). Note that
when unlimited, we flip bits, do a transversal, but there is no mention
to the static branch.

The limiting can come and go at will.

>> +
>> +		if (!memcg->use_hierarchy)
>> +			goto out;
>> +
>> +		for_each_mem_cgroup_tree(iter, memcg) {
>> +			struct mem_cgroup *parent;
>> +
>> +			if (iter == memcg)
>> +				continue;
>> +			/*
>> +			 * We should only have our parent bit cleared if none
>> +			 * of our parents are accounted. The transversal order
>> +			 * of our iter function forces us to always look at the
>> +			 * parents.
>> +			 */
>> +			parent = parent_mem_cgroup(iter);
>> +			for (; parent != memcg; parent = parent_mem_cgroup(iter))
>> +				if (memcg_kmem_is_accounted(parent))
>> +					goto noclear;
>> +			memcg_kmem_clear_account_parent(iter);
> 
> Brain hurts...
> Yes we are iterating in the creation ordering so we cannot rely on the
> first encountered accounted memcg
> A(a) - B - D
>      - C (a) - E
> 
> 

That's why I said I preferred the iterator the scheduler uses. The
actual transverse code was much simpler, because it will stop at an
unlimited parent. But this is the only drawback I see in the memcg
iterator, so I decided that just documenting this "interesting" piece of
code well would do...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/