linux-kernel - Re: [RFC 0/4] memcg: Low-limit reclaim

Message-ID: <xr931tzphu50.fsf@gthelen.mtv.corp.google.com>
Date:	Thu, 30 Jan 2014 16:28:27 -0800
From:	Greg Thelen <gthelen@...gle.com>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	linux-mm@...ck.org, Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Ying Han <yinghan@...gle.com>, Hugh Dickins <hughd@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Tejun Heo <tj@...nel.org>
Subject: Re: [RFC 0/4] memcg: Low-limit reclaim

On Thu, Jan 30 2014, Michal Hocko wrote:

> On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> [...]
>> The series looks useful.  We (Google) have been using something similar.
>> In practice such a low_limit (or memory guarantee), doesn't nest very
>> well.
>> 
>> Example:
>>   - parent_memcg: limit 500, low_limit 500, usage 500
>>     1 privately charged non-reclaimable page (e.g. mlock, slab)
>>   - child_memcg: limit 500, low_limit 500, usage 499
>
> I am not sure this is a good example. Your setup basically says that no
> single page should be reclaimed. I can imagine this might be useful in
> some cases and I would like to allow it but it sounds too extreme (e.g.
> a load which would start thrashing heavily once the reclaim starts and it
> makes more sense to start it again rather than crawl - think about some
> mathematical simulation which might diverge).

Pages will still be reclaimed when usage_in_bytes exceeds
limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
reclaim my memory due to external pressure, but internal pressure is
different.
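
A rough userspace sketch of that distinction, not kernel code.  The
struct and both helper names are hypothetical and exist only for
illustration; values are in pages.

```c
#include <stdbool.h>

/* Hypothetical model of one memcg; not a kernel structure. */
struct group {
	unsigned long usage;
	unsigned long limit;
	unsigned long low_limit;
};

/* Internal pressure: a new charge would push the group past its own limit. */
static bool charge_needs_reclaim(const struct group *g, unsigned long nr_pages)
{
	return g->usage + nr_pages > g->limit;
}

/* External pressure may only take pages from a group above its low_limit. */
static bool reclaimable_under_external_pressure(const struct group *g)
{
	return g->usage > g->low_limit;
}
```

With child_memcg from the example (limit 500, low_limit 500, usage 499),
the second helper reports that outside pressure must leave the group
alone, while the first reports that a 2 page charge still forces the
group to reclaim from itself.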

>> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> page cache it will lead to an oom kill instead of reclaiming. 
>
> Does it make any sense to protect all of such memory although it is
> easily reclaimable?

I think protection makes sense in this case.  If I know my workload
needs 500 to operate well, then I reserve 500 using low_limit.  My app
doesn't want to run with less than its reservation.

>> One could argue that this is working as intended because child_memcg
>> was promised 500 but can only get 499.  So child_memcg is oom killed
>> rather than being forced to operate below its promised low limit.
>> 
>> This has led to various internal workarounds like:
>> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>>   only charge memory to cgroup leafs.  This gets tricky when dealing
>>   with reparented memory inherited to parent from child during cgroup
>>   deletion.
>
> Do those need any protection at all?

Interior tree nodes don't need protection from their children.  But
children and interior nodes need protection from siblings and parents.

>> - don't set low_limit on non leafs (e.g. do not set low limit on
>>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>>   customers want to purchase $MEM and setup their workload with a few
>>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>>   for top-level containers (e.g. parent_memcg).  Thereafter such
>>   customers are able to partition their workload with sub memcg below
>>   child_memcg.  Example:
>>      parent_memcg
>>          \
>>           child_memcg
>>             /     \
>>         server   backup
>
> I think that the low_limit makes sense where you actually want to
> protect something from reclaim. And backup sounds like a bad fit for
> that.

The backup job would presumably have a small low_limit, but it may still
have a minimum working set required to make useful forward progress.

Example:
  parent_memcg
      \
       child_memcg limit 500, low_limit 500, usage 500
         /     \
         |   backup   limit 10, low_limit 10, usage 10
         |
      server limit 490, low_limit 490, usage 490

One could argue that problems appear when
server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
configuration is to leave some padding:
  server.low_limit + backup.low_limit + padding = child_memcg.limit
but this just defers the problem.  As memory is reparented into the
parent, the padding must grow.
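
The arithmetic can be sketched as below.  The helper name and its
parameters are hypothetical, values are in pages, and this is only a
userspace illustration of the bookkeeping, not anything from the series.

```c
/*
 * Hypothetical helper: slack left below `limit` once the children's
 * low_limits and the pages charged directly to the parent (for example
 * reparented pages) are accounted for.  A negative result means the
 * guarantees can no longer all be honored at the same time.
 */
static long guarantee_slack(long limit, long sum_child_low_limit,
			    long parent_private)
{
	return limit - sum_child_low_limit - parent_private;
}
```

For the tree above, guarantee_slack(500, 490 + 10, 0) is 0, and a single
page reparented into child_memcg makes it -1.  Any fixed padding is used
up the same way, one reparented page at a time.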

>>   Thereafter customers often want some weak isolation between server and
>>   backup.  To avoid undesired oom kills the server/backup isolation is
>>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>>   limit acts like the low_limit until priority becomes desperate.
>
> Johannes was already suggesting that the low_limit should allow for a
> weaker semantic as well. I am not very much inclined to that but I can
> live with a knob which would say oom_on_lowlimit (on by default but
> allowed to be set to 0). We would fall back to the full reclaim if
> no groups turn out to be reclaimable.

I like the strong semantic of your low_limit at least at level:1 cgroups
(direct children of root).  But I have also encountered situations where
a strict guarantee is too strict and a mere preference is desirable.
Perhaps the best plan is to continue with the proposed strict low_limit
and eventually provide an additional mechanism which provides weaker
guarantees (e.g. soft_limit or something else if soft_limit cannot be
altered).  These two would offer good support for a variety of use
cases.

I am thinking of something like:

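bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
		struct mem_cgroup *root,
		int priority)
{
	/* Walk from memcg up to, but not including, the reclaim root. */
	do {
		if (memcg == root)
			break;
		/* Strict guarantee: never reclaim at or below low_limit. */
		if (!res_counter_low_limit_excess(&memcg->res))
			return false;
		/* Weak guarantee: honor soft_limit until priority gets desperate. */
		if ((priority >= DEF_PRIORITY - 2) &&
		    !res_counter_soft_limit_excess(&memcg->res))
			return false;
	} while ((memcg = parent_mem_cgroup(memcg)));
	return true;
}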

But this soft_limit,priority extension can be added later.
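
To make the intended behavior concrete, here is a small userspace model
of the same walk.  It is a sketch only: struct cg, its fields and
reclaim_eligible() are hypothetical stand-ins for the memcg and
res_counter calls above, and DEF_PRIORITY is set to 12 to match the
kernel's definition.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12

/* Hypothetical stand-in for a memcg; values are in pages. */
struct cg {
	struct cg *parent;
	unsigned long usage;
	unsigned long low_limit;
	unsigned long soft_limit;
};

/*
 * A group is eligible for reclaim only if it and every ancestor below
 * the reclaim root exceed their low_limit and, while priority is still
 * within two steps of DEF_PRIORITY, also exceed their soft_limit.
 */
static bool reclaim_eligible(const struct cg *cg, const struct cg *root,
			     int priority)
{
	for (; cg != NULL && cg != root; cg = cg->parent) {
		if (cg->usage <= cg->low_limit)
			return false;
		if (priority >= DEF_PRIORITY - 2 &&
		    cg->usage <= cg->soft_limit)
			return false;
	}
	return true;
}
```

In this model a group with usage 500, low_limit 100 and soft_limit 600
is skipped at priority DEF_PRIORITY but becomes eligible once priority
drops below DEF_PRIORITY - 2, which is the "soft until desperate"
behavior described above.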