Date:	Fri, 21 Oct 2011 12:39:22 +0400
From:	Glauber Costa <glommer@...allels.com>
To:	Ying Han <yinghan@...gle.com>
CC:	Michal Hocko <mhocko@...e.cz>, <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Johannes Weiner <hannes@...xchg.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
	Hugh Dickins <hughd@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Kir Kolyshkin <kir@...allels.com>,
	Pavel Emelianov <xemul@...allels.com>,
	Greg Thelen <gthelen@...gle.com>,
	"pjt@...gle.com" <pjt@...gle.com>, Tim Hockin <thockin@...gle.com>,
	Dave Hansen <dave@...ux.vnet.ibm.com>,
	Paul Menage <paul@...lmenage.org>,
	James Bottomley <James.Bottomley@...senpartnership.com>
Subject: Re: [RFD] Isolated memory cgroups again

On 10/21/2011 03:41 AM, Ying Han wrote:
> On Wed, Oct 19, 2011 at 6:33 PM, Michal Hocko <mhocko@...e.cz> wrote:
>> Hi all,
>> this is a request for discussion (I hope we can touch on this at the
>> memcg meeting during the upcoming KS). I have brought this up earlier
>> this year before LSF (http://thread.gmane.org/gmane.linux.kernel.mm/60464).
>> The patch got much smaller since then thanks to Johannes' excellent
>> memcg naturalization work
>> (http://thread.gmane.org/gmane.linux.kernel.mm/68724), which this is
>> based on.
>> I realize that this will be controversial, but I would like to hear
>> whether it is a strict no-go or whether we can go in that direction
>> (the implementation might differ, of course).
>>
>> The patch is still half-baked, but I guess it should be sufficient to
>> show what I am trying to achieve.
>> The basic idea is that memcgs would get a new attribute (isolated) which
>> would control whether that group should be considered during global
>> reclaim.
>> This means that we could achieve a certain memory isolation for
>> processes in the group from the rest of the system's activity -
>> something that has traditionally been done by mlocking the important
>> parts of memory.
>> That approach, however, has some disadvantages. First of all, it is an
>> all-or-nothing approach: either the memory is important and mlocked, or
>> you have no guarantee that it stays resident.
>> Secondly, it makes the system much more prone to OOM situations.
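>>
>> For reference, the traditional all-or-nothing approach looks like this
>> (a minimal sketch; important_data/important_len stand for the
>> application's hot region):
>>
>>   #include <sys/mman.h>
>>
>>   /* the whole range becomes and stays resident, or the call fails */
>>   if (mlock(important_data, important_len) != 0)
>>           perror("mlock");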
>> Let's consider a case where memory is evictable in theory but you would
>> pay quite a lot to get it back resident (pre-calculated data from a
>> database - e.g. reports). The memory wouldn't be used very often, so it
>> would be the number one candidate for eviction after some time.
>> In such a case we would want something like a clever mlock which would
>> evict that memory only if the cgroup itself gets under memory pressure
>> (e.g. during a peak workload). This is not hard to do if we are not
>> overcommitting memory, but things get tricky otherwise.
>> With isolated memcgs we get exactly such a guarantee, because we would
>> reclaim such memory only from the hard limit reclaim path, or from the
>> soft limit reclaim path if a soft limit is set up.
>>
>> Any thoughts/comments?
>>
>> ---
>> From: Michal Hocko <mhocko@...e.cz>
>> Subject: Implement isolated cgroups
>>
>> This patch adds a new per-cgroup knob (isolated) which controls whether
>> pages charged to the group are considered for global reclaim, or are
>> reclaimed only during soft limit reclaim and under per-cgroup memory
>> pressure.
>>
>> The value can be modified via the GROUP/memory.isolated knob.
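>>
>> For example (a hypothetical session; the mount point of the memory
>> controller may differ on your system):
>>
>>   # mount -t cgroup -o memory none /cgroups
>>   # mkdir /cgroups/important
>>   # echo true > /cgroups/important/memory.isolated
>>   # cat /cgroups/important/memory.isolated
>>   true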
>>
>> The primary idea behind isolated cgroups is better isolation of a group
>> from global system activity. At the moment, memory cgroups are mainly
>> used to throttle the processes in a group by placing a cap on their
>> memory usage. However, memory cgroups do not protect their (charged)
>> memory from being evicted by global reclaim, because all groups are
>> considered during global reclaim.
>>
>> The feature provides an easy way to set up a mission-critical workload
>> in a memory-isolated environment without the necessity of mlock. Thanks
>> to per-cgroup reclaim we can even handle memory usage spikes much more
>> gracefully, because a part of the working set can get reclaimed (rather
>> than the workload getting OOM killed, as it would if mlock had been
>> used). So we can look at the feature as an intelligent mlock (protection
>> from external memory pressure, reclaim on internal pressure).
>>
>> The implementation ignores isolated group status during soft limit
>> reclaim, which means that every isolated group can configure how much
>> memory it is willing to sacrifice under global memory pressure. Groups
>> whose soft limit is unlimited are isolated from global memory pressure
>> completely.
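>>
>> So a group could, for instance, stay isolated from global reclaim while
>> still volunteering some of its memory under global pressure via its
>> soft limit (again a hypothetical session, arbitrary value):
>>
>>   # echo true > /cgroups/important/memory.isolated
>>   # echo 512M > /cgroups/important/memory.soft_limit_in_bytes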
>>
>> Please note that the feature has to be used with caution, because
>> isolated groups shift more reclaim pressure onto non-isolated cgroups.
>>
>> The implementation is really simple: we just hook into shrink_zone and
>> exclude isolated groups when we are doing global reclaim.
>>
>> Signed-off-by: Michal Hocko <mhocko@...e.cz>
>>
>> TODO
>> - consider hierarchies - I am not sure whether we want to allow an
>>   inconsistent isolated status within a hierarchy - probably not
>> - handle the root cgroup
>> - do we want some checks whether the current setting is safe?
>> - is bool sufficient, or do we rather want something like a priority
>>   instead?
>>
>>
>>   include/linux/memcontrol.h |    7 +++++++
>>   mm/memcontrol.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>>   mm/vmscan.c                |    8 +++++++-
>>   3 files changed, 58 insertions(+), 1 deletion(-)
>>
>> Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/memcontrol.c
>> ===================================================================
>> --- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/mm/memcontrol.c
>> +++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/memcontrol.c
>> @@ -258,6 +258,9 @@ struct mem_cgroup {
>>         /* set when res.limit == memsw.limit */
>>         bool            memsw_is_minimum;
>>
>> +       /* is the group isolated from the global memory pressure? */
>> +       bool            isolated;
>> +
>>         /* protect arrays of thresholds */
>>         struct mutex thresholds_lock;
>>
>> @@ -287,6 +290,11 @@ struct mem_cgroup {
>>         spinlock_t pcp_counter_lock;
>>   };
>>
>> +bool mem_cgroup_isolated(struct mem_cgroup *mem)
>> +{
>> +       return mem->isolated;
>> +}
>> +
>>   /* Stuffs for move charges at task migration. */
>>   /*
>>   * Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
>> @@ -4561,6 +4569,37 @@ static int mem_control_numa_stat_open(st
>>   }
>>   #endif /* CONFIG_NUMA */
>>
>> +static int mem_cgroup_isolated_write(struct cgroup *cgrp, struct cftype *cft,
>> +               const char *buffer)
>> +{
>> +       int ret = -EINVAL;
>> +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> +
>> +       if (mem_cgroup_is_root(mem))
>> +               goto out;
>> +
>> +       if (!strcasecmp(buffer, "true"))
>> +               mem->isolated = true;
>> +       else if (!strcasecmp(buffer, "false"))
>> +               mem->isolated = false;
>> +       else
>> +               goto out;
>> +
>> +       ret = 0;
>> +out:
>> +       return ret;
>> +}
>> +
>> +static int mem_cgroup_isolated_read(struct cgroup *cgrp, struct cftype *cft,
>> +               struct seq_file *seq)
>> +{
>> +       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> +
>> +       seq_puts(seq, mem->isolated ? "true" : "false");
>> +
>> +       return 0;
>> +}
>> +
>>   static struct cftype mem_cgroup_files[] = {
>>         {
>>                 .name = "usage_in_bytes",
>> @@ -4624,6 +4663,11 @@ static struct cftype mem_cgroup_files[]
>>                 .unregister_event = mem_cgroup_oom_unregister_event,
>>                 .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>>         },
>> +       {
>> +               .name = "isolated",
>> +               .write_string = mem_cgroup_isolated_write,
>> +               .read_seq_string = mem_cgroup_isolated_read,
>> +       },
>>   #ifdef CONFIG_NUMA
>>         {
>>                 .name = "numa_stat",
>> Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/include/linux/memcontrol.h
>> ===================================================================
>> --- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/include/linux/memcontrol.h
>> +++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/include/linux/memcontrol.h
>> @@ -165,6 +165,9 @@ void mem_cgroup_split_huge_fixup(struct
>>   bool mem_cgroup_bad_page_check(struct page *page);
>>   void mem_cgroup_print_bad_page(struct page *page);
>>   #endif
>> +
>> +bool mem_cgroup_isolated(struct mem_cgroup *mem);
>> +
>>   #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>>   struct mem_cgroup;
>>
>> @@ -382,6 +385,10 @@ static inline
>>   void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
>>   {
>>   }
>> +static inline bool mem_cgroup_isolated(struct mem_cgroup *mem)
>> +{
>> +       return false;
>> +}
>>   #endif /* CONFIG_CGROUP_MEM_CONT */
>>
>>   #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
>> Index: linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/vmscan.c
>> ===================================================================
>> --- linux-3.1-rc4-next-20110831-mmotm-isolated-memcg.orig/mm/vmscan.c
>> +++ linux-3.1-rc4-next-20110831-mmotm-isolated-memcg/mm/vmscan.c
>> @@ -2109,7 +2109,13 @@ static void shrink_zone(int priority, st
>>                         .zone = zone,
>>                 };
>>
>> -               shrink_mem_cgroup_zone(priority, &mz, sc);
>> +               /*
>> +                * Do not reclaim from an isolated group if we are in
>> +                * the global reclaim.
>> +                */
>> +               if (!(mem_cgroup_isolated(mem) && global_reclaim(sc)))
>> +                       shrink_mem_cgroup_zone(priority, &mz, sc);
>> +
>>                 /*
>>                  * Limit reclaim has historically picked one memcg and
>>                  * scanned it with decreasing priority levels until
>> --
>> Michal Hocko
>> SUSE Labs
>> SUSE LINUX s.r.o.
>> Lihovarska 1060/12
>> 190 00 Praha 9
>> Czech Republic
>>
>
> Hi Michal:
>
> I didn't read through the patch itself, only the description. If we
> want to protect a memcg from being reclaimed under global memory
> pressure, I think we can approach it by making changes to soft_limit
> reclaim.
>
> I have a soft_limit change built on top of Johannes's patchset which
> basically does soft_limit-aware reclaim under global memory pressure.
> The implementation is simple, and I am looking forward to discussing
> it more with you guys at the conference.
>
> --Ying
I don't think soft limits will help his case, if I now understand it
correctly. Global reclaim can be triggered regardless of any soft
limits we may set.

Now, there are two things I still don't like about it:
* The definition of a "main workload", "main cgroup", or anything like
that. I'd prefer to rank groups according to some parameter, something
akin to swappiness. That would allow other people to use the feature in
a different way, while still letting you reach your goals through
parameter settings (i.e. one cgroup has a high reclaim-protection
value, all others a much lower one); see the sketch after this list.

* The fact that you seem to want to *skip* reclaim altogether for a
cgroup. That's a dangerous condition, IMHO. What I think we should try
to achieve is "skip it for practical purposes on sane workloads":
a parameter that, when set to a very high mark, effectively disallows
reclaim for a cgroup under most sane circumstances.
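
Something along these lines, perhaps (a rough, untested sketch on top
of Michal's patch; mem->reclaim_protection is a hypothetical
0..DEF_PRIORITY knob that does not exist in the patch above):

/*
 * Higher reclaim_protection means more protected. Global reclaim
 * starts at DEF_PRIORITY (12) and walks down towards 0 as it gets more
 * desperate, so a protected group is skipped only while milder
 * priority levels remain to be tried on the rest of the system.
 */
static inline bool mem_cgroup_skip_global_reclaim(struct mem_cgroup *mem,
						  int priority)
{
	return priority > DEF_PRIORITY - mem->reclaim_protection;
}

and in shrink_zone():

	if (!(global_reclaim(sc) &&
	      mem_cgroup_skip_global_reclaim(mem, priority)))
		shrink_mem_cgroup_zone(priority, &mz, sc);

With reclaim_protection == DEF_PRIORITY the group is only scanned at
priority 0, i.e. just before we would otherwise be heading for OOM;
with reclaim_protection == 0 it behaves like any other group.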

What do you think of the above, Michal?
