linux-kernel - Re: [patch 0/8] mm: memcg naturalization -rc2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BANLkTincHpoay1JtpjG0RY9CCvfepRohTXUH6KKULYJ9jbdo+A@mail.gmail.com>
Date:	Wed, 8 Jun 2011 20:52:03 -0700
From:	Ying Han <yinghan@...gle.com>
To:	Johannes Weiner <hannes@...xchg.org>
Cc:	Hiroyuki Kamezawa <kamezawa.hiroyuki@...il.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Michal Hocko <mhocko@...e.cz>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Minchan Kim <minchan.kim@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>, Greg Thelen <gthelen@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [patch 0/8] mm: memcg naturalization -rc2

On Wed, Jun 8, 2011 at 8:32 AM, Johannes Weiner <hannes@...xchg.org> wrote:
> On Tue, Jun 07, 2011 at 08:53:21PM -0700, Ying Han wrote:
>> On Thu, Jun 2, 2011 at 10:51 AM, Johannes Weiner <hannes@...xchg.org> wrote:
>> >
>> > On Thu, Jun 02, 2011 at 08:51:39AM -0700, Ying Han wrote:
>> > > However on this patchset, we changed that design and doing
>> > > hierarchy_walk of the memcg tree. Can we clarify more on why we made
>> > > the design change? I can see the current design provides a efficient
>> > > way to pick the one memcg over-their-soft-limit under shrink_zone().
>> >
>> > The question is whether we even want it to work that way.  I outlined
>> > that in the changelog of the soft limit rework patch.
>> >
>> > As I see it, the soft limit should not exist solely to punish a memcg,
>> > but to prioritize memcgs in case hierarchical pressure exists.  I am
>> > arguing that the focus should be on relieving the pressure, rather
>> > than beating the living crap out of the single-biggest offender.  Keep
>> > in mind the scenarios where the biggest offender has a lot of dirty,
>> > hard-to-reclaim pages while there are other, unsoftlimited groups that
>> > have large amounts of easily reclaimable cache of questionable future
>> > value.  I believe only going for soft-limit excessors is too extreme,
>> > only for the single-biggest one outright nuts.
>> >
>> > The second point I made last time already is that there is no
>> > hierarchy support with that current scheme.  If you have a group with
>> > two subgroups, it makes sense to soft limit one subgroup against the
>> > other when the parent hits its limit.  This is not possible otherwise.
>> >
>> > The third point was that the amount of code to actually support the
>> > questionable behaviour of picking the biggest offender is gigantic
>> > compared to naturally hooking soft limit reclaim into regular reclaim.
>>
>> Ok, thank you for detailed clarification. After reading through the
>> patchset more closely, I do agree that it makes
>> better integration of memcg reclaim to the other part of vm reclaim
>> code. So I don't have objection at this point to
>> proceed w/ this direction. However, three of my concerns still remains:
>>
>> 1.  Whether or not we introduced extra overhead for each shrink_zone()
>> under global memory pressure. We used to have quick
>> access of memcgs to reclaim from who has pages charged on the zone.
>> Now we need to do hierarchy_walk for all memcgs on the system. This
>> requires more testing and more data results would be helpful
>
> That's a nice description for "we went ahead and reclaimed pages from
> a zone without any regard for memory control groups" ;-)
>
> But OTOH I agree with you of course, we may well have to visit a
> number of memcgs before finding any that have memory allocated from
> the zone we are trying to reclaim from.
>
>> 2. The way we treat the per-memcg soft_limit is changed in this patch.
>> The same comment I made on the following patch where we shouldn't
>> change the definition of user API (soft_limit_in_bytes in this case).
>> So I attached the patch to fix that where we should only go to the
>> ones under their soft_limit above certain reclaim priority. Please
>> consider.
>
> Here is your proposal from the other mail:
>
> : Basically, we shouldn't reclaim from a memcg under its soft_limit
> : unless we have trouble reclaim pages from others. Something like the
> : following makes better sense:
> :
> : diff --git a/mm/vmscan.c b/mm/vmscan.c
> : index bdc2fd3..b82ba8c 100644
> : --- a/mm/vmscan.c
> : +++ b/mm/vmscan.c
> : @@ -1989,6 +1989,8 @@ restart:
> :         throttle_vm_writeout(sc->gfp_mask);
> :  }
> :
> : +#define MEMCG_SOFTLIMIT_RECLAIM_PRIORITY       2
> : +
> :  static void shrink_zone(int priority, struct zone *zone,
> :                                 struct scan_control *sc)
> :  {
> : @@ -2001,13 +2003,13 @@ static void shrink_zone(int priority, struct zone *zone,
> :                 unsigned long reclaimed = sc->nr_reclaimed;
> :                 unsigned long scanned = sc->nr_scanned;
> :                 unsigned long nr_reclaimed;
> : -               int epriority = priority;
> :
> : -               if (mem_cgroup_soft_limit_exceeded(root, mem))
> : -                       epriority -= 1;
> : +               if (!mem_cgroup_soft_limit_exceeded(root, mem) &&
> : +                               priority > MEMCG_SOFTLIMIT_RECLAIM_PRIORITY)
> : +                       continue;
>
> I am not sure if you are serious or playing devil's advocate here,
> because it exacerbates the problem you are concerned about in 1. by
> orders of magnitude.

No, the two are different issues. The first one is a performance
concern of detailed implementation, while the second one is a design
concern. On the second, I would like us to not changing kernel API
spec (soft_limit) in this case :)

> Starting priority is 12.  If you have no groups over soft limit, you
> iterate the whole hierarchy 10 times before you even begin to think of
> reclaiming something.

I agree that the patch I posted might make the performance issue even
bigger, which we need to look into next. But I just want to demo my
understanding of reclaim based on soft_limit.

>
> I guess it would make much more sense to evaluate if reclaiming from
> memcgs while there are others exceeding their soft limit is even a
> problem.  Otherwise this discussion is pretty pointless.

AFAIK it is a problem since it changes the spec of kernel API
memory.soft_limit_in_bytes. That value is set per-memcg which all the
pages allocated above that are best effort and targeted to reclaim
prior to others.

>
>> 3. Please break this patchset into different patchsets. One way to
>> break it could be:
>
> Yes, that makes a ton of sense.  Kame suggested the same thing, there
> are too much goals in this series.
>
>> a) code which is less relevant to this effort and should be merged
>> first early regardless
>> b) code added in vm reclaim supporting the following changes
>> c) rework soft limit reclaim
>
> I dropped that for now..

Ok, we can make the soft_limit reclaim into a seperate patch including
cleaning up the current implementation. I can probably pick that up
after we agreed how to do global reclaim based on soft_limit (last
comment)

>
>> d) make per-memcg lru lists exclusive
>
> ..and focus on this one instead.

>
>> I should have the patch posted soon which breaks the zone->lru lock
>> for memcg reclaim. That patch should come after everything listed
>> above.
>
> Yeah, the lru lock fits perfectly into struct lruvec.
>
I posted that patch today and please take a look when you get chance.
That patch relies on the
d), so i will wait till the previous patches merged.

--Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/