Message-ID: <20110609172347.GB20333@cmpxchg.org>
Date:	Thu, 9 Jun 2011 19:23:47 +0200
From:	Johannes Weiner <hannes@...xchg.org>
To:	Minchan Kim <minchan.kim@...il.com>
Cc:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Ying Han <yinghan@...gle.com>, Michal Hocko <mhocko@...e.cz>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>, Greg Thelen <gthelen@...gle.com>,
	Michel Lespinasse <walken@...gle.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim

On Fri, Jun 10, 2011 at 12:48:39AM +0900, Minchan Kim wrote:
> On Wed, Jun 01, 2011 at 08:25:13AM +0200, Johannes Weiner wrote:
> > When a memcg hits its hard limit, hierarchical target reclaim is
> > invoked, which goes through all contributing memcgs in the hierarchy
> > below the offending memcg and reclaims from the respective per-memcg
> > lru lists.  This distributes pressure fairly among all involved
> > memcgs, and pages are aged with respect to their list buddies.
> > 
> > When global memory pressure arises, however, all this is dropped
> > overboard.  Pages are reclaimed based on global lru lists that have
> > nothing to do with container-internal age, and some memcgs may be
> > reclaimed from much more than others.
> > 
> > This patch makes traditional global reclaim consider container
> > boundaries and no longer scan the global lru lists.  For each zone
> > scanned, the memcg hierarchy is walked and pages are reclaimed from
> > the per-memcg lru lists of the respective zone.  For now, the
> > hierarchy walk is bounded to one full round-trip through the
> > hierarchy, or until the number of reclaimed pages reaches the
> > overall reclaim target, whichever comes first.
> > 
> > Conceptually, global memory pressure is then treated as if the root
> > memcg had hit its limit.  Since all existing memcgs contribute to the
> > usage of the root memcg, global reclaim is nothing more than target
> > reclaim starting from the root memcg.  The code is mostly the same for
> > both cases, except for a few heuristics and statistics that do not
> > always apply.  They are distinguished by a newly introduced
> > global_reclaim() primitive.
> > 
> > One implication of this change is that pages have to be linked to the
> > lru lists of the root memcg again, which could be optimized away with
> > the old scheme.  The costs are not measurable, though, even with
> > worst-case microbenchmarks.
> > 
> > As global reclaim no longer relies on global lru lists, this change is
> > also in preparation to remove those completely.

[cut diff]

> I haven't looked at all of it yet, and you might change the logic in
> later patches.  But if I understand this patch right, it does
> round-robin reclaim across all memcgs when global memory pressure
> happens.
> 
> Let's consider a case where the memcg sizes are unbalanced.
> 
> If A-memcg has lots of LRU pages, its scan count for reclaim is
> bigger, so the chance of reclaiming pages from it is higher.
> When we reclaim from A-memcg, we can easily reclaim the number of
> pages we want and then break out.  The next reclaim will happen at
> some later point and will start from B-memcg, the successor of the
> A-memcg we reclaimed from successfully before.  But unfortunately
> B-memcg has a small LRU, so its scan count is small, and a small
> memcg's LRU ages faster than a bigger memcg's.  That means a small
> memcg's working set can be evicted more easily than a big memcg's.
> My point is that we should not advance to the next memcg so easily.
> We have to take the memcg LRU size into account.

I may be missing something, but as you said yourself, B has a smaller
scan count than A, so the aging speed should be proportional to the
respective sizes.

The number of pages scanned per iteration is essentially

	number of lru pages in memcg-zone >> priority

so each round we scan fewer pages from B than from A in absolute
terms, but the same fraction of each.

It's the exact same logic we have been applying traditionally to
distribute pressure fairly among zones to equalize their aging speed.

Is that what you meant or are we talking past each other?