Message-Id: <20110818091750.79eea4f5.kamezawa.hiroyu@jp.fujitsu.com>
Date:	Thu, 18 Aug 2011 09:17:50 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"hannes@...xchg.org" <hannes@...xchg.org>,
	"nishimura@....nes.nec.co.jp" <nishimura@....nes.nec.co.jp>
Subject: Re: [PATCH v5 4/6]  memg: calculate numa weight for vmscan

On Wed, 17 Aug 2011 16:34:18 +0200
Michal Hocko <mhocko@...e.cz> wrote:

> Sorry it took so long, but I was quite busy recently.
> 
> On Tue 09-08-11 19:11:00, KAMEZAWA Hiroyuki wrote:
> > calculate node scan weight.
> > 
> > Now, memory cgroup selects a scan target node in round-robin.
> > It's not very good... there is no scheduling based on page usage.
> > 
> > This patch calculates each node's weight for scanning.
> > If a node's weight is high, the node is worth scanning.
> > 
> > The weight is calculated based on the following concept.
> > 
> >    - make use of swappiness.
> >    - If inactive-file is enough, ignore active-file
> >    - If file is enough (w.r.t swappiness), ignore anon
> >    - make use of recent_scan/rotated reclaim stats.
> 
> The concept looks good (see the specific comments below). I would
> appreciate it if the description were more descriptive (especially the
> reclaim statistics part, with the reasoning why it is better).
>  
> > Then, a node containing many inactive file pages will be the first victim.
> > Node selection logic based on this weight will be in the next patch.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
> > ---
> >  mm/memcontrol.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 105 insertions(+), 5 deletions(-)
> > 
> > Index: mmotm-Aug3/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-Aug3.orig/mm/memcontrol.c
> > +++ mmotm-Aug3/mm/memcontrol.c
> [...]
> > @@ -1568,18 +1570,108 @@ static bool test_mem_cgroup_node_reclaim
> >  }
> >  #if MAX_NUMNODES > 1
> >  
> > +static unsigned long
> > +__mem_cgroup_calc_numascan_weight(struct mem_cgroup * memcg,
> > +				int nid,
> > +				unsigned long anon_prio,
> > +				unsigned long file_prio,
> > +				int lru_mask)
> > +{
> > +	u64 file, anon;
> > +	unsigned long weight, mask;
> 
> mask is not used anywhere.
> 
I'll remove this.


> > +	unsigned long rotated[2], scanned[2];
> > +	int zid;
> > +
> > +	scanned[0] = 0;
> > +	scanned[1] = 0;
> > +	rotated[0] = 0;
> > +	rotated[1] = 0;
> > +
> > +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> > +		struct mem_cgroup_per_zone *mz;
> > +
> > +		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
> > +		scanned[0] += mz->reclaim_stat.recent_scanned[0];
> > +		scanned[1] += mz->reclaim_stat.recent_scanned[1];
> > +		rotated[0] += mz->reclaim_stat.recent_rotated[0];
> > +		rotated[1] += mz->reclaim_stat.recent_rotated[1];
> > +	}
> > +	file = mem_cgroup_node_nr_lru_pages(memcg, nid, lru_mask & LRU_ALL_FILE);
> > +
> > +	if (total_swap_pages)
> 
> What about ((lru_mask & LRU_ALL_ANON) && total_swap_pages)?

Ok. will add that.
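
Something like this (just a sketch, untested):

	if ((lru_mask & LRU_ALL_ANON) && total_swap_pages)
		anon = mem_cgroup_node_nr_lru_pages(memcg,
					nid, lru_mask & LRU_ALL_ANON);
	else
		anon = 0;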

> Why should we go down to mem_cgroup_node_nr_lru_pages if we are not
> getting anything?
> 



> > +		anon = mem_cgroup_node_nr_lru_pages(memcg,
> > +					nid, mask & LRU_ALL_ANON);
> 
> btw. s/mask/lru_mask/
> 
yes...

> > +	else
> > +		anon = 0;
> 
> Can be initialized during declaration (makes patch smaller).
> 
Sure.
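
i.e. the declaration becomes:

	u64 file, anon = 0;

and the else branch above goes away.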

> > +	if (!(file + anon))
> > +		node_clear(nid, memcg->scan_nodes);
> 
> In that case we can return with 0 right away.
> 
yes.
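
i.e. (untested):

	if (!(file + anon)) {
		node_clear(nid, memcg->scan_nodes);
		return 0;
	}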



> > +
> > +	/* '(scanned - rotated) / scanned' is the ratio of pages found inactive. */
> > +	anon = anon * (scanned[0] - rotated[0]) / (scanned[0] + 1);
> > +	file = file * (scanned[1] - rotated[1]) / (scanned[1] + 1);
> 
> OK, makes sense. We should not reclaim from nodes that are known to be
> hard to reclaim from. We have to be careful, however, not to exclude
> the node from reclaiming completely.
> 
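
For example, if we recently scanned 1000 anon pages and 900 of them were
rotated back (i.e. found still active), anon's weight is scaled down to
roughly 10%: anon * (1000 - 900) / 1001; the +1 in the denominator avoids
a division by zero when nothing has been scanned yet.
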
> > +
> > +	weight = (anon * anon_prio + file * file_prio) / 200;
> 
> Shouldn't we rather normalize the weight to the node size? This way we
> are punishing bigger nodes, aren't we?
> 

Here, the routine is for reclaiming memory within a memcg in a smooth way,
not for balancing zones; that will be the job of kswapd + memcg (softlimit).
The size of a node, as seen by this memcg, is represented by file + anon.


> > +	return weight;
> > +}
> > +
> > +/*
> > + * Calculate each NUMA node's scan weight. The scan weight is determined
> > + * by the amount of pages, the recent scan ratio, and swappiness.
> > + */
> > +static unsigned long
> > +mem_cgroup_calc_numascan_weight(struct mem_cgroup *memcg)
> > +{
> > +	unsigned long weight, total_weight;
> > +	u64 anon_prio, file_prio, nr_anon, nr_file;
> > +	int lru_mask;
> > +	int nid;
> > +
> > +	anon_prio = mem_cgroup_swappiness(memcg) + 1;
> > +	file_prio = 200 - anon_prio + 1;
> 
> What is the +1 good for? I do not see anon_prio being used as a
> denominator.
> 

weight = (anon * anon_prio + file * file_prio) / 200;

The +1 just guarantees that anon's influence never becomes 0 (e.g. by a
wrong swappiness value set by the user).
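
For example, with swappiness == 0 we get anon_prio == 1 and file_prio == 200,
so anon still contributes a small weight instead of being dropped completely.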


> > +
> > +	lru_mask = BIT(LRU_INACTIVE_FILE);
> > +	if (mem_cgroup_inactive_file_is_low(memcg))
> > +		lru_mask |= BIT(LRU_ACTIVE_FILE);
> > +	/*
> > +	 * In vmscan.c, we'll scan anonymous pages with regard to the
> > +	 * memcg/zone's amounts of file/anon pages, swappiness, and
> > +	 * reclaim_stat. Here, we try to find a good node to be scanned.
> > +	 * If the memcg contains enough file cache, we'll ignore anon's
> > +	 * weight.
> > +	 * (Note) scanning an anon-only node tends to be a waste of time.
> > +	 */
> > +
> > +	nr_file = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
> > +	nr_anon = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
> > +
> > +	/* If file cache is small w.r.t swappiness, check anon page's weight */
> > +	if (nr_file * file_prio >= nr_anon * anon_prio)
> > +		lru_mask |= BIT(LRU_INACTIVE_ANON);
> 
> Why do we not care about active anon (e.g. if inactive anon is low)?
> 
This condition is wrong...

	if (nr_file * file_prio <= nr_anon * anon_prio)
		lru_mask |= BIT(LRU_INACTIVE_ANON);
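
For example, with the default swappiness of 60 (anon_prio == 61,
file_prio == 140), a memcg with 1000 file pages and 3000 anon pages gives
1000 * 140 <= 3000 * 61, so inactive anon is added to the scan mask.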

I was worried about LRU_ACTIVE_ANON. I considered:
  - We can't handle ACTIVE_ANON and INACTIVE_ANON with the same weight,
    but I don't want to add more magic numbers.
  - vmscan.c:shrink_zone() scans ACTIVE_ANON whenever, and only when,
    inactive_anon_is_low() == true, SWAP_CLUSTER_MAX pages per priority.
    It's handled specially.

So, I thought involving the number of ACTIVE_ANON pages in the weight
would be difficult, and ignored ACTIVE_ANON here. Do you have an idea?



> > +
> > +	total_weight = 0;
> 
> Can be initialized during declaration.
> 

will fix.

Thanks,
-Kame
