linux-kernel - Re: [patch] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy


Date:	Wed, 11 Dec 2013 20:09:03 -0500
From:	Johannes Weiner <hannes@...xchg.org>
To:	Mel Gorman <mgorman@...e.de>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Dave Hansen <dave.hansen@...el.com>,
	Rik van Riel <riel@...hat.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [patch] mm: page_alloc: exclude unreclaimable allocations from
 zone fairness policy

On Wed, Dec 11, 2013 at 10:47:19PM +0000, Mel Gorman wrote:
> On Wed, Dec 11, 2013 at 01:09:16PM -0500, Johannes Weiner wrote:
> > Dave Hansen noted a regression in a microbenchmark that loops around
> > open() and close() on an 8-node NUMA machine and bisected it down to
> > 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
> > change forces the slab allocations of the file descriptor to spread
> > out to all 8 nodes, causing remote references in the page allocator
> > and slab.
> > 
> 
> The original patch was primarily concerned with the fair aging of LRU pages
> of zones within a node. This patch uses GFP_MOVABLE_MASK which includes
> __GFP_RECLAIMABLE meaning any slab created with SLAB_RECLAIM_ACCOUNT is still
> getting the round-robin treatment. Those pages have a different lifecycle
> to LRU pages and the shrinkers are only node aware, not zone aware.
> While I get this patch probably helps this specific benchmark, was the
> use of GFP_MOVABLE_MASK intentional or did you mean to use __GFP_MOVABLE?

It was intentional to spread SLAB_RECLAIM_ACCOUNT pages across all
allowed nodes evenly for the same aging fairness reason.

> Looking at the original patch again I think I made a major mistake when
> reviewing it. Consider the effect of the following on NUMA machines:
> 
> 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> 					high_zoneidx, nodemask) {
> 		....
> 		if (alloc_flags & ALLOC_WMARK_LOW) {
> 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
> 				continue;
> 			if (zone_reclaim_mode &&
> 			    !zone_local(preferred_zone, zone))
> 				continue;
> 		}
> 
> 
> Enabling zone_reclaim_mode sucks badly for workloads that are not partitioned
> to fit within NUMA nodes. Consequently, I expect the common case is that
> it's disabled by default due to small NUMA distances or manually disabled.
> 
> However, the effect of that block is that we allocate NR_ALLOC_BATCH
> from local zones then fall back to batch allocating remote nodes! I bet
> the numa_hit stats in /proc/vmstat have sucked recently. The original
> problem was because the page allocator would try allocating from the
> highest zone while kswapd reclaimed from it causing LRU-aging problems.
> The problem is not the same between nodes. How do you feel about dropping
> the zone_reclaim_mode check above and only round-robin in batches between
> zones on the local node?

It might not be for anon but it's the same problem for cache.  The
page allocator will fill all the nodes in the system before waking up
the kswapds.  It will utilize all nodes, just not evenly.

I know that on the node level staying local is often preferable over
full memory utilization but I was under the assumption that
zone_reclaim_mode is there to express this preference.

My patch certainly makes this preference more aggressive in the sense
that there is no gray zone anymore.  There is no attempt to stay local.
There is either not using a block of memory at all, or using it to the
same extent as any other block of the same size; that's the
requirement for fair aging.
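
The all-or-nothing behaviour described above follows from the check in
the loop Mel quoted.  A minimal user-space restatement of that
condition is sketched below.  It is not kernel code: toy_skip_zone()
and its parameters are hypothetical names, and it only models the
ALLOC_WMARK_LOW pass shown in the quoted excerpt.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the quoted check: returns true when the
 * fairness pass would skip a zone.
 *
 * alloc_batch       - stand-in for zone_page_state(zone, NR_ALLOC_BATCH)
 * zone_reclaim_mode - stand-in for the sysctl of the same name
 * is_local          - stand-in for zone_local(preferred_zone, zone)
 */
static bool toy_skip_zone(long alloc_batch, bool zone_reclaim_mode,
			  bool is_local)
{
	/* The zone has used up its share of the current round. */
	if (alloc_batch <= 0)
		return true;

	/* With zone_reclaim_mode set, remote zones get no share at all. */
	if (zone_reclaim_mode && !is_local)
		return true;

	/* Otherwise the zone is used like any other zone with batch left. */
	return false;
}
```

With zone_reclaim_mode clear, a remote zone with batch left is never
skipped in this model, so it fills at the same rate as a local one;
with it set, the remote zone is always skipped.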

That being said, the fairness concerns are primarily about file pages.
Should we exclude anon and slab pages entirely?  I'd still account for
them in the batches but only apply placement rules to page cache.
That should still leave us with roughly equal cache aging speeds in
all zones and nodes.
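
One possible reading of that proposal, as a user-space toy model rather
than a kernel patch: every allocation is charged against the batch of
the zone it lands in, but only page cache allocations skip zones whose
batch is used up.  All toy_* names, the struct layout, and the batch
reset on exhaustion are assumptions made for illustration; zones[0]
stands for the most local zone in the zonelist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page classes for this toy model. */
enum toy_page_type { TOY_FILE, TOY_ANON, TOY_SLAB };

/* Hypothetical stand-in for a zone; this is not struct zone. */
struct toy_zone {
	long free_pages;	/* pages still available */
	long batch_size;	/* share per round */
	long alloc_batch;	/* remaining share, like NR_ALLOC_BATCH */
	long allocated;		/* pages handed out so far */
};

/* Start a new round: every zone gets its full share again. */
static void toy_reset_batches(struct toy_zone *zones, size_t nr_zones)
{
	for (size_t i = 0; i < nr_zones; i++)
		zones[i].alloc_batch = zones[i].batch_size;
}

/* Returns the index of the zone used, or -1 if no zone has free pages. */
static int toy_alloc_page(struct toy_zone *zones, size_t nr_zones,
			  enum toy_page_type type)
{
	/* Placement rules apply to page cache only. */
	bool fair = (type == TOY_FILE);

	assert(zones != NULL);

	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < nr_zones; i++) {
			struct toy_zone *z = &zones[i];

			if (z->free_pages <= 0)
				continue;
			if (fair && z->alloc_batch <= 0)
				continue;

			/* Anon and slab are still accounted in the batch. */
			z->free_pages--;
			z->alloc_batch--;
			z->allocated++;
			return (int)i;
		}
		if (!fair)
			break;
		/* All batches exhausted: reset and retry once. */
		toy_reset_batches(zones, nr_zones);
	}
	return -1;
}
```

In this model slab and anon allocations always land in the first zone
with free pages, while file allocations move on to the next zone once
the local batch has been consumed, whoever consumed it.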