Message-ID: <20160223205915.GA10744@cmpxchg.org>
Date:	Tue, 23 Feb 2016 12:59:15 -0800
From:	Johannes Weiner <hannes@...xchg.org>
To:	Mel Gorman <mgorman@...hsingularity.net>
Cc:	Linux-MM <linux-mm@...ck.org>, Rik van Riel <riel@...riel.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 00/27] Move LRU page reclaim from zones to nodes v2

On Tue, Feb 23, 2016 at 08:19:32PM +0000, Mel Gorman wrote:
> On Tue, Feb 23, 2016 at 12:04:16PM -0800, Johannes Weiner wrote:
> > On Tue, Feb 23, 2016 at 03:04:23PM +0000, Mel Gorman wrote:
> > > In many benchmarks, there is an obvious difference in the number of
> > > allocations from each zone as the fair zone allocation policy is removed
> > > towards the end of the series. For example, these are the allocation
> > > stats from a blogbench run that showed no difference in headline
> > > performance:
> > > 
> > >                           mmotm-20160209   nodelru-v2
> > > DMA allocs                           0           0
> > > DMA32 allocs                   7218763      608067
> > > Normal allocs                 12701806    18821286
> > > Movable allocs                       0           0
> > 
> > According to the mmotm numbers, your DMA32 zone is over a third of
> > available memory, yet in the nodelru-v2 kernel it sees only 3% of
> > the allocations.
> 
> In this case, yes, but blogbench is not scaled to memory size and is not
> reclaim intensive. If you look, you'll see the overall number of
> allocations is very similar. During that test, there is a small amount
> of kswapd scan activity (but, oddly, no reclaim) at the start of the
> test for nodelru, but that's about it.

Yes, if fairness enforcement is now done by reclaim, then workloads
without reclaim will show skewed placement as the Normal zone is again
filled up first before moving on to the next zone.
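
As a toy illustration of that fill order (userspace C, deliberately
ignoring watermarks, migratetypes and per-cpu lists; the zone sizes
are made up):

/*
 * Toy model, not kernel code: the allocator walks the zonelist from
 * the highest zone down and takes the first zone with free pages, so
 * Normal fills up completely before DMA32 sees a single allocation.
 */
#include <stdio.h>

int main(void)
{
        const char *name[] = { "Normal", "DMA32" };   /* zonelist order */
        long free[] = { 12000, 7000 };                /* made-up sizes  */
        long allocs[] = { 0, 0 };

        for (int i = 0; i < 15000; i++) {
                for (int z = 0; z < 2; z++) {
                        if (free[z] > 0) {
                                free[z]--;
                                allocs[z]++;
                                break;
                        }
                }
        }
        for (int z = 0; z < 2; z++)
                printf("%-6s allocs %ld\n", name[z], allocs[z]);
        return 0;
}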

That is fine. But what about the balance in reclaiming workloads?

> > That's an insanely high level of aging inversion, where
> > the lifetime of a cache entry is again highly dependent on placement.
> > 
> 
> The aging is now independent of what zone the page was allocated from because
> it's node-based LRU reclaim. That may mean that the occupancy of individual
> zones is now different but it should only matter if there is a large number
> of address-limited requests.

The problem is that kswapd will stay awake and continuously draw
subsequent allocations into a single zone, thus utilizing only a
fraction of available memory. DMA32-limited kswapd wakeups can
reclaim cache in DMA32 indefinitely if the allocator continuously
places new cache pages in that zone. It looks like that is what
happened in the stutter benchmark.
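
Sketched as a toy model (again userspace C with made-up numbers, not
anything from the series): once Normal is full, kswapd keeps topping
DMA32 back up above its watermark, and since DMA32 is then the only
zone with free pages, every subsequent cache allocation lands there:

#include <stdio.h>

int main(void)
{
        long normal_free = 0;                  /* Normal already full */
        long dma32_free = 1000;
        const long watermark = 256, batch = 512;
        long normal_allocs = 0, dma32_allocs = 0;

        for (long i = 0; i < 100000; i++) {
                if (dma32_free <= watermark)   /* kswapd stays awake...  */
                        dma32_free += batch;   /* ...evicting DMA32 cache */
                if (normal_free > 0) {
                        normal_free--;
                        normal_allocs++;
                } else {
                        dma32_free--;
                        dma32_allocs++;
                }
        }
        printf("Normal %ld, DMA32 %ld\n", normal_allocs, dma32_allocs);
        return 0;
}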

Sure, it doesn't matter in that benchmark, because the pages are used
only once. But if it had an actual cache workingset bigger than DMA32
but smaller than DMA32+Normal, it would be thrashing unnecessarily.
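
To put made-up numbers on it: with a 4G DMA32 zone, an 8G Normal zone
and a 6G cache workingset, the workingset fits in the 12G node with
room to spare; but if placement keeps targeting DMA32, at most 4G of
it is resident at a time, and with LRU eviction and a cyclic access
pattern every page gets evicted before its next reference.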

If kswapd were truly balancing the pages in a node equally, regardless
of zone placement, then in the long run we should see zone allocations
converge to a share that is in proportion to each zone's size. As far
as I can see, that is not quite happening yet.
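
For reference, taking the fair-policy mmotm numbers as a proxy for
the zone size ratio:

  mmotm-20160209: DMA32 share = 7218763 / (7218763 + 12701806) ~= 36%
  nodelru-v2:     DMA32 share =  608067 / ( 608067 + 18821286) ~=  3%

The totals (19.9M vs 19.4M allocations) agree to within 3%, so it
really is the placement that shifted, by an order of magnitude.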

> > The fact that this doesn't make a performance difference in the
> > specific benchmarks you ran only proves just that: these specific
> > benchmarks don't care. IMO, benchmarking is not enough here. If this
> > is truly supposed to be unproblematic, then I think we need a reasoned
> > explanation. I can't imagine how it possibly could be, though.
> > 
> 
> The basic explanation is that reclaim is on a per-node basis and we
> no longer balance all zones, just one that is necessary to satisfy the
> original request that woke up kswapd.
> 
> > If reclaim can't guarantee a balanced zone utilization then the
> > allocator has to keep doing it. :(
> 
> That's the key issue - the main reason balanced zone utilisation is
> necessary is that we reclaim on a per-zone basis and we must avoid
> page aging anomalies. If we balance such that one eligible zone is above
> the watermark then it's less of a concern.

Yes, but only if there can't be extended reclaim stretches that prefer
the pages of a single zone. Yet it looks like this is still possible.

I wonder whether that would be fixed by dropping patch 7/27? Potentially it
would need a bit more work than that. I.e. could we make kswapd
balance only for the highest classzone in the system, and thus make
address-limited allocations fend for themselves in direct reclaim?
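
Roughly what I mean, as a hypothetical sketch (names invented, not
code from the series):

/*
 * Hypothetical: the waker no longer passes its classzone along, so
 * kswapd always balances for the node's highest populated zone and
 * can't be parked on a lower zone by address-limited requests.
 */
static void wakeup_kswapd(pg_data_t *pgdat, int order)
{
        /* highest_populated_zone_idx() is made up for illustration */
        pgdat->kswapd_classzone_idx = highest_populated_zone_idx(pgdat);
        pgdat->kswapd_order = max(pgdat->kswapd_order, order);
        wake_up_interruptible(&pgdat->kswapd_wait);
}

/* Address-limited allocations that still fail would then go to
 * direct reclaim with their own classzone instead of re-waking
 * kswapd for it. */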

This way, we would avoid that pathological interaction between kswapd
and the allocator, and kswapd would be guaranteed to balance fairly.
