linux-kernel - Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

Message-ID: <20130802062208.GP715@cmpxchg.org>
Date:	Fri, 2 Aug 2013 02:22:08 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	Minchan Kim <minchan@...nel.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	Andrea Arcangeli <aarcange@...hat.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [patch 3/3] mm: page_alloc: fair zone allocator policy

On Thu, Aug 01, 2013 at 11:56:36AM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Fri, Jul 19, 2013 at 04:55:25PM -0400, Johannes Weiner wrote:
> > Each zone that holds userspace pages of one workload must be aged at a
> > speed proportional to the zone size.  Otherwise, the time an
> > individual page gets to stay in memory depends on the zone it happened
> > to be allocated in.  Asymmetry in the zone aging creates rather
> > unpredictable aging behavior and results in the wrong pages being
> > reclaimed, activated etc.
> > 
> > But exactly this happens right now because of the way the page
> > allocator and kswapd interact.  The page allocator uses per-node lists
> > of all zones in the system, ordered by preference, when allocating a
> > new page.  When the first iteration does not yield any results, kswapd
> > is woken up and the allocator retries.  Due to the way kswapd reclaims
> > zones below the high watermark while a zone can be allocated from when
> > it is above the low watermark, the allocator may keep kswapd running
> > while kswapd reclaim ensures that the page allocator can keep
> > allocating from the first zone in the zonelist for extended periods of
> > time.  Meanwhile the other zones rarely see new allocations and thus
> > get aged much slower in comparison.
> > 
> > The result is that the occasional page placed in lower zones gets
> > relatively more time in memory, and even gets promoted to the active list
> > after its peers have long been evicted.  Meanwhile, the bulk of the
> > working set may be thrashing on the preferred zone even though there
> > may be significant amounts of memory available in the lower zones.
> > 
> > Even the most basic test -- repeatedly reading a file slightly bigger
> > than memory -- shows how broken the zone aging is.  In this scenario,
> > no single page should be able to stay in memory long enough to get
> > referenced twice and activated, but activation happens in spades:
> > 
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 0
> >       nr_active_file 8
> >       nr_inactive_file 1582
> >       nr_active_file 11994
> >   $ cat data data data data >/dev/null
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 70
> >       nr_inactive_file 258753
> >       nr_active_file 443214
> >       nr_inactive_file 149793
> >       nr_active_file 12021
> > 
> > Fix this with a very simple round robin allocator.  Each zone is
> > allowed a batch of allocations that is proportional to the zone's
> > size, after which it is treated as full.  The batch counters are reset
> > when all zones have been tried and the allocator enters the slowpath
> > and kicks off kswapd reclaim:
> > 
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 174
> >       nr_active_file 4865
> >       nr_inactive_file 53
> >       nr_active_file 860
> >   $ cat data data data data >/dev/null
> >   $ grep active_file /proc/zoneinfo
> >       nr_inactive_file 0
> >       nr_active_file 0
> >       nr_inactive_file 666622
> >       nr_active_file 4988
> >       nr_inactive_file 190969
> >       nr_active_file 937
> 
> First of all, I really appreciate your great work!
> It's amazing, and I saw that Zlatko showed it improves real workloads.
> Thanks Zlatko, too!
> 
> So, I don't want to prevent merging but I think at least, we should
> discuss some issues.
> 
> The concern I have is that it could aggravate low-memory pinning
> problems such as mlock.  I don't actually have a workload that pins
> lots of pages, but that is why we introduced lowmem_reserve_ratio,
> as you well know, so we should at least cover this issue.

We are not actually using the lower zones more in terms of number of
pages at any given time, we are just using them more in terms of
allocation and reclaim rate.

Consider how the page allocator works: it will first fill up the
preferred zone, then the next best zone, etc. and then when all of
them are full, it will wake up kswapd, which will only reclaim them
back to the high watermark + balance gap.
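
To make the interaction concrete, here is a minimal userspace sketch of
the pre-patch behavior described in the changelog quoted above.
Everything in it is an assumption made for illustration: struct wm_zone,
struct wm_node, kswapd_step() and alloc_page_first_fit() are
hypothetical names, not kernel code, and kswapd is modeled as freeing a
fixed number of pages per allocation attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a zone; not the kernel's struct zone. */
struct wm_zone {
	long free;       /* free pages right now */
	long low;        /* allocation succeeds only while free > low */
	long target;     /* stand-in for high watermark + balance gap */
	long allocated;  /* allocations served; a proxy for aging speed */
};

/* Hypothetical model of a node: a zonelist in preference order. */
struct wm_node {
	struct wm_zone *zones;
	size_t nr_zones;
	bool kswapd_running;
};

/* One reclaim step: free up to `rate` pages in each zone below target. */
static void kswapd_step(struct wm_node *n, long rate)
{
	bool balanced = true;

	for (size_t i = 0; i < n->nr_zones; i++) {
		struct wm_zone *z = &n->zones[i];

		if (z->free < z->target) {
			z->free += rate;
			if (z->free > z->target)
				z->free = z->target;
		}
		if (z->free < z->target)
			balanced = false;
	}
	if (balanced)
		n->kswapd_running = false;  /* all zones balanced: sleep */
}

/* Take the first zone above its low mark; returns its index. */
static size_t alloc_page_first_fit(struct wm_node *n, long rate)
{
	assert(n->nr_zones > 0 && rate > 0);

	if (n->kswapd_running)
		kswapd_step(n, rate);   /* reclaim runs alongside allocation */

	for (int attempt = 0; attempt < 2; attempt++) {
		for (size_t i = 0; i < n->nr_zones; i++) {
			struct wm_zone *z = &n->zones[i];

			if (z->free > z->low) {
				z->free--;
				z->allocated++;
				return i;
			}
		}
		n->kswapd_running = true;   /* every zone is at its low mark */
		kswapd_step(n, rate);
	}
	return n->nr_zones;             /* model gives up; no direct reclaim */
}
```

With two zones and a reclaim rate of one page per allocation, the first
zone hovers just above its low mark, kswapd never finishes, and the
second zone stops receiving allocations once it has been drained the
first time.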

After my change, all zones will be at the high watermarks + balance
gap during idle and between the low and high marks under load.  The
change is marginal.
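
The batch policy from the changelog can be sketched the same way.  This
is only a userspace model under stated assumptions, not the patch
itself: struct model_zone, reset_batches() and alloc_page_fair() are
hypothetical names, and the batch is assumed to be the zone size
divided by an arbitrary divisor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a zone; not the kernel's struct zone. */
struct model_zone {
	long size;       /* pages managed by the zone */
	long batch;      /* allocations left in the current round */
	long allocated;  /* allocations served; a proxy for aging speed */
};

/* Refill every zone's batch in proportion to its size. */
static void reset_batches(struct model_zone *zones, size_t nr, long divisor)
{
	for (size_t i = 0; i < nr; i++)
		zones[i].batch = zones[i].size / divisor;
}

/*
 * Walk the zones in preference order and skip any zone whose batch is
 * used up.  When all batches are exhausted, reset them once (the
 * model's stand-in for entering the slowpath) and retry.
 * Returns the index of the zone that served the allocation, or nr if
 * every batch is still zero after the reset.
 */
static size_t alloc_page_fair(struct model_zone *zones, size_t nr,
			      long divisor)
{
	assert(nr > 0 && divisor > 0);

	for (int attempt = 0; attempt < 2; attempt++) {
		for (size_t i = 0; i < nr; i++) {
			if (zones[i].batch > 0) {
				zones[i].batch--;
				zones[i].allocated++;
				return i;
			}
		}
		reset_batches(zones, nr, divisor);
	}
	return nr;
}
```

With zones of 3000 and 1000 pages and a divisor of 100, every round
hands out 30 and 10 allocations respectively, so both zones see
allocations in proportion to their size.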

The lowmem reserve makes sure, before and after, that a sizable
portion of low memory is reserved for non-user allocations.

> My other concern is the added overhead in the fast path.
> Sometimes we are reluctant to add even a simple "if" condition
> to the fast path, but you are adding an atomic op whenever a page is
> allocated, and entering the slowpath whenever the batch count of every
> given zone is zero.
> Yes, it is not really a slow path, because it can return to normal
> operation without calling any significantly slow functions, by
> resetting the batch counts in prepare_slowpath.

Yes, no direct reclaim should occur.  I know that it comes at a cost,
but the alternative is broken page reclaim, page cache thrashing etc.
I don't see a way around it...
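
For a rough idea of the cost being discussed, the per-allocation work
amounts to something like the C11 sketch below.  The names struct
batch_zone, zone_charge() and refill_batches() are hypothetical, as is
the size/divisor batch formula; the real patch uses the kernel's own
counters rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone counter; not the kernel's data structure. */
struct batch_zone {
	long size;          /* pages managed by the zone */
	atomic_long batch;  /* allocations left before the zone is "full" */
};

/* Fast path: one atomic load to test, one atomic sub to charge. */
static bool zone_charge(struct batch_zone *z)
{
	/* Racing callers may push the counter below zero, hence "<= 0". */
	if (atomic_load_explicit(&z->batch, memory_order_relaxed) <= 0)
		return false;   /* treated as full, try the next zone */
	atomic_fetch_sub_explicit(&z->batch, 1, memory_order_relaxed);
	return true;
}

/* Slowpath entry: refill the counters; no reclaim is done here. */
static void refill_batches(struct batch_zone *zones, size_t nr,
			   long divisor)
{
	assert(divisor > 0);

	for (size_t i = 0; i < nr; i++)
		atomic_store_explicit(&zones[i].batch,
				      zones[i].size / divisor,
				      memory_order_relaxed);
}
```

The refill is a plain loop over the zones, which is why exhausting all
batches does not by itself imply direct reclaim.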

Thanks for your input, Minchan!