Message-ID: <20100302112448.GJ3852@csn.ul.ie>
Date: Tue, 2 Mar 2010 11:24:48 +0000
From: Mel Gorman <mel@....ul.ie>
To: Nick Piggin <npiggin@...e.de>
Cc: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
epasch@...ibm.com, SCHILLIG@...ibm.com,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
christof.schmitt@...ibm.com, thoss@...ibm.com, hare@...e.de,
gregkh@...ell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone)
due to "e084b - page-allocator: preserve PFN ordering when
__GFP_COLD is set"

On Tue, Mar 02, 2010 at 10:18:27PM +1100, Nick Piggin wrote:
> On Tue, Mar 02, 2010 at 11:01:50AM +0000, Mel Gorman wrote:
> > On Tue, Mar 02, 2010 at 09:36:46PM +1100, Nick Piggin wrote:
> > > On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
> > > > On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> > > > > The zone pressure waitqueue patch makes sense.
> > > >
> > > > I've just started the rebase and am considering what sort of test is
> > > > best for it.
> > > >
> > > > > We may even want to make
> > > > > it more strictly FIFO (eg. check upfront if there are waiters on the
> > > > > queue before allocating a page, and if yes then add ourself to the back
> > > > > of the waitqueue).
> > > >
> > > > To be really strict about this, we'd have to check in the hot path of
> > > > the per-cpu allocator, which would be undesirable.
> > >
> > > Yes, but it would also be desirable for other reasons (eliminating
> > > all unnecessary latency here). Which is why I say it is a tradeoff
> > > but it might be worthwhile.
> > >
> > > We obviously wouldn't load the waitqueue itself in the fastpath, but
> > > probably some field in a cacheline we already touch.
> > >
> >
> > I'm struggling to justify it in my head. Granted, right now the pressure_wq is
> > located beside the wait_table as a "rarely-used" field. It could be located
> > before free_area[] so it would be pulled into cache anyway, which would have
> > some impact on the high-order lists but probably not too bad. That would make
> > accessing pressure_wq a little cheaper.
>
> No, you would mark it in another field in the slowpath.
>
i.e. a flag in zone->flags that is set if something is waiting on the queue?
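Something like this, maybe (a rough sketch only; ZONE_PRESSURE_WAITERS is an
invented bit and pressure_wq is the waitqueue from the earlier patch, on top
of linux/wait.h and linux/mmzone.h):

/*
 * Sketch: zone->flags is read by the allocator anyway, so testing one
 * more bit in it should not pull in an extra cache line.
 */
static inline int zone_has_pressure_waiters(struct zone *zone)
{
        return test_bit(ZONE_PRESSURE_WAITERS, &zone->flags);
}

/* Set by a task before it sleeps on zone->pressure_wq */
static inline void zone_set_pressure_waiters(struct zone *zone)
{
        set_bit(ZONE_PRESSURE_WAITERS, &zone->flags);
}

/* Cleared once the queue has drained */
static inline void zone_clear_pressure_waiters(struct zone *zone)
{
        clear_bit(ZONE_PRESSURE_WAITERS, &zone->flags);
}
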
> It would be justified if the aggregate cost of the work is outweighed
> by the reduction in unnecessary wait time in the congestion code, plus
> some unquantifiable goodness from the extra fairness.
>
> Obviously that depends on the workload, but so does every change we
> do like this.
>
True. I'll sketch out what needs to happen along these lines and see
what it looks like.
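As a first pass, the arrival side might look something like the following.
Everything except zone_watermark_ok(), low_wmark_pages() and the standard
waitqueue calls is a placeholder name:

static void zone_pressure_fifo_wait(struct zone *zone, unsigned int order)
{
        DEFINE_WAIT(wait);

        /* Let the fast path see that tasks are queued (flag from above) */
        zone_set_pressure_waiters(zone);

        for (;;) {
                /*
                 * prepare_to_wait_exclusive() queues at the tail of the
                 * waitqueue, so tasks are released in arrival order,
                 * which is the FIFO behaviour being discussed.
                 */
                prepare_to_wait_exclusive(&zone->pressure_wq, &wait,
                                          TASK_UNINTERRUPTIBLE);
                if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
                                      0, 0))
                        break;
                /* The timeout bounds the damage from a lost wakeup */
                schedule_timeout(HZ/10);
        }
        finish_wait(&zone->pressure_wq, &wait);

        /* Racy, but the timeout above limits how long it can matter */
        if (!waitqueue_active(&zone->pressure_wq))
                zone_clear_pressure_waiters(zone);
}
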
> > The impact of *not* checking is unfairness. A late-arriving process makes
> > forward progress while others sleep on a queue waiting for a reclaim pass to
> > finish and the queue to be woken up. At worst, processes continue sleeping for
> > the entire timeout because late processes kept the zone below the watermark.
> >
> > Is a little unfairness in a path concerned with heavy memory pressure
> > enough to justify checking on every page allocation?
>
> Well, we've had patches that do a lot more than that in the allocator
> that most people don't really use :) (and they've had relatively small
> impact).
>
> But no, it adds at least another branch, icache cost, and at least one
> more memory bit in the fastpath. Obviously that's a drawback.
>
Of those, the branch would be the greatest concern, but it's also a very
unlikely branch.
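With the helper above, the fast-path check would boil down to something like:

        /* One extra test in the hot path, taken only under pressure */
        if (unlikely(zone_has_pressure_waiters(zone)))
                goto slowpath;

where the slowpath joins the back of the queue via the FIFO wait above.
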
> > > > We could check further in the slow path, but I bet it'd be very rare
> > > > that the logic would be triggered. For a process to enter the FIFO due
> > > > to waiters that were not yet woken up, a) the system would have to be
> > > > under heavy memory pressure, b) reclaim would have to be taking such a
> > > > long time that check_zone_pressure() is not called in time and c) a
> > > > process would have to be exiting or otherwise freeing memory such that
> > > > the watermarks are cleared without reclaim being involved.
> > >
> > > I don't think it would be too rare. Things can get freed up and
> > > other allocations come in while reclaim is happening. But anyway
> > > the nasty thing about the "rare" events is that they do add a
> > > rare source of unexpected latency or starvation.
> > >
> >
> > If processes are asleep on the waitqueue, reclaim must be active (by kswapd
> > if nothing else). If pages are getting freed above the necessary watermark,
> > then the processes will be woken up when the current shrink_zone() finishes,
> > unless unfair processes are keeping the zone below the watermarks. But unless
> > reclaim is taking an extraordinarily long time, there would be little
> > difference between waking the queue in the free path and waking it in the
> > reclaim path.
>
> Reclaim can take quite a while, yes.
>
>
> > Again, is a little unfairness under heavy memory pressure enough to
> > alter the main paths?
>
> Well you didn't let me answer yet :) See above.
>
>
> > > I'm not saying necessarily that it will make a noticable improvement
> > > to throughput for this test case, but that now that we have the
> > > waitqueue there, we should think about exactly how far we want to
> > > go with it.
> >
> > Fair point, but I'd sooner look at the other places the VM waits on timeouts
> > and see whether they can be converted to waiting on events. Just in case,
> > after these tests complete (and assuming they generate usable figures), I'll
> > prototype some of the suggestions and see what the impact is.
>
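
For completeness, the wakeup side in the free path could be kept cheap with
waitqueue_active(). Again a sketch only; free_pages_check_pressure() is an
invented name:

static void free_pages_check_pressure(struct zone *zone)
{
        /* The common case of no waiters is a single test */
        if (!waitqueue_active(&zone->pressure_wq))
                return;

        if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
                wake_up(&zone->pressure_wq);
}

wake_up() releases one exclusive waiter at a time, which preserves the FIFO
ordering of the sleepers.
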
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab