Date:	Wed, 10 Mar 2010 02:03:32 +1100
From:	Nick Piggin <npiggin@...e.de>
To:	Mel Gorman <mel@....ul.ie>
Cc:	linux-mm@...ck.org,
	Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>,
	Chris Mason <chris.mason@...cle.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] page-allocator: Under memory pressure, wait on
 pressure to relieve instead of congestion

On Tue, Mar 09, 2010 at 02:17:13PM +0000, Mel Gorman wrote:
> On Wed, Mar 10, 2010 at 12:35:13AM +1100, Nick Piggin wrote:
> > On Mon, Mar 08, 2010 at 11:48:21AM +0000, Mel Gorman wrote:
> > > Under heavy memory pressure, the page allocator may call congestion_wait()
> > > to wait for IO congestion to clear or for a timeout to expire. This is not
> > > as sensible a choice as it first appears. There is no guarantee that
> > > BLK_RW_ASYNC is even congested, as the pressure could have been due to a
> > > large number of SYNC reads, in which case the allocator waits out the
> > > entire timeout, possibly uselessly.
> > > 
> > > At the point of congestion_wait(), the allocator is struggling to get the
> > > pages it needs and it should back off. This patch puts the allocator to sleep
> > > on a zone->pressure_wq for either a timeout or until a direct reclaimer or
> > > kswapd brings the zone over the low watermark, whichever happens first.
> > > 
> > > Signed-off-by: Mel Gorman <mel@....ul.ie>
> > > ---
> > >  include/linux/mmzone.h |    3 ++
> > >  mm/internal.h          |    4 +++
> > >  mm/mmzone.c            |   47 +++++++++++++++++++++++++++++++++++++++++++++
> > >  mm/page_alloc.c        |   50 +++++++++++++++++++++++++++++++++++++++++++----
> > >  mm/vmscan.c            |    2 +
> > >  5 files changed, 101 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 30fe668..72465c1 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -398,6 +398,9 @@ struct zone {
> > >  	unsigned long		wait_table_hash_nr_entries;
> > >  	unsigned long		wait_table_bits;
> > >  
> > > +	/* queue for processes waiting for pressure to relieve */
> > > +	wait_queue_head_t	*pressure_wq;
> > 
> > Hmm, processes may be eligible to allocate from > 1 zone, but you
> > have them only waiting for one. I wonder if we shouldn't wait for
> > more zones?
> > 
> 
> It's waiting for the zone that is most desirable. If that zone's watermarks
> are met, why would it wait on any other zone?

I mean the other way around. If that zone's watermarks are not met, then
why shouldn't it be woken up when other zones reach their watermarks?


> If you mean that it would
> wait for any of the eligible zones to meet their watermark, it might have
> an impact on NUMA locality but it could be managed. It might make sense to
> wait on a node-based queue rather than a per-zone one if this behaviour
> were desirable.
> 
> > Congestion waiting uses a global waitqueue, which hasn't seemed to
> > cause a big scalability problem. Would it be better to have a global
> > waitqueue for this too?
> > 
> 
> Considering that the congestion wait queue is for a relatively slow operation,
> it would be surprising if lock scalability were noticeable. The pressure_wq,
> however, potentially involves no IO, so scalability may be noticeable there.
> 
> What would the advantages of a global waitqueue be? Obviously, a smaller
> memory footprint. A second potential advantage is that on wakeup, it
> could check the watermarks on multiple zones, which might reduce
> latencies in some cases. Can you think of more compelling reasons?

Your 2nd advantage is what I mean above.
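
To make that concrete, something like the following is roughly what I had
in mind (untested sketch; the single global queue and the name watermark_wq
are just for illustration):

	static DECLARE_WAIT_QUEUE_HEAD(watermark_wq);

	void check_zone_pressure(struct zone *zone)
	{
		/* If no process is waiting, nothing to do */
		if (!waitqueue_active(&watermark_wq))
			return;

		/* Any zone coming back over its low watermark wakes all waiters */
		if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
			wake_up_interruptible(&watermark_wq);
	}

A woken allocator would then re-walk its zonelist, so it benefits from
whichever eligible zone recovered first rather than only the preferred one.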


> > 
> > > +void check_zone_pressure(struct zone *zone)
> > 
> > I don't really like the name pressure. We use that term for reclaim
> > pressure, whereas we're just checking watermarks here (actual pressure
> > could be anything).
> > 
> 
> pressure_wq => watermark_wq
> check_zone_pressure => check_watermark_wq
> 
> ?

Thanks.

> 
> > 
> > > +{
> > > +	/* If no process is waiting, nothing to do */
> > > +	if (!waitqueue_active(zone->pressure_wq))
> > > +		return;
> > > +
> > > +	/* Check if the low watermark is ok for order 0 */
> > > +	if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
> > > +		wake_up_interruptible(zone->pressure_wq);
> > > +}
> > 
> > If you were to do this under the zone lock (in your subsequent patch),
> > then it could avoid races. I would suggest doing it all as a single
> > patch and not doing the pressure checks in reclaim at all.
> > 
> 
> That is reasonable. I've already dropped the checks in reclaim because as you
> say, if the free path check is cheap enough, it's also sufficient. Checking
> in the reclaim paths as well is redundant.
> 
> I'll move the call to check_zone_pressure() within the zone lock to avoid
> races.
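
Right, something along those lines should do it (untested sketch of the
free path, still under zone->lock):

	spin_lock(&zone->lock);

	/* ... free the pages to the buddy lists as before ... */

	/*
	 * Wake any waiters while still holding zone->lock, so the wakeup
	 * cannot slip in between a waiter's watermark check and its sleep.
	 */
	check_zone_pressure(zone);

	spin_unlock(&zone->lock);
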
> 
> > If you are missing anything, then that needs to be explained and fixed
> > rather than just adding extra checks.
> > 
> > > +
> > > +/**
> > > + * zonepressure_wait - Wait for pressure on a zone to ease off
> > > + * @zone: The zone that is expected to be under pressure
> > > + * @order: The order the caller is waiting on pages for
> > > + * @timeout: Wait until pressure is relieved or this timeout is reached
> > > + *
> > > + * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
> > > + * It's considered to be relieved if any direct reclaimer or kswapd brings
> > > + * the zone above the low watermark.
> > > + */
> > > +long zonepressure_wait(struct zone *zone, unsigned int order, long timeout)
> > > +{
> > > +	long ret;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +wait_again:
> > > +	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
> > 
> > I guess to do it without races you need to check watermark here.
> > And possibly some barriers if it is done without zone->lock.
> > 
> 
> As watermark checks are already done without the zone->lock and without
> barriers, why are they needed here? Yes, there are small races. For
> example, it's possible to hit a window where pages were freed between
> when the watermarks were checked and when we went to sleep here, but that
> is similar to the current behaviour.

Well, with the check in free_pages_bulk, doing another check here
before the wait should close all lost-wakeup races. I agree
these are pretty fuzzy heuristics anyway, so the races don't *really*
matter a lot. But it seems easy to close them, so I don't see
why not.
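
I.e. roughly (untested, just to show the ordering that closes the race):

	prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);

	/*
	 * Re-check now that we are on the queue: either we see the
	 * watermark is already met, or the freeing side (checking
	 * under zone->lock) sees us and wakes us, so the wakeup
	 * cannot be lost.
	 */
	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
		finish_wait(zone->pressure_wq, &wait);
		return timeout;
	}

	ret = io_schedule_timeout(timeout);
	finish_wait(zone->pressure_wq, &wait);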


> > > +
> > > +	/*
> > > +	 * The use of io_schedule_timeout() here means that it gets
> > > +	 * accounted for as IO waiting. This may or may not be the case
> > > +	 * but at least this way it gets picked up by vmstat
> > > +	 */
> > > +	ret = io_schedule_timeout(timeout);
> > > +	finish_wait(zone->pressure_wq, &wait);
> > > +
> > > +	/* If woken early, check watermarks before continuing */
> > > +	if (ret && !zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0)) {
> > > +		timeout = ret;
> > > +		goto wait_again;
> > > +	}
> > 
> > And then I don't know if we'd really need the extra check here. Might as
> > well just let the allocator try again and avoid the code?
> > 
> 
> I was considering multiple processes being woken up and racing with each
> other. I can drop this check though. The worst that happens is that multiple
> processes wake and walk the full zonelist. Some will succeed and others
> will go back to sleep.

Yep. And it doesn't really solve that race either, because the zone
might subsequently go below the watermark.

Thanks,
Nick

