[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161026220339.GE2699@techsingularity.net>
Date: Wed, 26 Oct 2016 23:03:39 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andy Lutomirski <luto@...capital.net>,
Andreas Gruenbacher <agruenba@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Andy Lutomirski <luto@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Bob Peterson <rpeterso@...hat.com>,
Steven Whitehouse <swhiteho@...hat.com>,
linux-mm <linux-mm@...ck.org>
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit
On Wed, Oct 26, 2016 at 02:26:57PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 1:31 PM, Mel Gorman <mgorman@...hsingularity.net> wrote:
> >
> > IO wait activity is not all that matters. We hit the lock/unlock paths
> > during a lot of operations like reclaim.
>
> I doubt we do.
>
> Yes, we hit the lock/unlock itself, but do we hit the *contention*?
>
> The current code is nasty, and always ends up touching the wait-queue
> regardless of whether it needs to or not, but we have a fix for that.
>
To be clear, are you referring to PeterZ's patch that avoids the lookup? If
so, I see your point.
> With that fixed, do we actually get contention on a per-page basis?
Reclaim would have to running parallel to migrations, faults, clearing
write-protect etc. I can't think of a situation where a normal workload
would hit it regularly and/or for long durations.
> Because without contention, we'd never actually look up the wait-queue
> at all.
>
> I suspect that without IO, it's really really hard to actually get
> that contention, because things like reclaim end up looking at the LRU
> queue etc wioth their own locking, so it should look at various
> individual pages one at a time, not have multiple queues look at the
> same page.
>
Except many direct reclaimers on small LRUs while a system is thrashing --
not a case that really matters, you've already lost.
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 7f2ae99e5daf..0f088f3a2fed 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -440,33 +440,7 @@ struct zone {
> >> + int initialized;
> >>
> >> /* Write-intensive fields used from the page allocator */
> >> ZONE_PADDING(_pad1_)
> >
> > zone_is_initialized is mostly the domain of hotplug. A potential cleanup
> > is to use a page flag and shrink the size of zone slightly. Nothing to
> > panic over.
>
> I really did that to make it very obvious that there was no semantic
> change. I just set the "initialized" flag in the same place where it
> used to initialize the wait_table, so that this:
>
> >> static inline bool zone_is_initialized(struct zone *zone)
> >> {
> >> - return !!zone->wait_table;
> >> + return zone->initialized;
> >> }
>
> ends up being obviously equivalent.
>
No problem with that.
> >> +#define WAIT_TABLE_BITS 8
> >> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> >> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> >> +
> >> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> >> +{
> >> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> >> + unsigned long val = (unsigned long)word << shift | bit;
> >> +
> >> + return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> >> +}
> >> +EXPORT_SYMBOL(bit_waitqueue);
> >> +
> >
> > Minor nit that it's unfortunate this moved to the scheduler core. It
> > wouldn't have been a complete disaster to add a page_waitqueue_init() or
> > something similar after sched_init.
>
> I considered that, but decided that "minimal patch" was better. Plus,
> with that bit_waitqueue() actually also being used for the page
> locking queues (which act _kind of_ but not quite, like a bitlock),
> the bit_wait_table is actually more core than just the bit-wait code.
>
> In fact, I considered just renaming it to "hashed_wait_queue", because
> that's effectively how we use it now, rather than being particularly
> specific to the bit-waiting. But again, that would have made the patch
> bigger, which I wanted to avoid since this is a post-rc2 thing due to
> the gfs2 breakage.
>
No objection. Shuffling it around does not make it obviously better in
any way.
In the meantime, a machine freed up. FWIW, it survived booting on a 2-socket
and about 20 minutes of bashing on reclaim paths from multiple processes
to beat on lock/unlock. I didn't do a performance comparison or gather
profile data but I wouldn't expect anything interesting from profiles
other than some cycles saved.
--
Mel Gorman
SUSE Labs
Powered by blists - more mailing lists