linux-kernel - Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

Message-ID: <20120927121457.GC3429@suse.de>
Date:	Thu, 27 Sep 2012 13:14:57 +0100
From:	Mel Gorman <mgorman@...e.de>
To:	Minchan Kim <minchan@...nel.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Richard Davies <richard@...chsys.com>,
	Shaohua Li <shli@...nel.org>, Rik van Riel <riel@...hat.com>,
	Avi Kivity <avi@...hat.com>,
	QEMU-devel <qemu-devel@...gnu.org>, KVM <kvm@...r.kernel.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and
 no pages were isolated

On Wed, Sep 26, 2012 at 09:49:30AM +0900, Minchan Kim wrote:
> On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> > On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > > On Mon, 24 Sep 2012 10:39:38 +0100
> > > Mel Gorman <mgorman@...e.de> wrote:
> > > 
> > > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > > > 
> > > > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > > that pageblock which justifies repolling it?
> > > > > 
> > > > 
> > > > The "something" event you are looking for is pages being freed or
> > > > allocated in the page allocator. A movable page being allocated in a block
> > > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > > Unfortunately this would impact the fast path of the alloc and free paths
> > > > of the page allocator. I felt that that was too high a price to pay.
> > > 
> > > We already do a similar thing in the page allocator: clearing of
> > > ->all_unreclaimable and ->pages_scanned. 
> > 
> > That is true but that is a simple write (shared cache line but still) to
> > a struct zone. Worse, now that you point it out, that's pretty stupid. It
> > should be checking if the value is non-zero before writing to it to avoid
> > a cache line bounce.
> > 
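A rough sketch of the check-before-write idea from the paragraph above, in
plain C rather than kernel code. The struct, the function names and the
"writes" field are hypothetical and exist only to make the difference visible;
the two hint fields are named after ->pages_scanned and ->all_unreclaimable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct zone fields named in the thread. */
struct zone_hint {
	unsigned long pages_scanned;
	bool all_unreclaimable;
	unsigned long writes;	/* illustration only: counts stores to the line */
};

/* Unconditional clear: stores every time, so the shared line is dirtied. */
void clear_hints_always(struct zone_hint *z)
{
	z->pages_scanned = 0;
	z->all_unreclaimable = false;
	z->writes++;
}

/* Read first and store only when a hint is actually set. */
void clear_hints_if_set(struct zone_hint *z)
{
	if (z->pages_scanned || z->all_unreclaimable) {
		z->pages_scanned = 0;
		z->all_unreclaimable = false;
		z->writes++;
	}
}
```

When the hints are already clear, clear_hints_if_set() only reads, so the
cache line can stay shared between CPUs instead of bouncing.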
> > Clearing PB_migrate_skip in this path to avoid the need to ever poll is
> > not as cheap as it needs to be.
> > 
> > set_pageblock_skip
> >   -> set_pageblock_flags_group
> >     -> page_zone
> >     -> page_to_pfn
> >     -> get_pageblock_bitmap
> >     -> pfn_to_bitidx
> >     -> __set_bit
> > 
> > > But that isn't on the "fast
> > > path" really - it happens once per pcp unload. 
> > 
> > That's still an important enough path that I'm wary of making it fatter
> > and that only covers the free path. To avoid the polling, the allocation
> > side needs to be handled too. It could be shoved down into rmqueue() to
> > put it into a slightly colder path but still, it's a price to pay to keep
> > compaction happy.
> > 
> > > Can we do something
> > > like that?  Drop some hint into the zone without having to visit each
> > > page?
> > > 
> > 
> > Not without incurring a cost, but yes, it is possible to give a hint on when
> > PB_migrate_skip should be cleared and move away from that time-based hammer.
> > 
> > First, we'd introduce a variant of get_pageblock_migratetype() that returns
> > all the bits for the pageblock flags and then helpers to extract either the
> > migratetype or the PB_migrate_skip bit. We are already incurring the cost of
> > get_pageblock_migratetype() so it will not be much more expensive than what
> > is already there. If there is an allocation or free within a pageblock that
> > has the PB_migrate_skip bit set then we increment a counter. When the counter
> > reaches some to-be-decided "threshold" then compaction may clear all the
> > bits. This would match the criterion of the clearing being based on activity.
> > 
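A minimal userspace sketch of the counter-based hint described above, not the
kernel implementation. Everything here is an assumption made for illustration:
the bit layout (three migratetype bits followed by one skip bit), the flat
per-block array, the function names and the threshold value, which the thread
explicitly leaves undecided.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MIGRATETYPE_MASK	0x7u		/* assumed: low three bits hold the migratetype */
#define SK_SKIP_BIT		(1u << 3)	/* assumed: next bit is the skip hint */
#define SK_NR_BLOCKS		8u		/* arbitrary number of pageblocks */
#define SK_THRESHOLD		4u		/* placeholder for the undecided threshold */

struct zone_sketch {
	unsigned int pageblock_flags[SK_NR_BLOCKS];
	unsigned int skip_activity;	/* alloc/free events seen in skipped blocks */
};

/* Variant lookup: return every flag bit for the block in one go. */
unsigned int get_pageblock_flags_all(const struct zone_sketch *z, unsigned int block)
{
	assert(block < SK_NR_BLOCKS);
	return z->pageblock_flags[block];
}

/* Helpers that split the combined value. */
unsigned int flags_to_migratetype(unsigned int flags)
{
	return flags & SK_MIGRATETYPE_MASK;
}

bool flags_to_skip(unsigned int flags)
{
	return (flags & SK_SKIP_BIT) != 0;
}

/* Alloc/free side: one lookup yields the migratetype and feeds the counter. */
unsigned int note_page_activity(struct zone_sketch *z, unsigned int block)
{
	unsigned int flags = get_pageblock_flags_all(z, block);

	if (flags_to_skip(flags))
		z->skip_activity++;
	return flags_to_migratetype(flags);
}

/* Compaction side: after enough activity, clear every skip bit and start over. */
bool maybe_reset_skip_hints(struct zone_sketch *z)
{
	unsigned int i;

	if (z->skip_activity < SK_THRESHOLD)
		return false;
	for (i = 0; i < SK_NR_BLOCKS; i++)
		z->pageblock_flags[i] &= ~SK_SKIP_BIT;
	z->skip_activity = 0;
	return true;
}
```

The allocator still does a single flags lookup per page; the extra cost is the
skip-bit test and the write to the shared counter.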
> > There are four potential problems with this
> > 
> > 1. The logic to retrieve all the bits and split them up will be a little
> >    convoluted but maybe it would not be that bad.
> > 
> > 2. The counter is a shared-writable cache line but obviously it could
> >    be moved to vmstat and incremented with inc_zone_page_state to offset
> >    the cost a little.
> > 
> > 3. The biggest weakness is that there is no way to know if the
> >    counter is incremented based on activity in a small subset of blocks.
> > 
> > 4. What should the threshold be?
> > 
> > The first problem is minor but the other three are potentially a mess.
> > Adding another vmstat counter is bad enough in itself but if the counter
> > is incremented based on a small subset of pageblocks, the hint becomes
> > potentially useless.
> 
> Another idea is that we can add two bits (PG_check_migrate/PG_check_free)
> to pageblock_flags_group.
> In the allocation path, we can set PG_check_migrate in a pageblock.
> In the free path, we can set PG_check_free in a pageblock.
> They would be cleared by compaction's scan, as happens now.
> So we can discard 3 and 4 at least.
> 

Adding a second bit does not fix problem 3 or problem 4 at all. With two
bits, all activity could be concentrated on two blocks - one migrate and
one free. The threshold still has to be selected.
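
To make problem 3 concrete, here is a small userspace sketch, not kernel code.
The function name, the struct and the block count are hypothetical. It records
what a single zone-wide counter would see next to the number of distinct
blocks actually touched, which is the information the counter discards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_BLOCKS 16u	/* arbitrary size for the illustration */

struct activity_summary {
	unsigned int counter;		/* what a zone-wide counter would see */
	unsigned int distinct_blocks;	/* what it cannot see */
};

/* Walk a list of block indexes, one entry per alloc/free event. */
struct activity_summary summarise_activity(const unsigned int *blocks, size_t n)
{
	bool touched[DEMO_NR_BLOCKS] = { false };
	struct activity_summary s = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		assert(blocks[i] < DEMO_NR_BLOCKS);
		s.counter++;
		if (!touched[blocks[i]]) {
			touched[blocks[i]] = true;
			s.distinct_blocks++;
		}
	}
	return s;
}
```

Eight events bouncing between two blocks and eight events spread over eight
blocks give the same counter value, so a threshold on the counter alone cannot
tell the two cases apart.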

> Another idea is to cure it by fixing the fundamental problem:
> make the zone's locks more fine-grained.

Far easier said than done and only covers the contention problem. It
does nothing for the scanning problem.

> As time goes by, systems use more memory but our zone locking
> isn't scalable. Recently, lru_lock and zone->lock contention reports
> aren't rare so I think it's a good time to move to the next step.
> 

Lock contention on both those locks recently was due to compaction
rather than something more fundamental.

> How about defining a struct sub_zone per 2G or 4G?
> A zone could then have several sub_zones depending on its size; the
> sub_zones would take over the current zone's role and the zone would
> just be a container of sub_zones.
> Of course, it's not easy to implement but I think someday we should
> go that way. Is it really overkill?
> 

On one side, that greatly increases the cost of the page allocator and
the size of the zonelist it must walk as it'll need additional walks for
each of these lists. The interaction with fragmentation avoidance and
how it handles fallbacks would be particularly problematic. On the other
side, multiple sub-zones will also introduce multiple LRUs making the
existing balancing problem considerably worse.

And again, all this would be aimed at contention and do nothing for the
scanning problem at hand.

Sub-zones also introduce the problem of multiple LRUs that must be balanced.

I'm working on a patch that removes the time heuristic that I think might
work. Will hopefully post it today.

-- 
Mel Gorman
SUSE Labs