linux-kernel - Re: [patch v2 4/4] mm, thp: do not perform sync compaction on pagefault

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.02.1405020402500.19297@chino.kir.corp.google.com>
Date:	Fri, 2 May 2014 04:22:03 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Mel Gorman <mgorman@...e.de>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	Greg Thelen <gthelen@...gle.com>,
	Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: Re: [patch v2 4/4] mm, thp: do not perform sync compaction on
 pagefault

On Fri, 2 May 2014, Mel Gorman wrote:

> > Synchronous memory compaction can be very expensive: it can iterate an enormous 
> > amount of memory without aborting, constantly rescheduling, waiting on page
> > locks and lru_lock, etc, if a pageblock cannot be defragmented.
> > 
> > Unfortunately, it's too expensive for pagefault for transparent hugepages and 
> > it's much better to simply fallback to pages.  On 128GB machines, we find that 
> > synchronous memory compaction can take O(seconds) for a single thp fault.
> > 
> > Now that async compaction remembers where it left off without strictly relying
> > on sync compaction, this makes thp allocations best-effort without causing
> > egregious latency during pagefault.
> > 
> > Signed-off-by: David Rientjes <rientjes@...gle.com>
> 
> Compaction uses MIGRATE_SYNC_LIGHT which in the current implementation
> avoids calling ->writepage to clean dirty page in the fallback migrate
> case
> 
> static int fallback_migrate_page(struct address_space *mapping,
>         struct page *newpage, struct page *page, enum migrate_mode mode)
> {
>         if (PageDirty(page)) {
>                 /* Only writeback pages in full synchronous migration */
>                 if (mode != MIGRATE_SYNC)
>                         return -EBUSY;
>                 return writeout(mapping, page);
>         }
> 
> or waiting on page writeback in other cases
> 
>                 /*
>                  * Only in the case of a full synchronous migration is it
>                  * necessary to wait for PageWriteback. In the async case,
>                  * the retry loop is too short and in the sync-light case,
>                  * the overhead of stalling is too much
>                  */
>                 if (mode != MIGRATE_SYNC) {
>                         rc = -EBUSY;
>                         goto uncharge;
>                 }
> 
> or on acquiring the page lock for unforced migrations.
> 

The page locks I'm referring to is the lock_page() in __unmap_and_move() 
that gets called for sync compaction after the migrate_pages() iteration 
makes a few passes and unsuccessfully grabs it.  This becomes a forced 
migration since __unmap_and_move() returns -EAGAIN when the trylock fails.

> However, buffers still get locked in the SYNC_LIGHT case causing stalls
> in buffer_migrate_lock_buffers which may be undesirable or maybe you are
> hitting some other case.

We have perf profiles from one workload in particular that shows 
contention on i_mmap_mutex (anon isn't interesting since the vast majority 
of memory on this workload [120GB on a 128GB machine] is has a gup pin and 
doesn't get isolated because of 119d6d59dcc0 ("mm, compaction: avoid 
isolating pinned pages")) between cpus all doing memory compaction trying 
to fault thp memory.

That's one example that we've seen, but the fact remains that at times 
sync compaction will iterate the entire 128GB machine and not allow an 
order-9 page to be allocated and there's nothing to preempt it like the 
need_resched() or lock contention checks that async compaction has.  (I 
say "at times" because deferred compaction does help if it's triggered and 
the cached pfn values help, but the worst case scenario is when the entire 
machine is iterated.)  Would it make sense when HPAGE_PMD_ORDER == 
pageblock_order, which is usually the case, that we don't even try to 
migrate pages unless the entire pageblock is successfully isolated and 
gfp_mask & __GFP_NO_KSWAPD?  That's the only other thing that I can think 
that will avoid such a problem other than this patch.

We're struggling to find the logic that ensures sync compaction does not 
become too expensive at pagefault and unless that can be guaranteed, it 
seemed best to avoid it entirely.

> It would be preferable to identify what is
> getting stalled in SYNC_LIGHT compaction and fix that rather than
> disabling it entirely. You may also want to distinguish between a direct
> compaction by a process and collapsing huge pages as done by khugepaged.
> 

I'm referring to latencies of 8-9 seconds to fault 64MB of transparent 
hugepages.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/