Date:	Mon, 22 Nov 2010 15:43:48 +0000
From:	Mel Gorman <mel@....ul.ie>
To:	Andrea Arcangeli <aarcange@...hat.com>
Cc:	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Mel Gorman <mel@....ul.ie>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH 0/7] Use memory compaction instead of lumpy reclaim during high-order allocations V2

Changelog since V1
  o Drop patch that takes a scanning hint from LRU
  o Loop in reclaim until it is known that enough pages are reclaimed for
    compaction to make forward progress or that progress is no longer
    possible
  o Do not call compaction from within reclaim. Instead have the allocator
    or kswapd call it as necessary
  o Obeying sync in migration now means just avoiding wait_on_page_writeback

Huge page allocations are not expected to be cheap, but lumpy reclaim
is still very disruptive. While it is far better than reclaiming random
order-0 pages, it ignores the reference bit of pages near the page
selected from the LRU. Memory compaction was merged in 2.6.35 to reduce
reliance on lumpy reclaim by moving pages around instead of reclaiming
them when enough pages were already free. It has been tested fairly
heavily at this point. This is a prototype series that uses compaction
more aggressively.

When CONFIG_COMPACTION is set, lumpy reclaim is no longer used. Instead,
a mechanism called reclaim/compaction is used where a number of order-0
pages are reclaimed and later the caller uses compaction to satisfy the
allocation. This keeps a larger number of active pages in memory at the cost
of increased use of migration and compaction scanning. With the full series
applied, latencies when allocating huge pages are significantly reduced. By
the end of the series, hints are taken from the LRU on where the best place
to start migrating from might be.

Andrea, this version calls compaction from the callers instead of within
reclaim. Your main concern before was that compaction was being called after
a blind reclaim without checking if enough reclaim work had occurred. This
version is better at checking if enough work has been done but the callers
of compaction are a little awkward. I'm wondering if it really does make
more sense to call compact_zone_order() if should_continue_reclaim() returns
false and indications are that compaction would have a successful outcome.

Four kernels are tested

traceonly		This kernel is using compaction and has the
			tracepoints applied.

reclaimcompact		First three patches. A number of order-0 pages
			are reclaimed and then the zone is compacted. This
			replaces lumpy reclaim, but lumpy reclaim is still
			available if CONFIG_COMPACTION is not set.

obeysync		First five patches. Migration will avoid the use
			of wait_on_page_writeback() if requested by the
			caller.

fastscan		First six patches applied. try_to_compact_pages()
			uses shortcuts in the faster compaction path to
			reduce latency.

The final patch is just a rename so it is not reported.  The target test was
a high-order allocation stress test. Testing was based on kernel 2.6.37-rc2.
The test machine was x86-64 with 3G of RAM.

STRESS-HIGHALLOC
                traceonly         reclaimcompact     obeysync         fastscan
Pass 1          90.00 ( 0.00%)    80.00 (-10.00%)    84.00 (-6.00%)   82.00 (-8.00%)
Pass 2          92.00 ( 0.00%)    82.00 (-10.00%)    86.00 (-6.00%)   86.00 (-6.00%)
At Rest         94.00 ( 0.00%)    93.00 (-1.00%)     95.00 ( 1.00%)   93.00 (-1.00%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3359.07   3284.68    3299.3   3292.66
Total Elapsed Time (seconds)               2120.23   1329.19   1314.64   1312.75


Success rates are slightly down in exchange for faster completion times. This
is a consequence of the patches reducing both the latency and the amount of
work performed by reclaim. The success figures could be matched, but only by
hammering the system harder; as the success rates are still very high, the
extra overhead is not worth it. All in all, the test completes 15 minutes
faster, which is a pretty decent improvement.

FTrace Reclaim Statistics: vmscan
                                         traceonly reclaimcompact obeysync fastscan
Direct reclaims                                403        704        757        648 
Direct reclaim pages scanned                 62655     734125     718325     621864 
Direct reclaim pages reclaimed               36445     186805     214376     187671 
Direct reclaim write file async I/O           2090        748        517        561 
Direct reclaim write anon async I/O           9850       8089       5704       4307 
Direct reclaim write file sync I/O               1          0          0          0 
Direct reclaim write anon sync I/O              70          1          1          0 
Wake kswapd requests                           768       1061        890        979 
Kswapd wakeups                                 581        439        451        423 
Kswapd pages scanned                       4566808    2421272    2284775    2349758 
Kswapd pages reclaimed                     2338283    1580849    1558239    1559380 
Kswapd reclaim write file async I/O          48287        858        673        649 
Kswapd reclaim write anon async I/O         755369       3327       3964       4037 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)       104.13      41.53      71.18      53.77 
Time kswapd awake (seconds)                 891.88     233.58     199.42     212.52 

Total pages scanned                        4629463   3155397   3003100   2971622
Total pages reclaimed                      2374728   1767654   1772615   1747051
%age total pages scanned/reclaimed          51.30%    56.02%    59.03%    58.79%
%age total pages scanned/written            17.62%     0.41%     0.36%     0.32%
%age  file pages scanned/written             1.09%     0.05%     0.04%     0.04%
Percentage Time Spent Direct Reclaim         3.01%     1.25%     2.11%     1.61%
Percentage Time kswapd Awake                42.07%    17.57%    15.17%    16.19%

These are the reclaim statistics. The time spent in direct reclaim and with
kswapd awake is reduced, as is the overall reclaim activity (about 2.4G
less worth of pages reclaimed). It looks like obeysync increases the stall
time for direct reclaimers. This could be reduced by having kswapd use sync
compaction, but the perceived ideal was that it is better for kswapd to
continually make forward progress.

FTrace Reclaim Statistics: compaction
                                        traceonly reclaimcompact obeysync  fastscan
Migrate Pages Scanned                     83190294 1277116960  955517979  927209597 
Migrate Pages Isolated                      245208    4068555    3173644    3920101 
Free    Pages Scanned                     25488658  597156637  668273710  927901903 
Free    Pages Isolated                      335004    4575669    3597552    4408042 
Migrated Pages                              241260    4018215    3123549    3865212 
Migration Failures                            3948      50340      50095      54863 

The patch series increases the amount of compaction activity, but this is not
surprising as there are more callers. Once reclaim/compaction is introduced,
the remainder of the series reduces the work slightly. This work doesn't
show up directly in the latency figures, but it is thrashing the cache.
Future work may look at reducing the amount of scanning performed by
compaction.

The raw figures are convincing enough that the test completes faster, but
what we really care about is latency, so here are the average latencies
when allocating huge pages.

X86-64
http://www.csn.ul.ie/~mel/postings/memorycompact-20101122/highalloc-interlatency-hydra-mean.ps
http://www.csn.ul.ie/~mel/postings/memorycompact-20101122/highalloc-interlatency-hydra-stddev.ps

The mean latencies are pushed *way* down, implying that the amount of work
required to allocate each huge page is drastically reduced.

 include/linux/compaction.h        |   20 ++++-
 include/linux/kernel.h            |    7 ++
 include/linux/migrate.h           |   12 ++-
 include/trace/events/compaction.h |   74 +++++++++++++++++
 include/trace/events/vmscan.h     |    6 +-
 mm/compaction.c                   |  132 ++++++++++++++++++++++---------
 mm/memory-failure.c               |    3 +-
 mm/memory_hotplug.c               |    3 +-
 mm/mempolicy.c                    |    6 +-
 mm/migrate.c                      |   22 +++--
 mm/page_alloc.c                   |   32 +++++++-
 mm/vmscan.c                       |  157 ++++++++++++++++++++++++++++---------
 12 files changed, 371 insertions(+), 103 deletions(-)
 create mode 100644 include/trace/events/compaction.h
