Date:	Thu, 12 Jan 2012 15:46:17 +0000
From:	Mel Gorman <mel@....ul.ie>
To:	Rik van Riel <riel@...hat.com>
Cc:	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	akpm@...ux-foundation.org,
	KOSAKI Motohiro <kosaki.motohiro@...il.com>,
	Johannes Weiner <hannes@...xchg.org>, hughd@...gle.com,
	aarcange@...hat.com
Subject: Re: [PATCH -mm 1/2] mm: kswapd test order 0 watermarks when
 compaction is enabled

On Mon, Jan 09, 2012 at 09:33:13PM -0500, Rik van Riel wrote:
> When built with CONFIG_COMPACTION, kswapd does not try to free
> contiguous pages.  Because it is not trying, it should also not
> test whether it succeeded, because that can result in continuous
> page reclaim, until a large fraction of memory is free and large
> fractions of the working set have been evicted.
> 

hmm, I'm missing something about your explanation.

1. wakeup_kswapd passes requested order to kswapd_max_order. Bear in
   mind that this does *not* happen for THP.
2. kswapd reads this and passes it to balance_pgdat
3. balance_pgdat puts that in scan_control
4. shrink_zone gets that scan_control and so on

kswapd does try to free contiguous pages.

What is the source of the contiguous allocations of concern? The
mm_vmscan_wakeup_kswapd tracepoint should be able to get you a stack
trace to identify the source of high-order allocations.
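For reference, the tracepoint can be enabled with stack traces through the tracing debugfs interface. This is a sketch of a typical session; the mount point and event path are assumed and may differ on your kernel:

```
# mount -t debugfs none /sys/kernel/debug
# echo stacktrace > /sys/kernel/debug/tracing/trace_options
# echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_wakeup_kswapd/enable
# cat /sys/kernel/debug/tracing/trace
```

Each mm_vmscan_wakeup_kswapd event in the trace then carries the call chain of the allocation that woke kswapd.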

To confirm I was not on crazy pills, I fired up a systemtap script
that did a burst of order-8 allocations (ok, some crazy pills) and
observed the following with tracepoints:

          <idle>-0     [002] 236009.284803: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=8
         kswapd0-53    [002] 236009.285028: mm_vmscan_kswapd_wake: nid=0 order=8
         kswapd0-53    [002] 236009.285034: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=9 nr_scanned=0 nr_taken=0 contig_taken=0 contig_dirty=0 contig_failed=0
         kswapd0-53    [002] 236009.285035: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=22 nr_scanned=0 nr_taken=0 contig_taken=0 contig_dirty=0 contig_failed=0
         kswapd0-53    [002] 236009.285038: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=1 nr_scanned=1 nr_taken=1 contig_taken=0 contig_dirty=0 contig_failed=1
         kswapd0-53    [002] 236009.285049: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=12 contig_dirty=0 contig_failed=20
         kswapd0-53    [002] 236009.285080: mm_vmscan_lru_isolate: isolate_mode=2 order=8 nr_requested=32 nr_scanned=38 nr_taken=38 contig_taken=24 contig_dirty=0 contig_failed=14
         kswapd0-53    [002] 236009.285090: mm_vmscan_lru_isolate: isolate_mode=1 order=8 nr_requested=23 nr_scanned=24 nr_taken=24 contig_taken=4 contig_dirty=0 contig_failed=20

This is with CONFIG_COMPACTION.

You're still in the right area though. kswapd does contiguous-aware
reclaim but it does not do any compaction itself, so potentially it is
doing excessive reclaim while depending on another process to do the
compaction for it. That is a problem.

> Also remove a line of code that increments balanced right before
> exiting the function.
> 
> Signed-off-by: Rik van Riel <riel@...hat.com>
> ---
>  mm/vmscan.c |   22 +++++++++++++++++-----
>  1 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f54a05b..c3eec6b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2608,7 +2608,7 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab;
> +			int nr_slab, testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
> @@ -2637,11 +2637,25 @@ loop_again:
>  			 * gap is either the low watermark or 1%
>  			 * of the zone, whichever is smaller.
>  			 */
> +			testorder = order;
>  			balance_gap = min(low_wmark_pages(zone),
>  				(zone->present_pages +
>  					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>  				KSWAPD_ZONE_BALANCE_GAP_RATIO);
> -			if (!zone_watermark_ok_safe(zone, order,
> +			/*
> +			 * Kswapd reclaims only single pages when
> +			 * COMPACTION_BUILD. Trying too hard to get
> +			 * contiguous free pages can result in excessive
> +			 * amounts of free memory, and useful things
> +			 * getting kicked out of memory.
> +			 * Limit the amount of reclaim to something sane,
> +			 * plus space for compaction to do its thing.
> +			 */
> +			if (COMPACTION_BUILD) {
> +				testorder = 0;
> +				balance_gap += 2<<order;
> +			}
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone) + balance_gap,
>  					end_zone, 0)) {

kswapd does reclaim high-order pages, so this comment is misleading.
However, I do see the type of problem you are talking about.

Direct reclaim in shrink_zones() does a check for compaction_suitable()
when deciding whether to abort reclaim or not. How about doing the same
for kswapd: if compaction can go ahead, goto out?

>  				shrink_zone(priority, zone, &sc);
> @@ -2670,7 +2684,7 @@ loop_again:
>  				continue;
>  			}
>  
> -			if (!zone_watermark_ok_safe(zone, order,
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone), end_zone, 0)) {
>  				all_zones_ok = 0;
>  				/*
> @@ -2776,8 +2790,6 @@ out:
>  
>  			/* If balanced, clear the congested flag */
>  			zone_clear_flag(zone, ZONE_CONGESTED);
> -			if (i <= *classzone_idx)
> -				balanced += zone->present_pages;
>  		}
>  	}
>  
> 

-- 
Mel Gorman
SUSE Labs
