linux-kernel - Re: [PATCH] vmscan: limit concurrent reclaimers in shrink

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4B2231BF.6040407@redhat.com>
Date:	Fri, 11 Dec 2009 06:49:19 -0500
From:	Larry Woodman <lwoodman@...hat.com>
To:	Rik van Riel <riel@...hat.com>
CC:	kosaki.motohiro@...fujitsu.com, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	aarcange@...hat.com
Subject: Re: [PATCH] vmscan: limit concurrent reclaimers in shrink_zone

Rik van Riel wrote:
> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways.  The trouble start when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.  In Larry's case, this resulted in over
> 6000 processes fighting over locks in the page reclaim code, even
> though the system already had 1.5GB of free memory.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves.  We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.
>
> Reported-by: Larry Woodman <lwoodman@...hat.com>
> Signed-off-by: Rik van Riel <riel@...hat.com>
>
> --- 
> This patch is against today's MMOTM tree. It has only been compile tested,
> I do not have an AIM7 system standing by.
>
> Larry, does this fix your issue?
>
>  Documentation/sysctl/vm.txt |   18 ++++++++++++++++++
>  include/linux/mmzone.h      |    4 ++++
>  include/linux/swap.h        |    1 +
>  kernel/sysctl.c             |    7 +++++++
>  mm/page_alloc.c             |    3 +++
>  mm/vmscan.c                 |   38 ++++++++++++++++++++++++++++++++++++++
>  6 files changed, 71 insertions(+)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..5cf766f 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- max_zone_concurrent_reclaim
>  - memory_failure_early_kill
>  - memory_failure_recovery
>  - min_free_kbytes
> @@ -278,6 +279,23 @@ The default value is 65536.
>  
>  =============================================================
>  
> +max_zone_concurrent_reclaim:
> +
> +The number of processes that are allowed to simultaneously reclaim
> +memory from a particular memory zone.
> +
> +With certain workloads, hundreds of processes end up in the page
> +reclaim code simultaneously.  This can cause large slowdowns due
> +to lock contention, freeing of way too much memory and occasionally
> +false OOM kills.
> +
> +To avoid these problems, only allow a smaller number of processes
> +to reclaim pages from each memory zone simultaneously.
> +
> +The default value is 8.
> +
> +=============================================================
> +
>  memory_failure_early_kill:
>  
>  Control how to kill processes when uncorrected memory error (typically
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 30fe668..ed614b8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -345,6 +345,10 @@ struct zone {
>  	/* Zone statistics */
>  	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
>  
> +	/* Number of processes running page reclaim code on this zone. */
> +	atomic_t		concurrent_reclaimers;
> +	wait_queue_head_t	reclaim_wait;
> +
>  	/*
>  	 * prev_priority holds the scanning priority for this zone.  It is
>  	 * defined as the scanning priority at which we achieved our reclaim
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..661eec7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,6 +254,7 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  extern int remove_mapping(struct address_space *mapping, struct page *page);
>  extern long vm_total_pages;
> +extern int max_zone_concurrent_reclaimers;
>  
>  #ifdef CONFIG_NUMA
>  extern int zone_reclaim_mode;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6ff0ae6..89b919c 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1270,6 +1270,13 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> +	{
> +		.procname	= "max_zone_concurrent_reclaimers",
> +		.data		= &max_zone_concurrent_reclaimers,
> +		.maxlen		= sizeof(max_zone_concurrent_reclaimers),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  #endif
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11ae66e..ca9cae1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3852,6 +3852,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>  
>  		zone->prev_priority = DEF_PRIORITY;
>  
> +		atomic_set(&zone->concurrent_reclaimers, 0);
> +		init_waitqueue_head(&zone->reclaim_wait);
> +
>  		zone_pcp_init(zone);
>  		for_each_lru(l) {
>  			INIT_LIST_HEAD(&zone->lru[l].list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2bbee91..cf3ef29 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
> +#include <linux/wait.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -129,6 +130,17 @@ struct scan_control {
>  int vm_swappiness = 60;
>  long vm_total_pages;	/* The total number of pages which the VM controls */
>  
> +/*
> + * Maximum number of processes concurrently running the page
> + * reclaim code in a memory zone.  Having too many processes
> + * just results in them burning CPU time waiting for locks,
> + * so we're better off limiting page reclaim to a sane number
> + * of processes at a time.  We do this per zone so local node
> + * reclaim on one NUMA node will not block other nodes from
> + * making progress.
> + */
> +int max_zone_concurrent_reclaimers = 8;
> +
>  static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
> @@ -1600,6 +1612,29 @@ static void shrink_zone(int priority, struct zone *zone,
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int noswap = 0;
>  
> +	if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> +					max_zone_concurrent_reclaimers) {
> +		/*
> +		 * Do not add to the lock contention if this zone has
> +		 * enough processes doing page reclaim already, since
> +		 * we would just make things slower.
> +		 */
> +		sleep_on(&zone->reclaim_wait);
> +
> +		/*
> +		 * If other processes freed enough memory while we waited,
> +		 * break out of the loop and go back to the allocator.
> +		 */
> +		if (zone_watermark_ok(zone, sc->order, low_wmark_pages(zone),
> +					0, 0)) {
> +			wake_up(&zone->reclaim_wait);
> +			sc->nr_reclaimed += nr_to_reclaim;
> +			return;
> +		}
> +	}
> +
> +	atomic_inc(&zone->concurrent_reclaimers);
> +
>  	/* If we have no swap space, do not bother scanning anon pages. */
>  	if (!sc->may_swap || (nr_swap_pages <= 0)) {
>  		noswap = 1;
> @@ -1655,6 +1690,9 @@ static void shrink_zone(int priority, struct zone *zone,
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> +
> +	atomic_dec(&zone->concurrent_reclaimers);
> +	wake_up(&zone->reclaim_wait);
>  }
>  
>  /*
>
>   
FYI everyone, I put together a similar patch(doesnt block explicitly, 
relies on vm_throttle) for
RHEL5(2.6.18 based kernel) and it fixes hangs due to several CPUs 
spinning on zone->lru_lock.
This might eliminate the need to add complexity to the reclaim code.  
Its impossible to make it parallel
enough to allow hundreds or even thousands of processes in this code at 
the same time.

Will backport this patch & test it as well as testing it on latest kernel.

Larry

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/