linux-kernel - Re: [ 39/40] cpuset: mm: reduce large amounts of memory barrier related damage v3

Date:	Fri, 27 Jul 2012 12:08:24 -0300
From:	Herton Ronaldo Krzesinski <herton.krzesinski@...onical.com>
To:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc:	linux-kernel@...r.kernel.org, stable@...r.kernel.org,
	torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
	alan@...rguk.ukuu.org.uk, Mel Gorman <mgorman@...e.de>,
	Miao Xie <miaox@...fujitsu.com>,
	David Rientjes <rientjes@...gle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Christoph Lameter <cl@...ux.com>
Subject: Re: [ 39/40] cpuset: mm: reduce large amounts of memory barrier
 related damage v3

On Thu, Jul 26, 2012 at 02:29:57PM -0700, Greg Kroah-Hartman wrote:
> From: Greg KH <gregkh@...uxfoundation.org>
> 
> 3.0-stable review patch.  If anyone has any objections, please let me know.
> 
> ------------------
> 
> From: Mel Gorman <mgorman@...e.de>
> 
> commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.
> 
> Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
> 	expensive and severely impacted page allocator performance. This
> 	is part of a series of patches that reduce page allocator overhead.
> 
> Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
> changing cpuset's mems") wins a super prize for the largest number of
> memory barriers entered into fast paths for one commit.
> 
> [get|put]_mems_allowed is incredibly heavy with pairs of full memory
> barriers inserted into a number of hot paths.  This was detected while
> investigating a large page allocator slowdown introduced some time
> after 2.6.32.  The largest portion of this overhead was shown by
> oprofile to be at an mfence introduced by this commit into the page
> allocator hot path.
> 
> For extra style points, the commit introduced the use of yield() in an
> implementation of what looks like a spinning mutex.
> 
> This patch replaces the full memory barriers on both read and write
> sides with a sequence counter with just read barriers on the fast path
> side.  This is much cheaper on some architectures, including x86.  The
> main bulk of the patch is the retry logic if the nodemask changes in a
> manner that can cause a false failure.
> 
> While updating the nodemask, a check is made to see if a false failure
> is a risk.  If it is, the sequence number gets bumped and parallel
> allocators will briefly stall while the nodemask update takes place.
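
For readers unfamiliar with the mechanism, here is a minimal user-space sketch of a sequence counter, assuming C11 atomics. It is not the kernel's seqcount_t implementation; the sketch_* names are hypothetical stand-ins for read_seqcount_begin(), read_seqcount_retry(), write_seqcount_begin() and write_seqcount_end(). The writer makes the counter odd while it updates, and a reader compares the counter before and after its work to decide whether to retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's seqcount_t. */
struct seqcount_sketch {
	atomic_uint sequence;	/* odd while a writer is mid-update */
};

/* Reader side: wait for an even value, then keep it as the cookie. */
static unsigned int sketch_read_begin(struct seqcount_sketch *s)
{
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&s->sequence, memory_order_acquire);
	} while (seq & 1);
	return seq;
}

/* Reader side: true if a writer ran since the cookie was taken. */
static bool sketch_read_retry(struct seqcount_sketch *s, unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->sequence, memory_order_relaxed) != seq;
}

/* Writer side: writers must already be serialised by the caller. */
static void sketch_write_begin(struct seqcount_sketch *s)
{
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

static void sketch_write_end(struct seqcount_sketch *s)
{
	atomic_fetch_add_explicit(&s->sequence, 1, memory_order_release);
}

/* Single-threaded walk-through: true if the reader noticed the update. */
static bool sketch_demo_update_detected(void)
{
	struct seqcount_sketch s;
	unsigned int cookie;

	atomic_init(&s.sequence, 0);
	cookie = sketch_read_begin(&s);

	sketch_write_begin(&s);	/* the nodemask update would happen here */
	sketch_write_end(&s);

	assert((atomic_load(&s.sequence) & 1) == 0);	/* even again */
	return sketch_read_retry(&s, cookie);
}
```

The reader never writes to shared state, which is why the fast path needs only read-side ordering rather than the pair of full barriers the old code used.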
> 
> In a page fault test microbenchmark, oprofile samples from
> __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
> actual results were
> 
>                             3.3.0-rc3             3.3.0-rc3
>                             rc3-vanilla           nobarrier-v2r1
>     Clients   1 UserTime          0.07 (  0.00%)        0.08 (-14.19%)
>     Clients   2 UserTime          0.07 (  0.00%)        0.07 (  2.72%)
>     Clients   4 UserTime          0.08 (  0.00%)        0.07 (  3.29%)
>     Clients   1 SysTime           0.70 (  0.00%)        0.65 (  6.65%)
>     Clients   2 SysTime           0.85 (  0.00%)        0.82 (  3.65%)
>     Clients   4 SysTime           1.41 (  0.00%)        1.41 (  0.32%)
>     Clients   1 WallTime          0.77 (  0.00%)        0.74 (  4.19%)
>     Clients   2 WallTime          0.47 (  0.00%)        0.45 (  3.73%)
>     Clients   4 WallTime          0.38 (  0.00%)        0.37 (  1.58%)
>     Clients   1 Flt/sec/cpu  497620.28 (  0.00%)   520294.53 (  4.56%)
>     Clients   2 Flt/sec/cpu  414639.05 (  0.00%)   429882.01 (  3.68%)
>     Clients   4 Flt/sec/cpu  257959.16 (  0.00%)   258761.48 (  0.31%)
>     Clients   1 Flt/sec      495161.39 (  0.00%)   517292.87 (  4.47%)
>     Clients   2 Flt/sec      820325.95 (  0.00%)   850289.77 (  3.65%)
>     Clients   4 Flt/sec     1020068.93 (  0.00%)  1022674.06 (  0.26%)
>     MMTests Statistics: duration
>     Sys Time Running Test (seconds)             135.68    132.17
>     User+Sys Time Running Test (seconds)         164.2    160.13
>     Total Elapsed Time (seconds)                123.46    120.87
> 
> The overall improvement is small but the System CPU time is much
> improved and roughly in correlation to what oprofile reported (these
> performance figures are without profiling so skew is expected).  The
> actual number of page faults is noticeably improved.
> 
> For benchmarks like kernel builds, the overall benefit is marginal but
> the system CPU time is slightly reduced.
> 
> To test the actual bug the commit fixed I opened two terminals.  The
> first ran within a cpuset and continually ran a small program that
> faulted 100M of anonymous data.  In a second window, the nodemask of the
> cpuset was continually randomised in a loop.
> 
> Without the commit, the program would fail every so often (usually
> within 10 seconds) and obviously with the commit everything worked fine.
> With this patch applied, it also worked fine so the fix should be
> functionally equivalent.
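
The test program itself is not included in the commit message. A hypothetical reconstruction of the "fault 100M of anonymous data" part might look like the sketch below; the function name and structure are assumptions, and only mmap(), munmap() and sysconf() are real interfaces.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical reconstruction of the test program: map `bytes` of anonymous
 * memory and write to each page so the kernel has to allocate it.
 * Returns the number of pages touched, or -1 if mmap() fails.
 */
static long fault_anonymous(size_t bytes)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long touched = 0;
	char *buf;
	size_t off;

	assert(page_size > 0);

	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	for (off = 0; off < bytes; off += (size_t)page_size) {
		buf[off] = 1;	/* first write to a page triggers the fault */
		touched++;
	}

	munmap(buf, bytes);
	return touched;
}
```

Calling fault_anonymous(100UL << 20) repeatedly from a process inside the cpuset would approximate what the first terminal was doing; the nodemask randomisation in the second terminal is not sketched here.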
> 
> Signed-off-by: Mel Gorman <mgorman@...e.de>
> Cc: Miao Xie <miaox@...fujitsu.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> Cc: Christoph Lameter <cl@...ux.com>
> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
> Signed-off-by: Mel Gorman <mgorman@...e.de>
> Signed-off-by: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> 
> 
> ---
>  include/linux/cpuset.h    |   49 ++++++++++++++++++----------------------------
>  include/linux/init_task.h |    8 +++++++
>  include/linux/sched.h     |    2 -
>  kernel/cpuset.c           |   43 +++++++---------------------------------
>  kernel/fork.c             |    3 ++
>  mm/filemap.c              |   11 ++++++----
>  mm/hugetlb.c              |   15 ++++++++++----
>  mm/mempolicy.c            |   28 +++++++++++++++++++-------
>  mm/page_alloc.c           |   33 +++++++++++++++++++++---------
>  mm/slab.c                 |   13 +++++++-----
>  mm/slub.c                 |   35 +++++++++++++++++++++-----------
>  mm/vmscan.c               |    2 -
>  12 files changed, 133 insertions(+), 109 deletions(-)
> 
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -89,36 +89,25 @@ extern void rebuild_sched_domains(void);
>  extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>  
>  /*
> - * reading current mems_allowed and mempolicy in the fastpath must protected
> - * by get_mems_allowed()
> + * get_mems_allowed is required when making decisions involving mems_allowed
> + * such as during page allocation. mems_allowed can be updated in parallel
> + * and depending on the new value an operation can fail potentially causing
> + * process failure. A retry loop with get_mems_allowed and put_mems_allowed
> + * prevents these artificial failures.
>   */
> -static inline void get_mems_allowed(void)
> +static inline unsigned int get_mems_allowed(void)
>  {
> -	current->mems_allowed_change_disable++;
> +	return read_seqcount_begin(&current->mems_allowed_seq);
> +}
>  
> -	/*
> -	 * ensure that reading mems_allowed and mempolicy happens after the
> -	 * update of ->mems_allowed_change_disable.
> -	 *
> -	 * the write-side task finds ->mems_allowed_change_disable is not 0,
> -	 * and knows the read-side task is reading mems_allowed or mempolicy,
> -	 * so it will clear old bits lazily.
> -	 */
> -	smp_mb();
> -}
> -
> -static inline void put_mems_allowed(void)
> -{
> -	/*
> -	 * ensure that reading mems_allowed and mempolicy before reducing
> -	 * mems_allowed_change_disable.
> -	 *
> -	 * the write-side task will know that the read-side task is still
> -	 * reading mems_allowed or mempolicy, don't clears old bits in the
> -	 * nodemask.
> -	 */
> -	smp_mb();
> -	--ACCESS_ONCE(current->mems_allowed_change_disable);
> +/*
> + * If this returns false, the operation that took place after get_mems_allowed
> + * may have failed. It is up to the caller to retry the operation if
> + * appropriate.
> + */
> +static inline bool put_mems_allowed(unsigned int seq)
> +{
> +	return !read_seqcount_retry(&current->mems_allowed_seq, seq);
>  }
>  
>  static inline void set_mems_allowed(nodemask_t nodemask)
> @@ -234,12 +223,14 @@ static inline void set_mems_allowed(node
>  {
>  }
>  
> -static inline void get_mems_allowed(void)
> +static inline unsigned int get_mems_allowed(void)
>  {
> +	return 0;
>  }
>  
> -static inline void put_mems_allowed(void)
> +static inline bool put_mems_allowed(unsigned int seq)
>  {
> +	return true;
>  }
>  
>  #endif /* !CONFIG_CPUSETS */
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -30,6 +30,13 @@ extern struct fs_struct init_fs;
>  #define INIT_THREADGROUP_FORK_LOCK(sig)
>  #endif
>  
> +#ifdef CONFIG_CPUSETS
> +#define INIT_CPUSET_SEQ							\
> +	.mems_allowed_seq = SEQCNT_ZERO,
> +#else
> +#define INIT_CPUSET_SEQ
> +#endif
> +
>  #define INIT_SIGNALS(sig) {						\
>  	.nr_threads	= 1,						\
>  	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
> @@ -193,6 +200,7 @@ extern struct cred init_cred;
>  	INIT_FTRACE_GRAPH						\
>  	INIT_TRACE_RECURSION						\
>  	INIT_TASK_RCU_PREEMPT(tsk)					\
> +	INIT_CPUSET_SEQ							\
>  }
>  
>  
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1484,7 +1484,7 @@ struct task_struct {
>  #endif
>  #ifdef CONFIG_CPUSETS
>  	nodemask_t mems_allowed;	/* Protected by alloc_lock */
> -	int mems_allowed_change_disable;
> +	seqcount_t mems_allowed_seq;	/* Seqence no to catch updates */
>  	int cpuset_mem_spread_rotor;
>  	int cpuset_slab_spread_rotor;
>  #endif
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask(
>  {
>  	bool need_loop;
>  
> -repeat:
>  	/*
>  	 * Allow tasks that have access to memory reserves because they have
>  	 * been OOM killed to get memory anywhere.
> @@ -983,45 +982,19 @@ repeat:
>  	 */
>  	need_loop = task_has_mempolicy(tsk) ||
>  			!nodes_intersects(*newmems, tsk->mems_allowed);
> -	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> -	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
>  
> -	/*
> -	 * ensure checking ->mems_allowed_change_disable after setting all new
> -	 * allowed nodes.
> -	 *
> -	 * the read-side task can see an nodemask with new allowed nodes and
> -	 * old allowed nodes. and if it allocates page when cpuset clears newly
> -	 * disallowed ones continuous, it can see the new allowed bits.
> -	 *
> -	 * And if setting all new allowed nodes is after the checking, setting
> -	 * all new allowed nodes and clearing newly disallowed ones will be done
> -	 * continuous, and the read-side task may find no node to alloc page.
> -	 */
> -	smp_mb();
> -
> -	/*
> -	 * Allocation of memory is very fast, we needn't sleep when waiting
> -	 * for the read-side.
> -	 */
> -	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
> -		task_unlock(tsk);
> -		if (!task_curr(tsk))
> -			yield();
> -		goto repeat;
> -	}
> +	if (need_loop)
> +		write_seqcount_begin(&tsk->mems_allowed_seq);
>  
> -	/*
> -	 * ensure checking ->mems_allowed_change_disable before clearing all new
> -	 * disallowed nodes.
> -	 *
> -	 * if clearing newly disallowed bits before the checking, the read-side
> -	 * task may find no node to alloc page.
> -	 */
> -	smp_mb();
> +	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> +	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
>  
>  	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
>  	tsk->mems_allowed = *newmems;
> +
> +	if (need_loop)
> +		write_seqcount_end(&tsk->mems_allowed_seq);
> +
>  	task_unlock(tsk);
>  }
>  
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -985,6 +985,9 @@ static int copy_signal(unsigned long clo
>  #ifdef CONFIG_CGROUPS
>  	init_rwsem(&sig->threadgroup_fork_lock);
>  #endif
> +#ifdef CONFIG_CPUSETS
> +	seqcount_init(&tsk->mems_allowed_seq);
> +#endif
>  
>  	sig->oom_adj = current->signal->oom_adj;
>  	sig->oom_score_adj = current->signal->oom_score_adj;
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -516,10 +516,13 @@ struct page *__page_cache_alloc(gfp_t gf
>  	struct page *page;
>  
>  	if (cpuset_do_page_mem_spread()) {
> -		get_mems_allowed();
> -		n = cpuset_mem_spread_node();
> -		page = alloc_pages_exact_node(n, gfp, 0);
> -		put_mems_allowed();
> +		unsigned int cpuset_mems_cookie;
> +		do {
> +			cpuset_mems_cookie = get_mems_allowed();
> +			n = cpuset_mem_spread_node();
> +			page = alloc_pages_exact_node(n, gfp, 0);
> +		} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
> +
>  		return page;
>  	}
>  	return alloc_pages(gfp, 0);
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vm
>  				struct vm_area_struct *vma,
>  				unsigned long address, int avoid_reserve)
>  {
> -	struct page *page = NULL;
> +	struct page *page;
>  	struct mempolicy *mpol;
>  	nodemask_t *nodemask;
>  	struct zonelist *zonelist;
>  	struct zone *zone;
>  	struct zoneref *z;
> +	unsigned int cpuset_mems_cookie;
>  
> -	get_mems_allowed();
> +retry_cpuset:
> +	cpuset_mems_cookie = get_mems_allowed();
>  	zonelist = huge_zonelist(vma, address,
>  					htlb_alloc_mask, &mpol, &nodemask);
>  	/*
> @@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vm
>  			}
>  		}
>  	}
> -err:
> +
>  	mpol_cond_put(mpol);
> -	put_mems_allowed();
> +	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +		goto retry_cpuset;
>  	return page;
> +
> +err:
> +	mpol_cond_put(mpol);
> +	return NULL;
>  }
>  
>  static void update_and_free_page(struct hstate *h, struct page *page)
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1810,18 +1810,24 @@ struct page *
>  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		unsigned long addr, int node)
>  {
> -	struct mempolicy *pol = get_vma_policy(current, vma, addr);
> +	struct mempolicy *pol;
>  	struct zonelist *zl;
>  	struct page *page;
> +	unsigned int cpuset_mems_cookie;
> +
> +retry_cpuset:
> +	pol = get_vma_policy(current, vma, addr);
> +	cpuset_mems_cookie = get_mems_allowed();
>  
> -	get_mems_allowed();
>  	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
>  		unsigned nid;
>  
>  		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
>  		mpol_cond_put(pol);
>  		page = alloc_page_interleave(gfp, order, nid);
> -		put_mems_allowed();
> +		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +			goto retry_cpuset;
> +
>  		return page;
>  	}
>  	zl = policy_zonelist(gfp, pol, node);
> @@ -1832,7 +1838,8 @@ alloc_pages_vma(gfp_t gfp, int order, st
>  		struct page *page =  __alloc_pages_nodemask(gfp, order,
>  						zl, policy_nodemask(gfp, pol));
>  		__mpol_put(pol);
> -		put_mems_allowed();
> +		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +			goto retry_cpuset;
>  		return page;
>  	}
>  	/*
> @@ -1840,7 +1847,8 @@ alloc_pages_vma(gfp_t gfp, int order, st
>  	 */
>  	page = __alloc_pages_nodemask(gfp, order, zl,
>  				      policy_nodemask(gfp, pol));
> -	put_mems_allowed();
> +	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +		goto retry_cpuset;
>  	return page;
>  }
>  
> @@ -1867,11 +1875,14 @@ struct page *alloc_pages_current(gfp_t g
>  {
>  	struct mempolicy *pol = current->mempolicy;
>  	struct page *page;
> +	unsigned int cpuset_mems_cookie;
>  
>  	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
>  		pol = &default_policy;
>  
> -	get_mems_allowed();
> +retry_cpuset:
> +	cpuset_mems_cookie = get_mems_allowed();
> +
>  	/*
>  	 * No reference counting needed for current->mempolicy
>  	 * nor system default_policy
> @@ -1882,7 +1893,10 @@ struct page *alloc_pages_current(gfp_t g
>  		page = __alloc_pages_nodemask(gfp, order,
>  				policy_zonelist(gfp, pol, numa_node_id()),
>  				policy_nodemask(gfp, pol));
> -	put_mems_allowed();
> +
> +	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +		goto retry_cpuset;
> +
>  	return page;
>  }
>  EXPORT_SYMBOL(alloc_pages_current);
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2293,8 +2293,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
>  {
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>  	struct zone *preferred_zone;
> -	struct page *page;
> +	struct page *page = NULL;
>  	int migratetype = allocflags_to_migratetype(gfp_mask);
> +	unsigned int cpuset_mems_cookie;
>  
>  	gfp_mask &= gfp_allowed_mask;
>  
> @@ -2313,15 +2314,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
>  	if (unlikely(!zonelist->_zonerefs->zone))
>  		return NULL;
>  
> -	get_mems_allowed();
> +retry_cpuset:
> +	cpuset_mems_cookie = get_mems_allowed();
> +
>  	/* The preferred zone is used for statistics later */
>  	first_zones_zonelist(zonelist, high_zoneidx,
>  				nodemask ? : &cpuset_current_mems_allowed,
>  				&preferred_zone);
> -	if (!preferred_zone) {
> -		put_mems_allowed();
> -		return NULL;
> -	}
> +	if (!preferred_zone)
> +		goto out;
>  
>  	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> @@ -2331,9 +2332,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
>  		page = __alloc_pages_slowpath(gfp_mask, order,
>  				zonelist, high_zoneidx, nodemask,
>  				preferred_zone, migratetype);
> -	put_mems_allowed();
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> +
> +out:
> +	/*
> +	 * When updating a task's mems_allowed, it is possible to race with
> +	 * parallel threads in such a way that an allocation can fail while
> +	 * the mask is being updated. If a page allocation is about to fail,
> +	 * check if the cpuset changed during allocation and if so, retry.
> +	 */
> +	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> +		goto retry_cpuset;
> +
>  	return page;
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2557,13 +2568,15 @@ void si_meminfo_node(struct sysinfo *val
>  bool skip_free_areas_node(unsigned int flags, int nid)
>  {
>  	bool ret = false;
> +	unsigned int cpuset_mems_cookie;
>  
>  	if (!(flags & SHOW_MEM_FILTER_NODES))
>  		goto out;
>  
> -	get_mems_allowed();
> -	ret = !node_isset(nid, cpuset_current_mems_allowed);
> -	put_mems_allowed();
> +	do {
> +		cpuset_mems_cookie = get_mems_allowed();
> +		ret = !node_isset(nid, cpuset_current_mems_allowed);
> +	} while (!put_mems_allowed(cpuset_mems_cookie));
>  out:
>  	return ret;
>  }
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3218,12 +3218,10 @@ static void *alternate_node_alloc(struct
>  	if (in_interrupt() || (flags & __GFP_THISNODE))
>  		return NULL;
>  	nid_alloc = nid_here = numa_mem_id();
> -	get_mems_allowed();
>  	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
>  		nid_alloc = cpuset_slab_spread_node();
>  	else if (current->mempolicy)
>  		nid_alloc = slab_node(current->mempolicy);
> -	put_mems_allowed();
>  	if (nid_alloc != nid_here)
>  		return ____cache_alloc_node(cachep, flags, nid_alloc);
>  	return NULL;
> @@ -3246,14 +3244,17 @@ static void *fallback_alloc(struct kmem_
>  	enum zone_type high_zoneidx = gfp_zone(flags);
>  	void *obj = NULL;
>  	int nid;
> +	unsigned int cpuset_mems_cookie;
>  
>  	if (flags & __GFP_THISNODE)
>  		return NULL;
>  
> -	get_mems_allowed();
> -	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
>  	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>  
> +retry_cpuset:
> +	cpuset_mems_cookie = get_mems_allowed();
> +	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> +
>  retry:
>  	/*
>  	 * Look through allowed nodes for objects available
> @@ -3306,7 +3307,9 @@ retry:
>  			}
>  		}
>  	}
> -	put_mems_allowed();
> +
> +	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
> +		goto retry_cpuset;
>  	return obj;
>  }
>  
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1457,6 +1457,7 @@ static struct page *get_any_partial(stru
>  	struct zone *zone;
>  	enum zone_type high_zoneidx = gfp_zone(flags);
>  	struct page *page;
> +	unsigned int cpuset_mems_cookie;
>  
>  	/*
>  	 * The defrag ratio allows a configuration of the tradeoffs between
> @@ -1480,22 +1481,32 @@ static struct page *get_any_partial(stru
>  			get_cycles() % 1024 > s->remote_node_defrag_ratio)
>  		return NULL;
>  
> -	get_mems_allowed();
> -	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> -		struct kmem_cache_node *n;
> +	do {
> +		cpuset_mems_cookie = get_mems_allowed();
> +		zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> +		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> +			struct kmem_cache_node *n;
>  
> -		n = get_node(s, zone_to_nid(zone));
> +			n = get_node(s, zone_to_nid(zone));
>  
> -		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
> -				n->nr_partial > s->min_partial) {
> -			page = get_partial_node(n);
> -			if (page) {
> -				put_mems_allowed();
> -				return page;
> +			if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
> +					n->nr_partial > s->min_partial) {
> +				page = get_partial_node(n);
> +				if (page) {
> +					/*
> +					 * Return the object even if
> +					 * put_mems_allowed indicated that
> +					 * the cpuset mems_allowed was
> +					 * updated in parallel. It's a
> +					 * harmless race between the alloc
> +					 * and the cpuset update.
> +					 */
> +					put_mems_allowed(cpuset_mems_cookie);
> +					return page;
> +				}
>  			}
>  		}
> -	}
> +	} while (!put_mems_allowed(cpuset_mems_cookie));
>  	put_mems_allowed();

This doesn't build on 3.0: the backport left the stray put_mems_allowed()
call above, which no longer matches the new one-argument prototype:

linux-stable/mm/slub.c: In function 'get_any_partial':
linux-stable/mm/slub.c:1510:2: error: too few arguments to function 'put_mems_allowed'
linux-stable/include/linux/cpuset.h:108:20: note: declared here
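
To illustrate why the leftover call no longer compiles, here is a small stand-alone sketch of the calling convention the patch introduces. It is not kernel code: the sketch_* functions and the mock allocator are hypothetical, and the sequence counter is reduced to a plain integer. The point is only that put_mems_allowed() now takes the cookie returned by get_mems_allowed() and reports whether the mask changed in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for current->mems_allowed_seq; not kernel code. */
static unsigned int sketch_seq;

static unsigned int sketch_get_mems_allowed(void)
{
	return sketch_seq;
}

/* Returns false if the "nodemask" changed since the cookie was taken. */
static bool sketch_put_mems_allowed(unsigned int cookie)
{
	return sketch_seq == cookie;
}

/*
 * Mock allocator: the first call fails and bumps the sequence, imitating a
 * nodemask update racing with the allocation; later calls succeed.
 */
static void *sketch_alloc(int *attempts)
{
	static char page[1];

	if ((*attempts)++ == 0) {
		sketch_seq += 2;
		return NULL;
	}
	return page;
}

/* Same shape as the loops in the patch: retry only on failure plus change. */
static void *sketch_alloc_retry(int *attempts)
{
	unsigned int cookie;
	void *page;

	do {
		cookie = sketch_get_mems_allowed();
		page = sketch_alloc(attempts);
	} while (!sketch_put_mems_allowed(cookie) && !page);

	return page;
}
```

A call with no argument, like the one left behind in get_any_partial(), has no cookie to compare against, hence the "too few arguments" error.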

>  #endif
>  	return NULL;
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,7 +2251,6 @@ static unsigned long do_try_to_free_page
>  	unsigned long writeback_threshold;
>  	bool aborted_reclaim;
>  
> -	get_mems_allowed();
>  	delayacct_freepages_start();
>  
>  	if (scanning_global_lru(sc))
> @@ -2314,7 +2313,6 @@ static unsigned long do_try_to_free_page
>  
>  out:
>  	delayacct_freepages_end();
> -	put_mems_allowed();
>  
>  	if (sc->nr_reclaimed)
>  		return sc->nr_reclaimed;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe stable" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
[]'s
Herton