[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20120306122657.8e5b128d.akpm@linux-foundation.org>
Date: Tue, 6 Mar 2012 12:26:57 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Mel Gorman <mgorman@...e.de>
Cc: David Rientjes <rientjes@...gle.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Christoph Lameter <cl@...ux.com>,
Miao Xie <miaox@...fujitsu.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cpuset: mm: Reduce large amounts of memory barrier
related damage v2
On Tue, 6 Mar 2012 13:27:35 +0000
Mel Gorman <mgorman@...e.de> wrote:
> Changelog since V1
> o Use seqcount with rmb instead of atomics (Peter, Christoph)
>
> Commit [c0ff7453: cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems] wins a super prize for the largest number of memory
> barriers entered into fast paths for one commit. [get|put]_mems_allowed
> is incredibly heavy with pairs of full memory barriers inserted into a
> number of hot paths. This was detected while investigating at large page
> allocator slowdown introduced some time after 2.6.32. The largest portion
> of this overhead was shown by oprofile to be at an mfence introduced by
> this commit into the page allocator hot path.
>
> For extra style points, the commit introduced the use of yield() in an
> implementation of what looks like a spinning mutex.
>
> This patch replaces the full memory barriers on both read and write sides
> with a sequence counter with just read barriers on the fast path side.
> This is much cheaper on some architectures, including x86. The main bulk
> of the patch is the retry logic if the nodemask changes in a manner that
> can cause a false failure.
>
> While updating the nodemask, a check is made to see if a false failure is
> a risk. If it is, the sequence number gets bumped and parallel allocators
> will briefly stall while the nodemask update takes place.
>
> In a page fault test microbenchmark, oprofile samples from
> __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual
> results were
>
> 3.3.0-rc3 3.3.0-rc3
> rc3-vanilla nobarrier-v2r1
> Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
> Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
> Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
> Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
> Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
> Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
> Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
> Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
> Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
> Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
> Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
> Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
> Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
> Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
> Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
> MMTests Statistics: duration
> Sys Time Running Test (seconds) 135.68 132.17
> User+Sys Time Running Test (seconds) 164.2 160.13
> Total Elapsed Time (seconds) 123.46 120.87
>
> The overall improvement is small but the System CPU time is much improved
> and roughly in correlation to what oprofile reported (these performance
> figures are without profiling so skew is expected). The actual number of
> page faults is noticeably improved.
>
> For benchmarks like kernel builds, the overall benefit is marginal but
> the system CPU time is slightly reduced.
>
> To test the actual bug the commit fixed I opened two terminals. The first
> ran within a cpuset and continually ran a small program that faulted 100M
> of anonymous data. In a second window, the nodemask of the cpuset was
> continually randomised in a loop. Without the commit, the program would
> fail every so often (usually within 10 seconds) and obviously with the
> commit everything worked fine. With this patch applied, it also worked
> fine so the fix should be functionally equivalent.
>
>
> ...
>
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -92,33 +92,19 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
> * reading current mems_allowed and mempolicy in the fastpath must protected
> * by get_mems_allowed()
> */
> -static inline void get_mems_allowed(void)
> +static inline unsigned int get_mems_allowed(void)
> {
> - current->mems_allowed_change_disable++;
> -
> - /*
> - * ensure that reading mems_allowed and mempolicy happens after the
> - * update of ->mems_allowed_change_disable.
> - *
> - * the write-side task finds ->mems_allowed_change_disable is not 0,
> - * and knows the read-side task is reading mems_allowed or mempolicy,
> - * so it will clear old bits lazily.
> - */
> - smp_mb();
> + return read_seqcount_begin(¤t->mems_allowed_seq);
> }
Perhaps we could tickle up the interface documentation? The current
"documentation" is a grammatical mess and has a typo.
> -static inline void put_mems_allowed(void)
> +/*
> + * If this returns false, the operation that took place after get_mems_allowed
> + * may have failed. It is up to the caller to retry the operation if
> + * appropriate
> + */
> +static inline bool put_mems_allowed(unsigned int seq)
> {
> - /*
> - * ensure that reading mems_allowed and mempolicy before reducing
> - * mems_allowed_change_disable.
> - *
> - * the write-side task will know that the read-side task is still
> - * reading mems_allowed or mempolicy, don't clears old bits in the
> - * nodemask.
> - */
> - smp_mb();
> - --ACCESS_ONCE(current->mems_allowed_change_disable);
> + return !read_seqcount_retry(¤t->mems_allowed_seq, seq);
> }
>
> static inline void set_mems_allowed(nodemask_t nodemask)
How come set_mems_allowed() still uses task_lock()?
> @@ -234,12 +220,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> {
> }
>
> -static inline void get_mems_allowed(void)
> +static inline unsigned int get_mems_allowed(void)
> {
> + return 0;
> }
>
> -static inline void put_mems_allowed(void)
> +static inline bool put_mems_allowed(unsigned int seq)
> {
> + return true;
> }
>
> #endif /* !CONFIG_CPUSETS */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7d379a6..a0bb87a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1498,7 +1498,7 @@ struct task_struct {
> #endif
> #ifdef CONFIG_CPUSETS
> nodemask_t mems_allowed; /* Protected by alloc_lock */
> - int mems_allowed_change_disable;
> + seqcount_t mems_allowed_seq; /* Seqence no to catch updates */
mems_allowed_seq never gets initialised. That happens to be OK as
we're never using its spinlock. But that's sloppy, and adding an
initialisation to INIT_TASK() is free. But will copying a spinlock by
value upset lockdep? To be fully anal we should run seqlock_init()
against each new task_struct.
> int cpuset_mem_spread_rotor;
> int cpuset_slab_spread_rotor;
> #endif
>
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -498,12 +498,15 @@ struct page *__page_cache_alloc(gfp_t gfp)
> {
> int n;
> struct page *page;
> + unsigned int cpuset_mems_cookie;
>
> if (cpuset_do_page_mem_spread()) {
> - get_mems_allowed();
> - n = cpuset_mem_spread_node();
> - page = alloc_pages_exact_node(n, gfp, 0);
> - put_mems_allowed();
> + do {
> + cpuset_mems_cookie = get_mems_allowed();
> + n = cpuset_mem_spread_node();
> + page = alloc_pages_exact_node(n, gfp, 0);
> + } while (!put_mems_allowed(cpuset_mems_cookie) && !page);
It would be a little tidier to move cpuset_mems_cookie's scope inwards.
> return page;
> }
> return alloc_pages(gfp, 0);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5f34bd8..5f1e959 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -460,8 +460,10 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> struct zonelist *zonelist;
> struct zone *zone;
> struct zoneref *z;
> + unsigned int cpuset_mems_cookie;
>
> - get_mems_allowed();
> +retry_cpuset:
> + cpuset_mems_cookie = get_mems_allowed();
> zonelist = huge_zonelist(vma, address,
> htlb_alloc_mask, &mpol, &nodemask);
> /*
> @@ -490,7 +492,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> }
> err:
> mpol_cond_put(mpol);
> - put_mems_allowed();
> + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> + goto retry_cpuset;
> return page;
> }
We didn't really want to retry the allocation if dequeue_huge_page_vma() has
made one of its "goto err" decisions.
>
> ...
>
> @@ -2416,9 +2417,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_slowpath(gfp_mask, order,
> zonelist, high_zoneidx, nodemask,
> preferred_zone, migratetype);
> - put_mems_allowed();
>
> trace_mm_page_alloc(page, order, gfp_mask, migratetype);
> +
> +out:
> + /*
> + * When updating a tasks mems_allowed, it is possible to race with
"task's"
> + * parallel threads in such a way that an allocation can fail while
> + * the mask is being updated. If a page allocation is about to fail,
> + * check if the cpuset changed during allocation and if so, retry.
> + */
> + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
> + goto retry_cpuset;
> +
> return page;
> }
> EXPORT_SYMBOL(__alloc_pages_nodemask);
>
> ...
>
> @@ -3312,11 +3310,14 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> enum zone_type high_zoneidx = gfp_zone(flags);
> void *obj = NULL;
> int nid;
> + unsigned int cpuset_mems_cookie;
>
> if (flags & __GFP_THISNODE)
> return NULL;
>
> - get_mems_allowed();
> +retry_cpuset:
> + cpuset_mems_cookie = get_mems_allowed();
> +
> zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>
> @@ -3372,7 +3373,9 @@ retry:
> }
> }
> }
> - put_mems_allowed();
> +
> + if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj))
> + goto retry_cpuset;
We recalculate `zonelist' and `local_flags' each time around the loop.
The former is probably unnecessary and the latter is surely so. I'd
expect gcc to fix the `local_flags' one.
> return obj;
> }
>
> ...
>
> @@ -1604,23 +1605,24 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags,
> get_cycles() % 1024 > s->remote_node_defrag_ratio)
> return NULL;
>
> - get_mems_allowed();
> - zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> - struct kmem_cache_node *n;
> -
> - n = get_node(s, zone_to_nid(zone));
> -
> - if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
> - n->nr_partial > s->min_partial) {
> - object = get_partial_node(s, n, c);
> - if (object) {
> - put_mems_allowed();
> - return object;
> + do {
> + cpuset_mems_cookie = get_mems_allowed();
> + zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> + struct kmem_cache_node *n;
> +
> + n = get_node(s, zone_to_nid(zone));
> +
> + if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
> + n->nr_partial > s->min_partial) {
> + object = get_partial_node(s, n, c);
> + if (object) {
> + put_mems_allowed(cpuset_mems_cookie);
> + return object;
Confused. If put_mems_allowed() returned false, doesn't that mean the
result is unstable and we should retry? Needs a comment explaining
what's going on?
> + }
> }
> }
> - }
> - put_mems_allowed();
> + } while (!put_mems_allowed(cpuset_mems_cookie));
> #endif
> return NULL;
> }
>
> ...
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists