Message-ID: <cf18f8341003272230l7739c182s40191cde5c8bf4de@mail.gmail.com>
Date: Sun, 28 Mar 2010 13:30:13 +0800
From: Bob Liu <lliubbo@...il.com>
To: miaox@...fujitsu.com
Cc: Nick Piggin <npiggin@...e.de>,
David Rientjes <rientjes@...gle.com>,
Lee Schermerhorn <lee.schermerhorn@...com>,
Paul Menage <menage@...gle.com>,
Linux-Kernel <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] [PATCH -mmotm] cpuset,mm: use seqlock to protect
task->mempolicy and mems_allowed (v2) (was: Re: [PATCH V2 4/4] cpuset,mm:
update task's mems_allowed lazily)
On Thu, Mar 25, 2010 at 9:33 PM, Miao Xie <miaox@...fujitsu.com> wrote:
> on 2010-3-11 19:03, Nick Piggin wrote:
>>> Ok, I try to make a new patch by using seqlock.
>>
>> Well... I do think seqlocks would be a bit simpler because they don't
>> require this checking and synchronizing of this patch.
> Hi, Nick Piggin
>
> I have made a new patch which uses seqlock to protect mems_allowed and mempolicy.
> please review it.
>
> Subject: [PATCH] [PATCH -mmotm] cpuset,mm: use seqlock to protect task->mempolicy and mems_allowed (v2)
>
> Before applying this patch, cpuset updates task->mems_allowed by setting all
> new bits in the nodemask first, and clearing all old disallowed bits later.
> But done this way, the allocator can see an empty nodemask, though it is infrequent.
>
> The problem is the following:
> The size of nodemask_t is greater than the size of a long integer, so loads
> and stores of nodemask_t are not atomic operations. Suppose task->mems_allowed
> doesn't intersect with new_mask: for example, the first word of the old mask
> is empty and only the first word of new_mask is non-empty. When the allocator
> loads a word of the mask before
>
> current->mems_allowed |= new_mask;
>
> and then loads another word of the mask after
>
> current->mems_allowed = new_mask;
>
> the allocator gets an empty nodemask.
>
> Besides that, if the size of nodemask_t is less than the size of a long integer,
> there is another problem. When the kernel allocator invokes the following function,
>
> struct zoneref *next_zones_zonelist(struct zoneref *z,
> enum zone_type highest_zoneidx,
> nodemask_t *nodes,
> struct zone **zone)
> {
> /*
> * Find the next suitable zone to use for the allocation.
> * Only filter based on nodemask if it's set
> */
> if (likely(nodes == NULL))
> ......
> else
> while (zonelist_zone_idx(z) > highest_zoneidx ||
> (z->zone && !zref_in_nodemask(z, nodes)))
> z++;
>
> *zone = zonelist_zone(z);
> return z;
> }
>
> if the nodemask changes between two calls of zref_in_nodemask(), for example:
> Task1 Task2
> zref_in_nodemask(z = node0's z, nodes = 1-2)
> zref_in_nodemask return 0
> nodes = 0
> zref_in_nodemask(z = node1's z, nodes = 0)
> zref_in_nodemask return 0
> z will overflow.
>
> The same problem exists when the kernel allocator accesses task->mempolicy.
>
> The following method is used to fix these two problems:
> A seqlock protects the task's mempolicy and mems_allowed on configs where
> MAX_NUMNODES > BITS_PER_LONG. When the kernel allocator accesses the nodemask,
> it takes the seqlock, makes a copy of the nodemask, and then passes that copy
> to the memory allocating function.
>
> Signed-off-by: Miao Xie <miaox@...fujitsu.com>
> ---
> include/linux/cpuset.h | 79 +++++++++++++++++++++++-
> include/linux/init_task.h | 8 +++
> include/linux/sched.h | 17 ++++-
> kernel/cpuset.c | 97 +++++++++++++++++++++++-------
> kernel/exit.c | 4 +
> kernel/fork.c | 4 +
> mm/hugetlb.c | 22 ++++++-
> mm/mempolicy.c | 144 ++++++++++++++++++++++++++++++++++-----------
> mm/slab.c | 26 +++++++-
> mm/slub.c | 12 ++++-
> 10 files changed, 341 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index a5740fc..e307f89 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -53,8 +53,8 @@ static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
> return cpuset_node_allowed_hardwall(zone_to_nid(z), gfp_mask);
> }
>
> -extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
> - const struct task_struct *tsk2);
> +extern int cpuset_mems_allowed_intersects(struct task_struct *tsk1,
> + struct task_struct *tsk2);
>
> #define cpuset_memory_pressure_bump() \
> do { \
> @@ -90,9 +90,68 @@ extern void rebuild_sched_domains(void);
>
> extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>
> +# if MAX_NUMNODES > BITS_PER_LONG
> +/*
> + * Used to protect task->mempolicy and mems_allowed when reading them in
> + * the page allocation paths.
> + *
> + * We don't care if the kernel page allocator allocates a page on a node in
> + * the old mems_allowed; that isn't a big deal, especially since the node
> + * was previously allowed.
> + *
> + * We only worry about the kernel page allocator seeing an empty
> + * mems_allowed.
> + * If MAX_NUMNODES <= BITS_PER_LONG, loading/storing task->mems_allowed is
> + * atomic, so we needn't do anything to protect the loading of
> + * task->mems_allowed in fastpaths.
> + *
> + * If MAX_NUMNODES > BITS_PER_LONG, loading/storing task->mems_allowed is
> + * not atomic, so we use a seqlock to protect the loading of
> + * task->mems_allowed in fastpaths.
> + */
> +#define mems_fastpath_lock_irqsave(p, flags) \
> + ({ \
> + read_seqbegin_irqsave(&p->mems_seqlock, flags); \
> + })
> +
> +#define mems_fastpath_unlock_irqrestore(p, seq, flags) \
> + ({ \
> + read_seqretry_irqrestore(&p->mems_seqlock, seq, flags); \
> + })
> +
> +#define mems_slowpath_lock_irqsave(p, flags) \
> + do { \
> + write_seqlock_irqsave(&p->mems_seqlock, flags); \
> + } while (0)
> +
> +#define mems_slowpath_unlock_irqrestore(p, flags) \
> + do { \
> + write_sequnlock_irqrestore(&p->mems_seqlock, flags); \
> + } while (0)
> +# else
> +#define mems_fastpath_lock_irqsave(p, flags) ({ (void)(flags); 0; })
> +
> +#define mems_fastpath_unlock_irqrestore(p, flags) ({ (void)(flags); 0; })
> +
> +#define mems_slowpath_lock_irqsave(p, flags) \
> + do { \
> + task_lock(p); \
> + (void)(flags); \
> + } while (0)
> +
> +#define mems_slowpath_unlock_irqrestore(p, flags) \
> + do { \
> + task_unlock(p); \
> + (void)(flags); \
> + } while (0)
> +# endif
> +
> static inline void set_mems_allowed(nodemask_t nodemask)
> {
> + unsigned long flags;
> + mems_slowpath_lock_irqsave(current, flags);
> current->mems_allowed = nodemask;
> + mems_slowpath_unlock_irqrestore(current, flags);
> }
>
> #else /* !CONFIG_CPUSETS */
> @@ -144,8 +203,8 @@ static inline int cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask)
> return 1;
> }
>
> -static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
> - const struct task_struct *tsk2)
> +static inline int cpuset_mems_allowed_intersects(struct task_struct *tsk1,
> + struct task_struct *tsk2)
> {
> return 1;
> }
> @@ -193,6 +252,18 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> {
> }
>
> +#define mems_fastpath_lock_irqsave(p, flags) \
> + ({ (void)(flags); 0; })
> +
> +#define mems_fastpath_unlock_irqrestore(p, seq, flags) \
> + ({ (void)(flags); 0; })
> +
> +#define mems_slowpath_lock_irqsave(p, flags) \
> + do { (void)(flags); } while (0)
> +
> +#define mems_slowpath_unlock_irqrestore(p, flags) \
> + do { (void)(flags); } while (0)
> +
> #endif /* !CONFIG_CPUSETS */
>
> #endif /* _LINUX_CPUSET_H */
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 1ed6797..0394e20 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -102,6 +102,13 @@ extern struct cred init_cred;
> # define INIT_PERF_EVENTS(tsk)
> #endif
>
> +#if defined(CONFIG_CPUSETS) && MAX_NUMNODES > BITS_PER_LONG
> +# define INIT_MEM_SEQLOCK(tsk) \
> + .mems_seqlock = __SEQLOCK_UNLOCKED(tsk.mems_seqlock),
> +#else
> +# define INIT_MEM_SEQLOCK(tsk)
> +#endif
> +
> /*
> * INIT_TASK is used to set up the first task table, touch at
> * your own risk!. Base=0, limit=0x1fffff (=2MB)
> @@ -171,6 +178,7 @@ extern struct cred init_cred;
> INIT_FTRACE_GRAPH \
> INIT_TRACE_RECURSION \
> INIT_TASK_RCU_PREEMPT(tsk) \
> + INIT_MEM_SEQLOCK(tsk) \
> }
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84b8c22..1cf5fd3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1356,8 +1356,9 @@ struct task_struct {
> /* Thread group tracking */
> u32 parent_exec_id;
> u32 self_exec_id;
> -/* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
> - * mempolicy */
> +/* Protection of (de-)allocation: mm, files, fs, tty, keyrings.
> + * If MAX_NUMNODES <= BITS_PER_LONG, it also protects mems_allowed and
> + * mempolicy; otherwise the seqlock mems_seqlock protects them. */
> spinlock_t alloc_lock;
>
> #ifdef CONFIG_GENERIC_HARDIRQS
> @@ -1425,7 +1426,13 @@ struct task_struct {
> cputime_t acct_timexpd; /* stime + utime since last update */
> #endif
> #ifdef CONFIG_CPUSETS
> - nodemask_t mems_allowed; /* Protected by alloc_lock */
> +# if MAX_NUMNODES > BITS_PER_LONG
> + /* Protection of mems_allowed, and mempolicy */
> + seqlock_t mems_seqlock;
> +# endif
> + /* If MAX_NUMNODES <= BITS_PER_LONG, protected by alloc_lock;
> + * else protected by mems_seqlock */
> + nodemask_t mems_allowed;
> int cpuset_mem_spread_rotor;
> #endif
> #ifdef CONFIG_CGROUPS
> @@ -1448,7 +1455,9 @@ struct task_struct {
> struct list_head perf_event_list;
> #endif
> #ifdef CONFIG_NUMA
> - struct mempolicy *mempolicy; /* Protected by alloc_lock */
> + /* If MAX_NUMNODES <= BITS_PER_LONG, protected by alloc_lock;
> + * else protected by mems_seqlock */
> + struct mempolicy *mempolicy;
> short il_next;
> #endif
> atomic_t fs_excl; /* holding fs exclusive resources */
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index d109467..52e6f51 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -198,12 +198,13 @@ static struct cpuset top_cpuset = {
> * from one of the callbacks into the cpuset code from within
> * __alloc_pages().
> *
> - * If a task is only holding callback_mutex, then it has read-only
> - * access to cpusets.
> + * If a task is only holding callback_mutex or cgroup_mutex, then it has
> + * read-only access to cpusets.
> *
> * Now, the task_struct fields mems_allowed and mempolicy may be changed
> - * by other task, we use alloc_lock in the task_struct fields to protect
> - * them.
> + * by another task; we use alloc_lock (if MAX_NUMNODES <= BITS_PER_LONG) or
> + * mems_seqlock (if MAX_NUMNODES > BITS_PER_LONG) in the task_struct to
> + * protect them.
> *
> * The cpuset_common_file_read() handlers only hold callback_mutex across
> * small pieces of code, such as when reading out possibly multi-word
> @@ -920,6 +921,10 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
> * call to guarantee_online_mems(), as we know no one is changing
> * our task's cpuset.
> *
> + * As the above comment says, no one can change the current task's mems_allowed
> + * except the task itself, so we needn't hold a lock to protect the task's
> + * mems_allowed during this call.
> + *
> * While the mm_struct we are migrating is typically from some
> * other task, the task_struct mems_allowed that we are hacking
> * is for our current task, which must allocate new pages for that
> @@ -947,15 +952,13 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> * we structure updates as setting all new allowed nodes, then clearing newly
> * disallowed ones.
> *
> - * Called with task's alloc_lock held
> + * Called with mems_slowpath_lock held
> */
> static void cpuset_change_task_nodemask(struct task_struct *tsk,
> nodemask_t *newmems)
> {
> - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> - mpol_rebind_task(tsk, &tsk->mems_allowed);
> - mpol_rebind_task(tsk, newmems);
> tsk->mems_allowed = *newmems;
> + mpol_rebind_task(tsk, newmems);
> }
>
> /*
> @@ -970,6 +973,7 @@ static void cpuset_change_nodemask(struct task_struct *p,
> struct cpuset *cs;
> int migrate;
> const nodemask_t *oldmem = scan->data;
> + unsigned long flags;
> NODEMASK_ALLOC(nodemask_t, newmems, GFP_KERNEL);
>
> if (!newmems)
> @@ -978,9 +982,9 @@ static void cpuset_change_nodemask(struct task_struct *p,
> cs = cgroup_cs(scan->cg);
> guarantee_online_mems(cs, newmems);
>
> - task_lock(p);
> + mems_slowpath_lock_irqsave(p, flags);
> cpuset_change_task_nodemask(p, newmems);
> - task_unlock(p);
> + mems_slowpath_unlock_irqrestore(p, flags);
>
> NODEMASK_FREE(newmems);
>
> @@ -1375,6 +1379,7 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
> static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
> struct cpuset *cs)
> {
> + unsigned long flags;
> int err;
> /*
> * can_attach beforehand should guarantee that this doesn't fail.
> @@ -1383,9 +1388,10 @@ static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
> err = set_cpus_allowed_ptr(tsk, cpus_attach);
> WARN_ON_ONCE(err);
>
> - task_lock(tsk);
> + mems_slowpath_lock_irqsave(tsk, flags);
> cpuset_change_task_nodemask(tsk, to);
> - task_unlock(tsk);
> + mems_slowpath_unlock_irqrestore(tsk, flags);
> +
> cpuset_update_task_spread_flag(cs, tsk);
>
> }
> @@ -2233,7 +2239,15 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> */
> int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> {
> - return nodes_intersects(*nodemask, current->mems_allowed);
> + unsigned long flags, seq;
> + int retval;
> +
> + do {
> + seq = mems_fastpath_lock_irqsave(current, flags);
> + retval = nodes_intersects(*nodemask, current->mems_allowed);
> + } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> +
> + return retval;
> }
>
> /*
> @@ -2314,11 +2328,18 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
> {
> const struct cpuset *cs; /* current cpuset ancestors */
> int allowed; /* is allocation in zone z allowed? */
> + unsigned long flags, seq;
>
> if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> return 1;
> might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> - if (node_isset(node, current->mems_allowed))
> +
> + do {
> + seq = mems_fastpath_lock_irqsave(current, flags);
> + allowed = node_isset(node, current->mems_allowed);
> + } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> +
> + if (allowed)
> return 1;
> /*
> * Allow tasks that have access to memory reserves because they have
> @@ -2369,9 +2390,18 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
> */
> int __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
> {
> + int allowed;
> + unsigned long flags, seq;
> +
> if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> return 1;
> - if (node_isset(node, current->mems_allowed))
> +
> + do {
> + seq = mems_fastpath_lock_irqsave(current, flags);
> + allowed = node_isset(node, current->mems_allowed);
> + } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> +
> + if (allowed)
> return 1;
> /*
> * Allow tasks that have access to memory reserves because they have
> @@ -2438,11 +2468,16 @@ void cpuset_unlock(void)
> int cpuset_mem_spread_node(void)
> {
> int node;
> -
> - node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
> - if (node == MAX_NUMNODES)
> - node = first_node(current->mems_allowed);
> - current->cpuset_mem_spread_rotor = node;
> + unsigned long flags, seq;
> +
> + do {
> + seq = mems_fastpath_lock_irqsave(current, flags);
> + node = next_node(current->cpuset_mem_spread_rotor,
> + current->mems_allowed);
> + if (node == MAX_NUMNODES)
> + node = first_node(current->mems_allowed);
> + current->cpuset_mem_spread_rotor = node;
> + } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
> return node;
> }
> EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
> @@ -2458,10 +2493,26 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
> * to the other.
> **/
>
> -int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
> - const struct task_struct *tsk2)
> +int cpuset_mems_allowed_intersects(struct task_struct *tsk1,
> + struct task_struct *tsk2)
> {
> - return nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
> + unsigned long flags1, flags2;
> + int retval;
> + struct task_struct *tsk;
> +
> + if (tsk1 > tsk2) {
> + tsk = tsk1;
> + tsk1 = tsk2;
> + tsk2 = tsk;
> + }
> +
> + mems_slowpath_lock_irqsave(tsk1, flags1);
> + mems_slowpath_lock_irqsave(tsk2, flags2);
> + retval = nodes_intersects(tsk1->mems_allowed, tsk2->mems_allowed);
> + mems_slowpath_unlock_irqrestore(tsk2, flags2);
> + mems_slowpath_unlock_irqrestore(tsk1, flags1);
> +
> + return retval;
> }
>
> /**
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 7b012a0..cbf045d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -16,6 +16,7 @@
> #include <linux/key.h>
> #include <linux/security.h>
> #include <linux/cpu.h>
> +#include <linux/cpuset.h>
> #include <linux/acct.h>
> #include <linux/tsacct_kern.h>
> #include <linux/file.h>
> @@ -649,6 +650,7 @@ static void exit_mm(struct task_struct * tsk)
> {
> struct mm_struct *mm = tsk->mm;
> struct core_state *core_state;
> + unsigned long flags;
>
> mm_release(tsk, mm);
> if (!mm)
> @@ -694,8 +696,10 @@ static void exit_mm(struct task_struct * tsk)
> /* We don't want this task to be frozen prematurely */
> clear_freeze_flag(tsk);
> #ifdef CONFIG_NUMA
> + mems_slowpath_lock_irqsave(tsk, flags);
> mpol_put(tsk->mempolicy);
> tsk->mempolicy = NULL;
> + mems_slowpath_unlock_irqrestore(tsk, flags);
> #endif
> task_unlock(tsk);
> mm_update_next_owner(mm);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index fe73f8d..591346a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -32,6 +32,7 @@
> #include <linux/capability.h>
> #include <linux/cpu.h>
> #include <linux/cgroup.h>
> +#include <linux/cpuset.h>
> #include <linux/security.h>
> #include <linux/hugetlb.h>
> #include <linux/swap.h>
> @@ -1075,6 +1076,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> p->io_context = NULL;
> p->audit_context = NULL;
> cgroup_fork(p);
> +#if defined(CONFIG_CPUSETS) && MAX_NUMNODES > BITS_PER_LONG
> + seqlock_init(&p->mems_seqlock);
> +#endif
> #ifdef CONFIG_NUMA
> p->mempolicy = mpol_dup(p->mempolicy);
> if (IS_ERR(p->mempolicy)) {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3a5aeb3..b40cc52 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -465,6 +465,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> struct page *page = NULL;
> struct mempolicy *mpol;
> nodemask_t *nodemask;
> + nodemask_t tmp_mask;
> + unsigned long seq, irqflag;
> struct zonelist *zonelist = huge_zonelist(vma, address,
> htlb_alloc_mask, &mpol, &nodemask);
> struct zone *zone;
> @@ -483,6 +485,15 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
> return NULL;
>
> + if (mpol == current->mempolicy && nodemask) {
> + do {
> + seq = mems_fastpath_lock_irqsave(current, irqflag);
> + tmp_mask = *nodemask;
> + } while (mems_fastpath_unlock_irqrestore(current,
> + seq, irqflag));
> + nodemask = &tmp_mask;
> + }
> +
Maybe you can factor these into a macro or inline function; I saw
several similar places :-)
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> MAX_NR_ZONES - 1, nodemask) {
> nid = zone_to_nid(zone);
> @@ -1835,10 +1846,15 @@ __setup("default_hugepagesz=", hugetlb_default_setup);
> static unsigned int cpuset_mems_nr(unsigned int *array)
> {
> int node;
> - unsigned int nr = 0;
> + unsigned int nr;
> + unsigned long flags, seq;
>
> - for_each_node_mask(node, cpuset_current_mems_allowed)
> - nr += array[node];
> + do {
> + nr = 0;
> + seq = mems_fastpath_lock_irqsave(current, flags);
> + for_each_node_mask(node, cpuset_current_mems_allowed)
> + nr += array[node];
> + } while (mems_fastpath_unlock_irqrestore(current, seq, flags));
>
> return nr;
> }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index dd3f5c5..49abf11 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -187,8 +187,10 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> * parameter with respect to the policy mode and flags. But, we need to
> * handle an empty nodemask with MPOL_PREFERRED here.
> *
> - * Must be called holding task's alloc_lock to protect task's mems_allowed
> - * and mempolicy. May also be called holding the mmap_semaphore for write.
> + * Must be called using
> + * mems_slowpath_lock_irqsave()/mems_slowpath_unlock_irqrestore()
> + * to protect task's mems_allowed and mempolicy. May also be called holding
> + * the mmap_semaphore for write.
> */
> static int mpol_set_nodemask(struct mempolicy *pol,
> const nodemask_t *nodes, struct nodemask_scratch *nsc)
> @@ -344,9 +346,10 @@ static void mpol_rebind_policy(struct mempolicy *pol,
> * Wrapper for mpol_rebind_policy() that just requires task
> * pointer, and updates task mempolicy.
> *
> - * Called with task's alloc_lock held.
> + * Callers must use
> + * mems_slowpath_lock_irqsave()/mems_slowpath_unlock_irqrestore()
> + * to protect it.
> */
> -
> void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
> {
> mpol_rebind_policy(tsk->mempolicy, new);
> @@ -644,6 +647,7 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> struct mempolicy *new, *old;
> struct mm_struct *mm = current->mm;
> NODEMASK_SCRATCH(scratch);
> + unsigned long irqflags;
> int ret;
>
> if (!scratch)
> @@ -662,10 +666,10 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> */
> if (mm)
> down_write(&mm->mmap_sem);
> - task_lock(current);
> + mems_slowpath_lock_irqsave(current, irqflags);
> ret = mpol_set_nodemask(new, nodes, scratch);
> if (ret) {
> - task_unlock(current);
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> if (mm)
> up_write(&mm->mmap_sem);
> mpol_put(new);
> @@ -677,7 +681,7 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> if (new && new->mode == MPOL_INTERLEAVE &&
> nodes_weight(new->v.nodes))
> current->il_next = first_node(new->v.nodes);
> - task_unlock(current);
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> if (mm)
> up_write(&mm->mmap_sem);
>
> @@ -691,7 +695,9 @@ out:
> /*
> * Return nodemask for policy for get_mempolicy() query
> *
> - * Called with task's alloc_lock held
> + * Must be called using mems_slowpath_lock_irqsave()/
> + * mems_slowpath_unlock_irqrestore() to
> + * protect it.
> */
> static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> {
> @@ -736,6 +742,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> struct mempolicy *pol = current->mempolicy;
> + unsigned long irqflags;
>
> if (flags &
> ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
> @@ -745,9 +752,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
> if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
> return -EINVAL;
> *policy = 0; /* just so it's initialized */
> - task_lock(current);
> +
> + mems_slowpath_lock_irqsave(current, irqflags);
> *nmask = cpuset_current_mems_allowed;
> - task_unlock(current);
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> return 0;
> }
>
> @@ -803,13 +811,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>
> err = 0;
> if (nmask) {
> + mems_slowpath_lock_irqsave(current, irqflags);
> if (mpol_store_user_nodemask(pol)) {
> *nmask = pol->w.user_nodemask;
> } else {
> - task_lock(current);
> get_policy_nodemask(pol, nmask);
> - task_unlock(current);
> }
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> }
>
> out:
> @@ -1008,6 +1016,7 @@ static long do_mbind(unsigned long start, unsigned long len,
> struct mempolicy *new;
> unsigned long end;
> int err;
> + unsigned long irqflags;
> LIST_HEAD(pagelist);
>
> if (flags & ~(unsigned long)(MPOL_MF_STRICT |
> @@ -1055,9 +1064,9 @@ static long do_mbind(unsigned long start, unsigned long len,
> NODEMASK_SCRATCH(scratch);
> if (scratch) {
> down_write(&mm->mmap_sem);
> - task_lock(current);
> + mems_slowpath_lock_irqsave(current, irqflags);
> err = mpol_set_nodemask(new, nmask, scratch);
> - task_unlock(current);
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> if (err)
> up_write(&mm->mmap_sem);
> } else
> @@ -1408,8 +1417,10 @@ static struct mempolicy *get_vma_policy(struct task_struct *task,
> } else if (vma->vm_policy)
> pol = vma->vm_policy;
> }
> +
> if (!pol)
> pol = &default_policy;
> +
> return pol;
> }
>
> @@ -1475,7 +1486,7 @@ static unsigned interleave_nodes(struct mempolicy *policy)
> * next slab entry.
> * @policy must be protected by freeing by the caller. If @policy is
> * the current task's mempolicy, this protection is implicit, as only the
> - * task can change it's policy. The system default policy requires no
> + * task can free its policy. The system default policy requires no
> * such protection.
> */
> unsigned slab_node(struct mempolicy *policy)
> @@ -1574,16 +1585,33 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> nodemask_t **nodemask)
> {
> struct zonelist *zl;
> + struct mempolicy policy;
> + struct mempolicy *pol;
> + unsigned long seq, irqflag;
>
> *mpol = get_vma_policy(current, vma, addr);
> *nodemask = NULL; /* assume !MPOL_BIND */
>
> - if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
> - zl = node_zonelist(interleave_nid(*mpol, vma, addr,
> + pol = *mpol;
> + if (pol == current->mempolicy) {
> + /*
> + * get_vma_policy() doesn't return NULL, so we needn't worry
> + * whether pol is NULL or not.
> + */
> + do {
> + seq = mems_fastpath_lock_irqsave(current, irqflag);
> + policy = *pol;
> + } while (mems_fastpath_unlock_irqrestore(current,
> + seq, irqflag));
> + pol = &policy;
> + }
> +
> + if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
> + zl = node_zonelist(interleave_nid(pol, vma, addr,
> huge_page_shift(hstate_vma(vma))), gfp_flags);
> } else {
> - zl = policy_zonelist(gfp_flags, *mpol);
> - if ((*mpol)->mode == MPOL_BIND)
> + zl = policy_zonelist(gfp_flags, pol);
> + if (pol->mode == MPOL_BIND)
> *nodemask = &(*mpol)->v.nodes;
> }
> return zl;
> @@ -1609,11 +1637,14 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> {
> struct mempolicy *mempolicy;
> int nid;
> + unsigned long irqflags;
>
> if (!(mask && current->mempolicy))
> return false;
>
> + mems_slowpath_lock_irqsave(current, irqflags);
> mempolicy = current->mempolicy;
> +
> switch (mempolicy->mode) {
> case MPOL_PREFERRED:
> if (mempolicy->flags & MPOL_F_LOCAL)
> @@ -1633,6 +1664,8 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> BUG();
> }
>
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> +
> return true;
> }
> #endif
> @@ -1722,7 +1755,22 @@ struct page *
> alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct mempolicy policy;
> struct zonelist *zl;
> + struct page *page;
> + unsigned long seq, iflags;
> +
> + if (pol == current->mempolicy) {
> + /*
> + * get_vma_policy() doesn't return NULL, so we needn't worry
> + * whether pol is NULL or not.
> + */
> + do {
> + seq = mems_fastpath_lock_irqsave(current, iflags);
> + policy = *pol;
> + } while (mems_fastpath_unlock_irqrestore(current, seq, iflags));
> + pol = &policy;
> + }
>
> if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
> unsigned nid;
> @@ -1736,15 +1784,16 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> /*
> * slow path: ref counted shared policy
> */
> - struct page *page = __alloc_pages_nodemask(gfp, 0,
> - zl, policy_nodemask(gfp, pol));
> + page = __alloc_pages_nodemask(gfp, 0, zl,
> + policy_nodemask(gfp, pol));
> __mpol_put(pol);
> return page;
> }
> /*
> * fast path: default or task policy
> */
> - return __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
> + page = __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
> + return page;
> }
>
> /**
> @@ -1761,26 +1810,37 @@ alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> * Allocate a page from the kernel page pool. When not in
> * interrupt context and apply the current process NUMA policy.
> * Returns NULL when no page can be allocated.
> - *
> - * Don't call cpuset_update_task_memory_state() unless
> - * 1) it's ok to take cpuset_sem (can WAIT), and
> - * 2) allocating for current task (not interrupt).
> */
> struct page *alloc_pages_current(gfp_t gfp, unsigned order)
> {
> struct mempolicy *pol = current->mempolicy;
> + struct mempolicy policy;
> + struct page *page;
> + unsigned long seq, irqflags;
> +
>
> if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
> pol = &default_policy;
> -
> + else {
> + do {
> + seq = mems_fastpath_lock_irqsave(current, irqflags);
> + policy = *pol;
> + } while (mems_fastpath_unlock_irqrestore(current,
> + seq, irqflags));
> + pol = &policy;
> + }
> /*
> * No reference counting needed for current->mempolicy
> * nor system default_policy
> */
> if (pol->mode == MPOL_INTERLEAVE)
> - return alloc_page_interleave(gfp, order, interleave_nodes(pol));
> - return __alloc_pages_nodemask(gfp, order,
> - policy_zonelist(gfp, pol), policy_nodemask(gfp, pol));
> + page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> + else
> + page = __alloc_pages_nodemask(gfp, order,
> + policy_zonelist(gfp, pol),
> + policy_nodemask(gfp, pol));
> +
> + return page;
> }
> EXPORT_SYMBOL(alloc_pages_current);
>
> @@ -2026,6 +2086,7 @@ restart:
> */
> void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> {
> + unsigned long irqflags;
> int ret;
>
> sp->root = RB_ROOT; /* empty tree == default mempolicy */
> @@ -2043,9 +2104,9 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
> if (IS_ERR(new))
> goto put_free; /* no valid nodemask intersection */
>
> - task_lock(current);
> + mems_slowpath_lock_irqsave(current, irqflags);
> ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
> - task_unlock(current);
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> mpol_put(mpol); /* drop our ref on sb mpol */
> if (ret)
> goto put_free;
> @@ -2200,6 +2261,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
> nodemask_t nodes;
> char *nodelist = strchr(str, ':');
> char *flags = strchr(str, '=');
> + unsigned long irqflags;
> int err = 1;
>
> if (nodelist) {
> @@ -2291,9 +2353,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
> int ret;
> NODEMASK_SCRATCH(scratch);
> if (scratch) {
> - task_lock(current);
> + mems_slowpath_lock_irqsave(current, irqflags);
> ret = mpol_set_nodemask(new, &nodes, scratch);
> - task_unlock(current);
> + mems_slowpath_unlock_irqrestore(current, irqflags);
> } else
> ret = -ENOMEM;
> NODEMASK_SCRATCH_FREE(scratch);
> @@ -2487,8 +2549,10 @@ int show_numa_map(struct seq_file *m, void *v)
> struct file *file = vma->vm_file;
> struct mm_struct *mm = vma->vm_mm;
> struct mempolicy *pol;
> + struct mempolicy policy;
> int n;
> char buffer[50];
> + unsigned long iflags, seq;
>
> if (!mm)
> return 0;
> @@ -2498,6 +2562,18 @@ int show_numa_map(struct seq_file *m, void *v)
> return 0;
>
> pol = get_vma_policy(priv->task, vma, vma->vm_start);
> + if (pol == current->mempolicy) {
> + /*
> + * get_vma_policy() doesn't return NULL, so we needn't worry
> + * whether pol is NULL or not.
> + */
> + do {
> + seq = mems_fastpath_lock_irqsave(current, iflags);
> + policy = *pol;
> + } while (mems_fastpath_unlock_irqrestore(current, seq, iflags));
> + pol = &policy;
> + }
> +
> mpol_to_str(buffer, sizeof(buffer), pol, 0);
> mpol_cond_put(pol);
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 09f1572..b8f5acb 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3282,14 +3282,24 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
> static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
> {
> int nid_alloc, nid_here;
> + unsigned long lflags, seq;
> + struct mempolicy mpol;
>
> if (in_interrupt() || (flags & __GFP_THISNODE))
> return NULL;
> +
> nid_alloc = nid_here = numa_node_id();
> if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
> nid_alloc = cpuset_mem_spread_node();
> - else if (current->mempolicy)
> - nid_alloc = slab_node(current->mempolicy);
> + else if (current->mempolicy) {
> + do {
> + seq = mems_fastpath_lock_irqsave(current, lflags);
> + mpol = *(current->mempolicy);
> + } while (mems_fastpath_unlock_irqrestore(current, seq, lflags));
> +
> + nid_alloc = slab_node(&mpol);
> + }
> +
> if (nid_alloc != nid_here)
> return ____cache_alloc_node(cachep, flags, nid_alloc);
> return NULL;
> @@ -3312,11 +3322,21 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
> enum zone_type high_zoneidx = gfp_zone(flags);
> void *obj = NULL;
> int nid;
> + unsigned long lflags, seq;
> + struct mempolicy mpol;
>
> if (flags & __GFP_THISNODE)
> return NULL;
>
> - zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> + if (current->mempolicy) {
> + do {
> + seq = mems_fastpath_lock_irqsave(current, lflags);
> + mpol = *(current->mempolicy);
> + } while (mems_fastpath_unlock_irqrestore(current, seq, lflags));
> + zonelist = node_zonelist(slab_node(&mpol), flags);
> + } else
> + zonelist = node_zonelist(slab_node(NULL), flags);
> +
> local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>
> retry:
> diff --git a/mm/slub.c b/mm/slub.c
> index b364844..436c521 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1345,6 +1345,8 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
> struct zone *zone;
> enum zone_type high_zoneidx = gfp_zone(flags);
> struct page *page;
> + unsigned long lflags, seq;
> + struct mempolicy mpol;
>
> /*
> * The defrag ratio allows a configuration of the tradeoffs between
> @@ -1368,7 +1370,15 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
> get_cycles() % 1024 > s->remote_node_defrag_ratio)
> return NULL;
>
> - zonelist = node_zonelist(slab_node(current->mempolicy), flags);
> + if (current->mempolicy) {
> + do {
> + seq = mems_fastpath_lock_irqsave(current, lflags);
> + mpol = *(current->mempolicy);
> + } while (mems_fastpath_unlock_irqrestore(current, seq, lflags));
> + zonelist = node_zonelist(slab_node(&mpol), flags);
> + } else
> + zonelist = node_zonelist(slab_node(NULL), flags);
> +
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> struct kmem_cache_node *n;
>
> --
> 1.6.5.2
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@...ck.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@...ck.org
>
--
Regards,
-Bob Liu
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/