Message-ID: <47ECE662.3060506@linux.vnet.ibm.com>
Date: Fri, 28 Mar 2008 18:06:50 +0530
From: Balbir Singh <balbir@...ux.vnet.ibm.com>
To: Paul Menage <menage@...gle.com>
CC: Pavel Emelianov <xemul@...nvz.org>,
Hugh Dickins <hugh@...itas.com>,
Sudhir Kumar <skumar@...ux.vnet.ibm.com>,
YAMAMOTO Takashi <yamamoto@...inux.co.jp>, lizf@...fujitsu.com,
linux-kernel@...r.kernel.org, taka@...inux.co.jp,
linux-mm@...ck.org, David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Subject: Re: [-mm] Add an owner to the mm_struct (v2)
Paul Menage wrote:
> On Fri, Mar 28, 2008 at 1:23 AM, Balbir Singh <balbir@...ux.vnet.ibm.com> wrote:
>> diff -puN include/linux/mm_types.h~memory-controller-add-mm-owner include/linux/mm_types.h
>> --- linux-2.6.25-rc5/include/linux/mm_types.h~memory-controller-add-mm-owner 2008-03-28 09:30:47.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/include/linux/mm_types.h 2008-03-28 12:26:59.000000000 +0530
>> @@ -227,8 +227,10 @@ struct mm_struct {
>> /* aio bits */
>> rwlock_t ioctx_list_lock;
>> struct kioctx *ioctx_list;
>> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> - struct mem_cgroup *mem_cgroup;
>> +#ifdef CONFIG_MM_OWNER
>> + spinlock_t owner_lock;
>> + struct task_struct *owner; /* The thread group leader that */
>> + /* owns the mm_struct. */
>> #endif
>
> I'm not convinced that we need the spinlock. Just use the simple rule
> that you can only modify mm->owner if:
>
> - mm->owner points to current
> - the new owner is a user of mm
This will always hold; otherwise it cannot be the new owner :)
> - you hold task_lock() for the new owner (which is necessary anyway to
> ensure that the new owner's mm doesn't change while you're updating
> mm->owner)
>
tsk->mm should not change unless the task is exiting or a kernel thread
does use_mm() (PF_BORROWED_MM).
I see mm->owner changing when
1. The mm->owner exits
2. clone() is called with CLONE_VM (at fork time)
Maybe your rules will work, but I have yet to try that out.
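If I read your rules correctly, the handoff would look something like this
(a rough sketch; mm_set_owner() is a made-up name):

	static void mm_set_owner(struct mm_struct *mm, struct task_struct *new)
	{
		BUG_ON(mm->owner != current);	/* only the owner hands off */
		task_lock(new);			/* pins new->mm */
		if (new->mm == mm)		/* new owner must be a user of mm */
			rcu_assign_pointer(mm->owner, new);
		task_unlock(new);
	}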
>> #ifdef CONFIG_PROC_FS
>> diff -puN kernel/fork.c~memory-controller-add-mm-owner kernel/fork.c
>> --- linux-2.6.25-rc5/kernel/fork.c~memory-controller-add-mm-owner 2008-03-28 09:30:47.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/kernel/fork.c 2008-03-28 12:33:12.000000000 +0530
>> @@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct
>> mm->free_area_cache = TASK_UNMAPPED_BASE;
>> mm->cached_hole_size = ~0UL;
>> mm_init_cgroup(mm, p);
>> + mm_init_owner(mm, p);
>>
>> if (likely(!mm_alloc_pgd(mm))) {
>> mm->def_flags = 0;
>> @@ -995,6 +996,27 @@ static void rt_mutex_init_task(struct ta
>> #endif
>> }
>>
>> +#ifdef CONFIG_MM_OWNER
>> +void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>> +{
>> + spin_lock_init(&mm->owner_lock);
>> + mm->owner = p;
>> +}
>> +
>> +void mm_fork_init_owner(struct task_struct *p)
>> +{
>> + struct mm_struct *mm = get_task_mm(p);
>
> Do we need this? p->mm can't go away if we're in the middle of forking it.
>
There are other cases where I preferred to use it, specifically the
PF_BORROWED_MM case. What if a kernel thread does use_mm() and then fork()?
It's very unlikely, though, so I'll remove it and use p->mm directly.
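For reference, get_task_mm() already refuses to hand back a borrowed mm,
which is why I reached for it. It looks roughly like this in kernel/fork.c:

	struct mm_struct *get_task_mm(struct task_struct *task)
	{
		struct mm_struct *mm;

		task_lock(task);
		mm = task->mm;
		if (mm) {
			/* a kernel thread borrowing an mm is not a user of it */
			if (task->flags & PF_BORROWED_MM)
				mm = NULL;
			else
				atomic_inc(&mm->mm_users);
		}
		task_unlock(task);
		return mm;
	}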
>> + if (!mm)
>> + return;
>> +
>> + spin_lock(&mm->owner);
>
> I suspect that you meant this to be spin_lock(&mm->owner_lock).
>
Yes, that's a typo. A bad one :)
>> + if (mm->owner != p)
>> + rcu_assign_pointer(mm->owner, p->group_leader);
>> + spin_unlock(&mm->owner);
>> + mmput(mm);
>> +}
>> +#endif /* CONFIG_MM_OWNER */
>> +
>> /*
>> * This creates a new process as a copy of the old one,
>> * but does not actually start it yet.
>> @@ -1357,6 +1379,10 @@ static struct task_struct *copy_process(
>> write_unlock_irq(&tasklist_lock);
>> proc_fork_connector(p);
>> cgroup_post_fork(p);
>> +
>> + if (!(clone_flags & CLONE_VM) && (p != p->group_leader))
>> + mm_fork_init_owner(p);
>> +
>
> I'm not sure I understand what this is doing.
>
> I read it as "if p has its own mm and p is a child thread, set
> p->mm->owner to p->group_leader". But by definition if p has its own
> mm, then p->group_leader->mm will be different to p->mm, therefore
> we'd end up with mm->owner->mm != mm, which seems very bad.
>
> What's the intention of this bit of code?
>
The intention is to handle the case where clone() is called without CLONE_VM
but with CLONE_THREAD. That means p can have its own mm while sharing a
group_leader.
>> return p;
>>
>> bad_fork_free_pid:
>> diff -puN include/linux/memcontrol.h~memory-controller-add-mm-owner include/linux/memcontrol.h
>> --- linux-2.6.25-rc5/include/linux/memcontrol.h~memory-controller-add-mm-owner 2008-03-28 09:30:47.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/include/linux/memcontrol.h 2008-03-28 09:30:47.000000000 +0530
>> @@ -29,6 +29,7 @@ struct mm_struct;
>>
>> extern void mm_init_cgroup(struct mm_struct *mm, struct task_struct *p);
>> extern void mm_free_cgroup(struct mm_struct *mm);
>> +extern void mem_cgroup_fork_init(struct task_struct *p);
>>
>> #define page_reset_bad_cgroup(page) ((page)->page_cgroup = 0)
>>
>> @@ -49,7 +50,7 @@ extern void mem_cgroup_out_of_memory(str
>> int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
>>
>> #define mm_match_cgroup(mm, cgroup) \
>> - ((cgroup) == rcu_dereference((mm)->mem_cgroup))
>> + ((cgroup) == mem_cgroup_from_task((mm)->owner))
>>
>> extern int mem_cgroup_prepare_migration(struct page *page);
>> extern void mem_cgroup_end_migration(struct page *page);
>> @@ -72,6 +73,8 @@ extern long mem_cgroup_calc_reclaim_acti
>> extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
>> struct zone *zone, int priority);
>>
>> +extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>> +
>> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>> static inline void mm_init_cgroup(struct mm_struct *mm,
>> struct task_struct *p)
>> @@ -82,6 +85,10 @@ static inline void mm_free_cgroup(struct
>> {
>> }
>>
>> +static inline void mem_cgroup_fork_init(struct task_struct *p)
>> +{
>> +}
>> +
>
> Is this stale?
>
Yes, there is some stale code in mm/memcontrol.c and include/linux/memcontrol.h
(my bad)
>> static inline void page_reset_bad_cgroup(struct page *page)
>> {
>> }
>> @@ -172,6 +179,11 @@ static inline long mem_cgroup_calc_recla
>> {
>> return 0;
>> }
>> +
>> +static void mm_free_fork_cgroup(struct task_struct *p)
>> +{
>> +}
>> +
>
> And this?
>
Yes
>> #endif /* CONFIG_CGROUP_MEM_CONT */
>> -static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>> +struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>> {
>> return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
>> struct mem_cgroup, css);
>
> I think it would be better to make this static inline in the header
> file - it's just two indexed dereferences, so hardly worth the
> function call overhead.
>
Yes, will do
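i.e., something like this in include/linux/memcontrol.h (same body as the
existing function, just inlined):

	static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
	{
		return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
					struct mem_cgroup, css);
	}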
>> @@ -250,12 +250,17 @@ void mm_init_cgroup(struct mm_struct *mm
>>
>> mem = mem_cgroup_from_task(p);
>> css_get(&mem->css);
>> - mm->mem_cgroup = mem;
>> }
>>
>> void mm_free_cgroup(struct mm_struct *mm)
>> {
>> - css_put(&mm->mem_cgroup->css);
>> + struct mem_cgroup *mem;
>> +
>> + /*
>> + * TODO: Should we assign mm->owner to NULL here?
>
> No, controller code shouldn't be changing mm->owner.
>
> And surely we don't need mm_init_cgroup() and mm_free_cgroup() any longer?
>
We don't need it and the comment is stale :)
>> - rcu_read_lock();
>> - mem = rcu_dereference(mm->mem_cgroup);
>> + mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
>> /*
>> * For every charge from the cgroup, increment reference count
>> */
>> css_get(&mem->css);
>> - rcu_read_unlock();
>
> Why is it OK to take away the rcu_read_lock() here? We're still doing
> an rcu_dereference().
>
Nope, my bad -- stale code. Will fix
>> while (res_counter_charge(&mem->res, PAGE_SIZE)) {
>> if (!(gfp_mask & __GFP_WAIT))
>> @@ -990,8 +994,8 @@ mem_cgroup_create(struct cgroup_subsys *
>>
>> if (unlikely((cont->parent) == NULL)) {
>> mem = &init_mem_cgroup;
>> - init_mm.mem_cgroup = mem;
>> page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
>> + init_mm.owner = &init_task;
>
> This shouldn't be in here - it should be in the core code that sets up init_mm.
>
Yes, good point. Will do
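The core code can simply reuse the helper this patch adds, e.g. (a sketch;
the exact call site is debatable):

	/*
	 * Somewhere in core init, not in the memory controller;
	 * mm_init_owner() is a no-op when CONFIG_MM_OWNER is off.
	 */
	mm_init_owner(&init_mm, &init_task);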
>> +#ifdef CONFIG_MM_OWNER
>> +extern void mm_update_next_owner(struct mm_struct *mm, struct task_struct *p);
>> +extern void mm_fork_init_owner(struct task_struct *p);
>> +extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
>> +#else
>> +static inline void
>> +mm_update_next_owner(struct mm_struct *mm, struct task_struct *p)
>> +{
>> +}
>> +
>> +static inline void mm_fork_init_owner(struct task_struct *p)
>> +{
>> +}
>> +
>> +static inline void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>> +{
>> +}
>> +#endif /* CONFIG_MM_OWNER */
>> +
>> #endif /* __KERNEL__ */
>>
>> #endif
>> diff -puN kernel/exit.c~memory-controller-add-mm-owner kernel/exit.c
>> --- linux-2.6.25-rc5/kernel/exit.c~memory-controller-add-mm-owner 2008-03-28 09:30:47.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/kernel/exit.c 2008-03-28 12:35:39.000000000 +0530
>> @@ -579,6 +579,71 @@ void exit_fs(struct task_struct *tsk)
>>
>> EXPORT_SYMBOL_GPL(exit_fs);
>>
>> +#ifdef CONFIG_MM_OWNER
>> +/*
>> + * Task p is exiting and it owned mm, so let's find a new owner for it
>> + */
>> +static inline int
>> +mm_need_new_owner(struct mm_struct *mm, struct task_struct *p)
>> +{
>> + int ret;
>> +
>> + rcu_read_lock();
>> + ret = (mm && (rcu_dereference(mm->owner) == p) &&
>> + (atomic_read(&mm->mm_users) > 1));
>> + rcu_read_unlock();
>> + return ret;
>
> The only way that rcu_read_lock() helps here is if mm freeing is
> protected by RCU, which I don't think is the case.
>
rcu_read_lock() also ensures that we are not preempted while sampling
mm->owner and mm_users, so the check does not act on inconsistent values.
> But as long as p==current, there's no race, since no other process
> will re-point mm->owner at themselves, so mm can't go away anyway
> since we have a reference to it that we're going to be dropping soon.
>
mm cannot go away, but mm->owner can be different from current and could be
going away.
> Is there ever a case where we'd want to call this on anything other
> than current? It would simplify the code to just refer to current
> rather than tsk.
>
Not at the moment; yes, we can remove the second parameter.
>> +}
>> +
>> +void mm_update_next_owner(struct mm_struct *mm, struct task_struct *p)
>> +{
>> + struct task_struct *c, *g;
>> +
>> + /*
>> + * This should not be called for init_task
>> + */
>> + BUG_ON(p == p->parent);
>
> I'd be inclined to make this BUG_ON(p != current), or just have p as a
> local variable initialized from current. (If you're trying to save
> multiple calls to current on arches where it's not just a simple
> register).
>
OK
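So the function would start like this:

	void mm_update_next_owner(struct mm_struct *mm)
	{
		struct task_struct *p = current;	/* only called on current */
		struct task_struct *c, *g;

		if (!mm_need_new_owner(mm, p))
			return;
		/* ... search for a new owner as before ... */
	}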
>> +
>> + if (!mm_need_new_owner(mm, p))
>> + return;
>> +
>> + /*
>> + * Search in the children
>> + */
>> + list_for_each_entry(c, &p->children, sibling) {
>> + if (c->mm == p->mm)
>> + goto assign_new_owner;
>> + }
>
> We need to keep checking mm_need_new_owner() since it can become false
> if the only other user of the mm exits at the same time that we do.
> (In which case there's nothing to do).
>
I would rather deal with the case where mm->owner is NULL than keep
checking (even with constant checking we cannot guarantee that mm->owner
will not become NULL).
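i.e., make the readers tolerate a NULL owner; a sketch (the helper name is
made up):

	static struct mem_cgroup *mem_cgroup_from_mm(struct mm_struct *mm)
	{
		struct task_struct *owner;
		struct mem_cgroup *mem = NULL;

		rcu_read_lock();
		owner = rcu_dereference(mm->owner);
		if (owner)	/* owner may have exited with no successor */
			mem = mem_cgroup_from_task(owner);
		rcu_read_unlock();
		return mem;
	}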
>> + * Search through everything else. We should not get
>> + * here often
>> + */
>> + for_each_process(c) {
>> + g = c;
>> + do {
>> + if (c->mm && (c->mm == p->mm))
>> + goto assign_new_owner;
>> + } while ((c = next_thread(c)) != g);
>> + }
>
> Is there a reason to not code this as for_each_thread?
>
Is there a for_each_thread()?
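The closest I can find is the do_each_thread()/while_each_thread() pair;
with it, the second pass would read roughly:

	do_each_thread(g, c) {
		if (c->mm && c->mm == p->mm)
			goto assign_new_owner;
	} while_each_thread(g, c);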
>> +
>> + BUG();
>> +
>> +assign_new_owner:
>> + spin_lock(&mm->owner_lock);
>> + rcu_assign_pointer(mm->owner, c);
>> + spin_unlock(&mm->owner_lock);
>> +}
>
> This can break if c is also exiting and has passed the call to
> mm_update_next_owner() by the time we assign mm->owner. That's why my
> original suggested version had a function like:
>
Won't it be better to check for c->flags & PF_EXITING?
> static inline void try_give_mm_ownership(struct task_struct *task,
> struct mm_struct *mm) {
> if (task->mm != mm) return;
> task_lock(task);
> if (task->mm == mm) {
> mm->owner = task;
> }
> task_unlock(task);
> }
>
> i.e. determining that a task is a valid candidate and updating the
> owner pointer has to be done in the same critical section.
>
Let me try to revamp the locking rules and see where that leads. But I don't
like protecting an mm_struct member with a task_struct lock.
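Combining your critical section with the PF_EXITING check from above would
look roughly like this (the helper name is made up):

	static int try_assign_owner(struct mm_struct *mm, struct task_struct *c)
	{
		int assigned = 0;

		if (c->mm != mm)	/* unlocked peek to skip non-users */
			return 0;
		task_lock(c);
		if (c->mm == mm && !(c->flags & PF_EXITING)) {
			rcu_assign_pointer(mm->owner, c);
			assigned = 1;
		}
		task_unlock(c);
		return assigned;
	}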
> Also, looking forward to when we have the virtual AS limits
> controller, in the (unlikely?) event that the new owner is in a
> different virtual AS limit control group, this code will need to be
> able to handle shifting the mm->total_mm from the old AS cgroup to the
> new one. That's the "fiddly layer violation" that I mentioned earlier.
>
> It might be cleaner to be able to specify on a per-subsystem basis
> whether we require that all users of an mm be in the same cgroup.
>
I don't expect to see that kind of restriction.
>> config CGROUP_MEM_RES_CTLR
>> bool "Memory Resource Controller for Control Groups"
>> - depends on CGROUPS && RESOURCE_COUNTERS
>> + depends on CGROUPS && RESOURCE_COUNTERS && MM_OWNER
>
> Maybe this should select MM_OWNER rather than depending on it?
>
I thought of it, but wondered if the user should make an informed choice about
MM_OWNER and the overhead it brings along.
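For reference, the select variant would just be:

	config CGROUP_MEM_RES_CTLR
		bool "Memory Resource Controller for Control Groups"
		depends on CGROUPS && RESOURCE_COUNTERS
		select MM_OWNER

but then MM_OWNER (and its overhead) gets enabled silently.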
> Paul
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@...ck.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@...ck.org
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/