linux-kernel - Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <50EA7860.6030300@jp.fujitsu.com>
Date:	Mon, 07 Jan 2013 16:25:20 +0900
From:	Kamezawa Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	Sha Zhengju <handai.szj@...il.com>
CC:	Michal Hocko <mhocko@...e.cz>, linux-kernel@...r.kernel.org,
	cgroups@...r.kernel.org, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org, akpm@...ux-foundation.org,
	gthelen@...gle.com, fengguang.wu@...el.com, glommer@...allels.com,
	dchinner@...hat.com, Sha Zhengju <handai.szj@...bao.com>
Subject: Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting

(2013/01/05 13:48), Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@...e.cz> wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>> From: Sha Zhengju <handai.szj@...bao.com>
>>>
>>> This patch adds memcg routines to count dirty pages, which allows memory controller
>>> to maintain an accurate view of the amount of its dirty memory and can provide some
>>> info for users while cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>
> Yes, I meant memcg hard/soft reclaim here which is triggered directly
> by allocation and is distinct from background kswapd reclaim (global).
>
>> why the usefulnes should be limitted to the reclaim for users. I would
>> understand this if the users was in fact in-kernel users.
>>
>
> One of the reasons I'm trying to accounting the dirty pages is to get a
> more board overall view of memory usages because memcg hard/soft
> reclaim may have effect on response time of user application.
> Yeah, the beneficiary can be application administrator or kernel users.  :P
>
>> [...]
>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>>> a prepare PageDirty() checking is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>    mem_cgroup_begin_update_page_stat             <<< 1
>>    set_page_dirty
>>      __set_page_dirty_buffers
>>        __set_page_dirty
>>          mem_cgroup_begin_update_page_stat       <<< 2
>>            move_lock_mem_cgroup
>>              spin_lock_irqsave(&memcg->move_lock, *flags);
>>
>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>> because we might race with the moving charges:
>>          CPU0                                            CPU1
>> page_remove_rmap
>>                                                  mem_cgroup_can_attach
>>    mem_cgroup_begin_update_page_stat (1)
>>      rcu_read_lock
>>                                                    mem_cgroup_start_move
>>                                                      atomic_inc(&memcg_moving)
>>                                                      atomic_inc(&memcg->moving_account)
>>                                                      synchronize_rcu
>>      __mem_cgroup_begin_update_page_stat
>>        mem_cgroup_stolen <<< TRUE
>>        move_lock_mem_cgroup
>>    [...]
>>          mem_cgroup_begin_update_page_stat (2)
>>            __mem_cgroup_begin_update_page_stat
>>              mem_cgroup_stolen     <<< still TRUE
>>              move_lock_mem_cgroup  <<< DEADLOCK
>>    [...]
>>    mem_cgroup_end_update_page_stat
>>      rcu_unlock
>>                                                    # wake up from synchronize_rcu
>>                                                  [...]
>>                                                  mem_cgroup_move_task
>>                                                    mem_cgroup_move_charge
>>                                                      walk_page_range
>>                                                        mem_cgroup_move_account
>>                                                          move_lock_mem_cgroup
>>
>>
>> Maybe I have missed some other locking which would prevent this from
>> happening but the locking relations are really complicated in this area
>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>> recursively then we need a fat comment which justifies that.
>>
>
> Ohhh...good catching!  I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
> me a lot recently as the lock granularity is a little bigger than I thought.
> Not only the resource but also some code logic is in the range of locking
> which may be deadlock prone. The problem still exists if we are trying to
> add stat account of other memcg page later, may I make bold to suggest
> that we dig into the lock again...
>
> But with regard to the current lock implementation, I doubt if we can we can
> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
> try to get move_lock once in the beginning. IMHO we can make
> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
> thinking now is changing memcg->move_lock to rw-spinlock from the
> original spinlock:
> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which make it
> reenterable and memcg moving task side try to get the write spinlock.
> Then the race may be following:
>
>          CPU0                                            CPU1
> page_remove_rmap
>                                                  mem_cgroup_can_attach
>    mem_cgroup_begin_update_page_stat (1)
>      rcu_read_lock
>                                                    mem_cgroup_start_move
>                                                      atomic_inc(&memcg_moving)
>
> atomic_inc(&memcg->moving_account)
>                                                      synchronize_rcu
>      __mem_cgroup_begin_update_page_stat
>        mem_cgroup_stolen   <<< TRUE
>        move_lock_mem_cgroup   <<<< read-spinlock success
>    [...]
>       mem_cgroup_begin_update_page_stat (2)
>            __mem_cgroup_begin_update_page_stat
>              mem_cgroup_stolen     <<< still TRUE
>              move_lock_mem_cgroup  <<<< read-spinlock success
>
>    [...]
>    mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>      rcu_unlock
>                                                    # wake up from synchronize_rcu
>                                                  [...]
>                                                  mem_cgroup_move_task
>                                                    mem_cgroup_move_charge
>                                                      walk_page_range
>                                                        mem_cgroup_move_account
>
> move_lock_mem_cgroup    <<< write-spinlock
>
>
> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
> Is there anything I lost?
>

rwlock will work with the nest but it seems ugly do updates under read-lock.

How about this straightforward ?
==
/*
  * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
  * the memcg again for nesting calls
  */
static void move_lock_mem_cgroup(memcg, flags);
{
	current->memcg_move_lock_nested += 1;
	if (current->memcg_move_lock_nested > 1) {
		VM_BUG_ON(current->move_locked_memcg != memcg);
		return;
	}
	spin_lock_irqsave(&memcg_move_lock, &flags);
	current->move_lockdev_memcg = memcg;
}

static void move_unlock_mem_cgroup(memcg, flags)
{
	current->memcg_move_lock_nested -= 1;
	if (!current->memcg_move_lock_nested) {
		current->move_locked_memcg = NULL;
		spin_unlock_irqrestore(&memcg_move_lock,flags);
	}
}

==
But, hmm, this kind of (ugly) hack may cause trouble as Hugh said.

Thanks,
-Kame







--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/