Message-ID: <xr93k4rxx6sd.fsf@ninji.mtv.corp.google.com>
Date: Fri, 23 Apr 2010 13:17:38 -0700
From: Greg Thelen <gthelen@...gle.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Daisuke Nishimura <nishimura@....nes.nec.co.jp>,
Vivek Goyal <vgoyal@...hat.com>, balbir@...ux.vnet.ibm.com,
Andrea Righi <arighi@...eler.com>,
Peter Zijlstra <peterz@...radead.org>,
Trond Myklebust <trond.myklebust@....uio.no>,
Suleiman Souhlal <suleiman@...gle.com>,
"Kirill A. Shutemov" <kirill@...temov.name>,
Andrew Morton <akpm@...ux-foundation.org>,
containers@...ts.linux-foundation.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock
On Wed, Apr 14, 2010 at 11:54 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
> On Thu, 15 Apr 2010 15:21:04 +0900
> Daisuke Nishimura <nishimura@....nes.nec.co.jp> wrote:
>> > The only reason to use trylock in this case is to prevent deadlock
>> > when running in a context that may have preempted or interrupted a
>> > routine that already holds the bit locked. In the
>> > __remove_from_page_cache() irqs are disabled, but that does not imply
>> > that a routine holding the spinlock has been preempted. When the bit
>> > is locked, preemption is disabled. The only way to interrupt a holder
>> > of the bit is for an interrupt to occur (I/O, timer, etc.). So I think
>> > that in_interrupt() is sufficient. Am I missing something?
>> >
>> IIUC, it would be enough to prevent deadlock where one CPU tries to acquire
>> the same page cgroup lock. But there is still some possibility where 2 CPUs
>> can deadlock each other (please see commit e767e056).
>> IOW, my point is "don't call lock_page_cgroup() under mapping->tree_lock".
>>
> Hmm, maybe worth trying. We may be able to set/clear the DIRTY/WRITEBACK bits
> on page_cgroup without mapping->tree_lock.
> In such a case, of course, the page itself should be locked by lock_page().
>
> But, hmm... for example.
>
> account_page_dirtied() is the best place to mark page_cgroup dirty. But
> it's called under mapping->tree_lock.
>
> Another thought:
> I wonder whether we may have to change our approach to dirty page accounting.
>
> Please see task_dirty_inc(). It's for per-task dirty limiting.
> And you'll soon notice that there is no task_dirty_dec().
Hello Kame-san,
This is an interesting idea. If this applies to memcg dirty accounting,
then would it also apply to system-wide dirty accounting? I don't think
so, but I wanted to float the idea. It looks like this proportions.c
code is good at comparing the rates of events (for example: per-task
dirty page events). However, in the case of system-wide dirty
accounting we also want to consider the amount of dirty memory, not just
the rate at which it is being dirtied.
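For concreteness, the per-task mechanism looks roughly like the sketch
below (paraphrased from memory of mm/page-writeback.c and
lib/proportions.c in this kernel generation, so treat the exact
signatures as approximate rather than authoritative). The key property
is that there is only an increment path: each task's local counter is
compared against a decaying global period, so old events age out
instead of ever being decremented.

/*
 * Rough sketch of the existing per-task proportions usage; the names
 * are real but reproduced from memory, not copied verbatim.
 */
#include <linux/proportions.h>
#include <linux/sched.h>

static struct prop_descriptor vm_dirties;	/* global decaying event period */

static inline void task_dirty_inc(struct task_struct *tsk)
{
	/* only an increment; aging of the period replaces any decrement */
	prop_inc_single(&vm_dirties, &tsk->dirties);
}

static void task_dirty_limit(struct task_struct *tsk, unsigned long *pdirty)
{
	long numerator, denominator;

	/* this task's share of recent dirtying events, as a fraction */
	prop_fraction_single(&vm_dirties, &tsk->dirties,
			     &numerator, &denominator);
	/* ... scale *pdirty by numerator/denominator ... */
}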
Cgroup dirty page accounting imposes the following additional accounting
complexities:
* hierarchical accounting
* page migration between cgroups
For per-memcg dirty accounting, are you thinking that each mem_cgroup
would have a struct prop_local_single to represent a memcg's dirty
memory usage relative to a system wide prop_descriptor?
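To make the question concrete, a hypothetical arrangement might look
like the sketch below. None of the memcg-side names exist today
(memcg_dirties, mem_cgroup_dirty_inc(), mem_cgroup_dirty_fraction(),
the dirties member); they only illustrate the prop_local_single idea:

/*
 * Hypothetical sketch: a system-wide prop_descriptor plus one
 * prop_local_single per mem_cgroup.  None of these memcg names exist
 * in current code.
 */
static struct prop_descriptor memcg_dirties;	/* system-wide dirtying period */

struct mem_cgroup {
	/* ... existing fields ... */
	struct prop_local_single dirties;	/* this memcg's recent dirtyings */
};

static void mem_cgroup_dirty_inc(struct mem_cgroup *mem)
{
	prop_inc_single(&memcg_dirties, &mem->dirties);
}

/* fraction of recent dirtying events attributable to this memcg */
static void mem_cgroup_dirty_fraction(struct mem_cgroup *mem,
				      long *numerator, long *denominator)
{
	prop_fraction_single(&memcg_dirties, &mem->dirties,
			     numerator, denominator);
}

Even with something like this, a rate-based fraction would not by
itself capture the absolute amount of dirty memory, nor the hierarchy
and migration issues listed above.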
My concern is that we will still need an efficient way to determine the
mem_cgroup associated with a page under a variety of conditions (page
fault path for new mappings, softirq for dirty page writeback).
Currently -rc4 and -mmotm use a non-irq safe lock_page_cgroup() to
protect a page's cgroup membership. I think this will cause problems as
we add more per-cgroup stats (dirty page counts, etc) that are adjusted
in irq handlers. Proposed approaches include:
1. use try-style locking. This can lead to fuzzy counters, which some
do not like. Over time these fuzzy counters may drift.
2. mask irq when calling lock_page_cgroup(). This has some performance
cost, though it may be small (see below). A sketch of such a wrapper
follows this list.
3. because a page's cgroup membership rarely changes, use RCU locking.
This is fast, but may be more complex than we want.
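For reference, the wrapper mentioned in option 2 could be as small as
the sketch below (the helper names are hypothetical; the measured
irqsave variant in [6] simply wraps the existing calls with
local_irq_save/restore()):

/*
 * Hypothetical helpers illustrating option 2: make the page_cgroup
 * bit-spinlock irq-safe by masking interrupts around the existing
 * lock_page_cgroup()/unlock_page_cgroup() primitives.
 */
static inline void lock_page_cgroup_irqsave(struct page_cgroup *pc,
					    unsigned long *flags)
{
	local_irq_save(*flags);		/* keep irq handlers off this cpu */
	lock_page_cgroup(pc);		/* existing bit_spin_lock on pc->flags */
}

static inline void unlock_page_cgroup_irqrestore(struct page_cgroup *pc,
						 unsigned long flags)
{
	unlock_page_cgroup(pc);
	local_irq_restore(flags);
}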
The performance of simple irqsave locking or more advanced RCU locking
is similar to current locking (non-irqsave/non-rcu) for several
workloads (kernel build, dd). Using a micro-benchmark, some differences
are seen:
* irqsave is 1% slower than mmotm non-irqsave/non-rcu locking.
* RCU locking is 4% faster than mmotm non-irqsave/non-rcu locking.
* RCU locking is 5% faster than irqsave locking.
I think we need some changes to allow per-memcg dirty page accounting
updates from irq handlers. If we want to focus on micro-benchmark
performance, then RCU locking seems like the correct approach. Otherwise,
irqsave locking seems adequate. I'm thinking that for now we should start
simple and use irqsave. Comments?
Here's the data I collected...
config              kernel_build[1]               dd[2]                                                read-fault[3]
=====================================================================================================================
2.6.34-rc4          4:18.64, 4:56.06(+-0.190%)    0.276(+-1.298%), 0.532(+-0.808%), 2.659(+-0.869%)    3753.6(+-0.105%)
MEMCG=n

2.6.34-rc4          4:19.60, 4:58.29(+-0.184%)    0.288(+-0.663%), 0.599(+-1.857%), 2.841(+-1.020%)    4172.3(+-0.074%)
MEMCG=y
root cgroup

2.6.34-rc4          5:02.41, 4:58.56(+-0.116%)    0.288(+-0.978%), 0.571(+-1.005%), 2.898(+-1.052%)    4162.8(+-0.317%)
MEMCG=y
non-root cgroup

2.6.34-rc4          4:21.02, 4:57.27(+-0.152%)    0.289(+-0.809%), 0.574(+-1.013%), 2.856(+-0.909%)    4159.0(+-0.280%)
MEMCG=y
mmotm
root cgroup

2.6.34-rc4          5:01.13, 4:56.84(+-0.074%)    0.299(+-1.512%), 0.577(+-1.158%), 2.864(+-1.012%)    4202.3(+-0.149%)
MEMCG=y
mmotm
non-root cgroup

2.6.34-rc4          4:19.44, 4:57.30(+-0.151%)    0.293(+-0.885%), 0.578(+-0.967%), 2.878(+-1.026%)    4219.1(+-0.007%)
MEMCG=y
mmotm
irqsave locking
root cgroup

2.6.34-rc4          5:01.07, 4:58.62(+-0.796%)    0.305(+-1.752%), 0.579(+-1.035%), 2.893(+-1.111%)    4254.3(+-0.095%)
MEMCG=y
mmotm
irqsave locking
non-root cgroup

2.6.34-rc4          4:19.53, 4:58.74(+-0.840%)    0.291(+-0.394%), 0.577(+-1.219%), 2.868(+-1.033%)    4004.4(+-0.059%)
MEMCG=y
mmotm
RCU locking
root cgroup

2.6.34-rc4          5:00.99, 4:57.04(+-0.069%)    0.289(+-1.027%), 0.575(+-1.069%), 2.858(+-1.102%)    4004.0(+-0.096%)
MEMCG=y
mmotm
RCU locking
non-root cgroup
[1] kernel build is listed as two numbers: the first build is cache cold;
the second is the average of three non-first builds (with warm cache).
src and output are in 2G tmpfs.
[2] dd creates 10x files in tmpfs of various sizes (100M,200M,1000M) using:
"dd if=/dev/zero bs=$((1<<20)) count=..."
[3] micro benchmark measures cycles (rdtsc) per read fault of mmap-ed
file warm in the page cache.
[4] MEMCG= is an abbreviation for CONFIG_CGROUP_MEM_RES_CTLR=
[5] mmotm is dated 2010-04-15-14-42
[6] irqsave locking converts all [un]lock_page_cgroup() to use
local_irq_save/restore().
(local commit a7f01d96417b10058a2128751fe4062e8a3ecc53). This was
previously proposed on linux-kernel and linux-mm.
[7] RCU locking patch is shown below.
(local commit 231a4fec6ccdef9e630e184c0e0527c884eac57d)
For reference, here's the RCU locking patch for 2010-04-15-14-42 mmotm,
which patches 2.6.34-rc4.
Use RCU to avoid lock_page_cgroup() in most situations.
When locking, disable irq to allow for accounting from irq handlers.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee3b52f..cd46474 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -280,6 +280,12 @@ static bool move_file(void)
}
/*
+ * If accounting changes are underway, then access to the mem_cgroup field
+ * within struct page_cgroup requires locking.
+ */
+static bool mem_cgroup_account_move_ongoing;
+
+/*
* Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
* limit reclaim to prevent infinite loops, if they ever occur.
*/
@@ -1436,12 +1442,25 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
+ bool locked = false;
+ unsigned long flags = 0;
pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
- lock_page_cgroup(pc);
+ /*
+ * Unless a page's cgroup reassignment is possible, avoid grabbing
+ * the lock used to protect the cgroup assignment.
+ */
+ rcu_read_lock();
+ smp_rmb();
+ if (unlikely(mem_cgroup_account_move_ongoing)) {
+ local_irq_save(flags);
+ lock_page_cgroup(pc);
+ locked = true;
+ }
+
mem = pc->mem_cgroup;
if (!mem || !PageCgroupUsed(pc))
goto done;
@@ -1449,6 +1468,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
/*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
+ VM_BUG_ON(preemptible());
if (val > 0) {
__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
SetPageCgroupFileMapped(pc);
@@ -1458,7 +1478,11 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
}
done:
- unlock_page_cgroup(pc);
+ if (unlikely(locked)) {
+ unlock_page_cgroup(pc);
+ local_irq_restore(flags);
+ }
+ rcu_read_unlock();
}
/*
@@ -2498,6 +2522,28 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
#endif
/*
+ * Reassignment of mem_cgroup is possible, so locking is required. Make sure
+ * that locks are used when accessing mem_cgroup.
+ * mem_cgroup_end_page_cgroup_reassignment() balances this function.
+ */
+static void mem_cgroup_begin_page_cgroup_reassignment(void)
+{
+ VM_BUG_ON(mem_cgroup_account_move_ongoing);
+ mem_cgroup_account_move_ongoing = true;
+ synchronize_rcu();
+}
+
+/*
+ * Once page cgroup membership changes complete, this routine indicates that
+ * access to mem_cgroup does not require locks.
+ */
+static void mem_cgroup_end_page_cgroup_reassignment(void)
+{
+ VM_BUG_ON(!mem_cgroup_account_move_ongoing);
+ mem_cgroup_account_move_ongoing = false;
+}
+
+/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
*/
@@ -2524,6 +2570,10 @@ int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
css_put(&mem->css);
}
*ptr = mem;
+
+ if (!ret)
+ mem_cgroup_begin_page_cgroup_reassignment();
+
return ret;
}
@@ -2536,7 +2586,8 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
enum charge_type ctype;
if (!mem)
- return;
+ goto unlock;
+
cgroup_exclude_rmdir(&mem->css);
/* at migration success, oldpage->mapping is NULL. */
if (oldpage->mapping) {
@@ -2583,6 +2634,9 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
* In that case, we need to call pre_destroy() again. check it here.
*/
cgroup_release_and_wakeup_rmdir(&mem->css);
+
+unlock:
+ mem_cgroup_end_page_cgroup_reassignment();
}
/*
@@ -4406,6 +4460,8 @@ static void mem_cgroup_clear_mc(void)
mc.to = NULL;
mc.moving_task = NULL;
wake_up_all(&mc.waitq);
+
+ mem_cgroup_end_page_cgroup_reassignment();
}
static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
@@ -4440,6 +4496,8 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
mc.moved_swap = 0;
mc.moving_task = current;
+ mem_cgroup_begin_page_cgroup_reassignment();
+
ret = mem_cgroup_precharge_mc(mm);
if (ret)
mem_cgroup_clear_mc();
--
Greg