Message-ID: <alpine.LSU.2.00.1106171040460.7018@sister.anvils>
Date: Fri, 17 Jun 2011 10:41:20 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Andi Kleen <ak@...ux.intel.com>,
Shaohua Li <shaohua.li@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
David Miller <davem@...emloft.net>,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Russell King <rmk@....linux.org.uk>,
Paul Mundt <lethal@...ux-sh.org>,
Jeff Dike <jdike@...toit.com>,
Richard Weinberger <richard@....at>,
"Luck, Tony" <tony.luck@...el.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Mel Gorman <mel@....ul.ie>, Nick Piggin <npiggin@...nel.dk>,
Namhyung Kim <namhyung@...il.com>,
"Shi, Alex" <alex.shi@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"Rafael J. Wysocki" <rjw@...k.pl>
Subject: Re: REGRESSION: Performance regressions from switching anon_vma->lock
to mutex
On Fri, 17 Jun 2011, Linus Torvalds wrote:
> On Fri, Jun 17, 2011 at 4:28 AM, Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:
> >
> > Something like so? Compiles and runs the benchmark in question.
>
> Yup.
>
> Except I really think that test for a NULL anon_vma should go away.
>
> If an avc entry has a NULL anon_vma, something is seriously wrong. The
> comment about anon_vma_fork failure is definitely just bogus: the
> anon_vma is allocated before the avc entry, so there's no way an avc
> can have a NULL anon_vma from there.
>
> But yes, your patch is cleaner than the one I was playing around with
> (your "remove if not list empty" is prettier than what I was toying
> with - having a separate flag in the avc)
>
> Tim, can you test Peter's (second - the cleaned up one) patch on top
> of mine, and see if that helps things further?
>
> The only thing I don't love about the batching is that we now do hold
> the lock over some situations where we _could_ have allowed
> concurrency (notably some avc allocations), but I think it's a good
> trade-off. And walking the list twice at unlink_anon_vmas() should be
> basically free.
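
For anyone reading along without the patch in front of them, here is a
minimal user-space sketch of the batched root-lock pattern described above.
The struct root and struct chain_entry types, and the lock_root()/
unlock_root()/walk_chain() helpers, are simplified stand-ins of my own, not
the kernel's anon_vma structures or the actual helpers in the patch below;
the only point is holding a single root mutex across consecutive entries
that share it, rather than locking and unlocking per entry.

#include <pthread.h>
#include <stddef.h>

struct root {
	pthread_mutex_t mutex;
};

struct chain_entry {
	struct root *root;		/* root lock shared by related entries */
	struct chain_entry *next;	/* next entry on the per-vma-style chain */
};

/*
 * Take this entry's root lock, dropping the previously held root lock
 * only if the root actually changed (normally it should not).
 */
static struct root *lock_root(struct root *held, struct chain_entry *entry)
{
	struct root *new_root = entry->root;

	if (new_root != held) {
		if (held)
			pthread_mutex_unlock(&held->mutex);
		held = new_root;
		pthread_mutex_lock(&held->mutex);
	}
	return held;
}

static void unlock_root(struct root *held)
{
	if (held)
		pthread_mutex_unlock(&held->mutex);
}

/*
 * Walk the whole chain under one lock/unlock pair when every entry
 * shares the same root, instead of locking and unlocking per entry.
 */
static void walk_chain(struct chain_entry *head)
{
	struct root *held = NULL;
	struct chain_entry *entry;

	for (entry = head; entry; entry = entry->next) {
		held = lock_root(held, entry);
		/* ... work on entry while its root's mutex is held ... */
	}
	unlock_root(held);
}

int main(void)
{
	struct root r = { PTHREAD_MUTEX_INITIALIZER };
	struct chain_entry b = { &r, NULL };
	struct chain_entry a = { &r, &b };

	walk_chain(&a);		/* both entries share r: one lock/unlock in total */
	return 0;
}

With a single shared root, walk_chain() takes and releases the mutex exactly
once for the whole traversal, which is the point of the batching.
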
Applying load with those two patches in place (combined patch shown at
the bottom, in case you can tell me I misunderstood what to apply and
have got the wrong combination), lockdep very soon protested.
I've not given it _any_ thought, and won't be able to come back to
it for a couple of hours: chucked over the wall for your delectation.
Hugh
[ 65.981291] =================================
[ 65.981354] [ INFO: inconsistent lock state ]
[ 65.981393] 3.0.0-rc3 #2
[ 65.981418] ---------------------------------
[ 65.981456] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 65.981513] cp/1335 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 65.981556] (&anon_vma->mutex){+.+.?.}, at: [<781ba4b3>] page_lock_anon_vma+0xd6/0x130
[ 65.981644] {RECLAIM_FS-ON-W} state was registered at:
[ 65.981688] [<7817954f>] mark_held_locks+0x46/0x67
[ 65.981738] [<78179a1c>] lockdep_trace_alloc+0x7d/0x96
[ 65.981791] [<781c8572>] kmem_cache_alloc+0x21/0xe1
[ 65.981842] [<781baae9>] anon_vma_clone+0x38/0x124
[ 65.981892] [<781babf7>] anon_vma_fork+0x22/0xf3
[ 65.981940] [<7814f971>] dup_mmap+0x1b7/0x302
[ 65.981986] [<78150063>] dup_mm+0xa5/0x150
[ 65.982030] [<7815075c>] copy_process+0x62e/0xbeb
[ 65.982079] [<78150e1d>] do_fork+0xd5/0x1fc
[ 65.982123] [<7812ca7c>] sys_clone+0x1c/0x21
[ 65.982169] [<78500c31>] ptregs_clone+0x15/0x24
[ 65.982218] irq event stamp: 4625633
[ 65.982251] hardirqs last enabled at (4625633): [<784fe18a>] mutex_trylock+0xe7/0x118
[ 65.982323] hardirqs last disabled at (4625632): [<784fe0f0>] mutex_trylock+0x4d/0x118
[ 65.982394] softirqs last enabled at (4624962): [<781568a6>] __do_softirq+0xf5/0x104
[ 65.982467] softirqs last disabled at (4624835): [<78128007>] do_softirq+0x56/0xa7
[ 65.982537]
[ 65.982538] other info that might help us debug this:
[ 65.982595] Possible unsafe locking scenario:
[ 65.982596]
[ 65.982649] CPU0
[ 65.982672] ----
[ 65.982696] lock(&anon_vma->mutex);
[ 65.982738] <Interrupt>
[ 65.982762] lock(&anon_vma->mutex);
[ 65.982805]
[ 65.982806] *** DEADLOCK ***
[ 65.982807]
[ 65.982864] no locks held by cp/1335.
[ 65.982896]
[ 65.982897] stack backtrace:
[ 65.982939] Pid: 1335, comm: cp Not tainted 3.0.0-rc3 #2
[ 65.984010] Call Trace:
[ 65.984010] [<784fd0d6>] ? printk+0xf/0x11
[ 65.984010] [<78177ef2>] print_usage_bug+0x152/0x15f
[ 65.984010] [<78177fa0>] mark_lock_irq+0xa1/0x1e9
[ 65.984010] [<78176c8d>] ? print_irq_inversion_bug+0x16e/0x16e
[ 65.984010] [<781782f3>] mark_lock+0x20b/0x2d9
[ 65.984010] [<781784bf>] mark_irqflags+0xfe/0x115
[ 65.984010] [<781789cb>] __lock_acquire+0x4f5/0x6ba
[ 65.984010] [<78178f72>] lock_acquire+0x4a/0x60
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<784feaf4>] mutex_lock_nested+0x45/0x297
[ 65.984010] [<781ba4b3>] ? page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba4b3>] page_lock_anon_vma+0xd6/0x130
[ 65.984010] [<781ba48f>] ? page_lock_anon_vma+0xb2/0x130
[ 65.984010] [<781ba6c0>] page_referenced_anon+0x12/0x189
[ 65.984010] [<781ba8ba>] page_referenced+0x83/0xaf
[ 65.984010] [<781a8301>] shrink_active_list+0x186/0x240
[ 65.984010] [<781a8e1f>] shrink_zone+0x158/0x1ce
[ 65.984010] [<781a949b>] shrink_zones+0x94/0xe4
[ 65.984010] [<781a9545>] do_try_to_free_pages+0x5a/0x1db
[ 65.984010] [<781a301d>] ? get_page_from_freelist+0x2c4/0x2e1
[ 65.984010] [<781a9865>] try_to_free_pages+0x6c/0x73
[ 65.984010] [<781a3526>] __alloc_pages_nodemask+0x3aa/0x563
[ 65.984010] [<781a4d03>] __do_page_cache_readahead+0xee/0x1cd
[ 65.984010] [<781a4fa4>] ra_submit+0x19/0x1b
[ 65.984010] [<781a51b4>] ondemand_readahead+0x20e/0x219
[ 65.984010] [<781a525b>] page_cache_sync_readahead+0x3e/0x4b
[ 65.984010] [<7819e0ad>] do_generic_file_read.clone.0+0xd1/0x420
[ 65.984010] [<78178c14>] ? lock_release_non_nested+0x84/0x243
[ 65.984010] [<7819ef25>] generic_file_aio_read+0x1c0/0x1f4
[ 65.984010] [<78178ed7>] ? __lock_release+0x104/0x10f
[ 65.984010] [<781b16d0>] ? might_fault+0x45/0x84
[ 65.984010] [<781d3650>] do_sync_read+0x91/0xc5
[ 65.984010] [<781d71a8>] ? cp_new_stat64+0xd8/0xed
[ 65.984010] [<781d3cac>] vfs_read+0x8d/0xf5
[ 65.984010] [<781d3d50>] sys_read+0x3c/0x63
[ 65.984010] [<78500b50>] sysenter_do_call+0x12/0x36
From this combined patch applied to 3.0-rc3:
--- 3.0-rc3/mm/rmap.c 2011-05-29 18:42:37.465882779 -0700
+++ linux/mm/rmap.c 2011-06-17 10:19:10.592857382 -0700
@@ -200,6 +200,32 @@ int anon_vma_prepare(struct vm_area_stru
return -ENOMEM;
}
+/*
+ * This is a useful helper function for locking the anon_vma root as
+ * we traverse the vma->anon_vma_chain, looping over anon_vma's that
+ * have the same vma.
+ *
+ * Such anon_vma's should have the same root, so you'd expect to see
+ * just a single mutex_lock for the whole traversal.
+ */
+static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root, struct anon_vma *anon_vma)
+{
+ struct anon_vma *new_root = anon_vma->root;
+ if (new_root != root) {
+ if (WARN_ON_ONCE(root))
+ mutex_unlock(&root->mutex);
+ root = new_root;
+ mutex_lock(&root->mutex);
+ }
+ return root;
+}
+
+static inline void unlock_anon_vma_root(struct anon_vma *root)
+{
+ if (root)
+ mutex_unlock(&root->mutex);
+}
+
static void anon_vma_chain_link(struct vm_area_struct *vma,
struct anon_vma_chain *avc,
struct anon_vma *anon_vma)
@@ -208,13 +234,11 @@ static void anon_vma_chain_link(struct v
avc->anon_vma = anon_vma;
list_add(&avc->same_vma, &vma->anon_vma_chain);
- anon_vma_lock(anon_vma);
/*
* It's critical to add new vmas to the tail of the anon_vma,
* see comment in huge_memory.c:__split_huge_page().
*/
list_add_tail(&avc->same_anon_vma, &anon_vma->head);
- anon_vma_unlock(anon_vma);
}
/*
@@ -224,16 +248,23 @@ static void anon_vma_chain_link(struct v
int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
struct anon_vma_chain *avc, *pavc;
+ struct anon_vma *root = NULL;
list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma;
+
avc = anon_vma_chain_alloc();
if (!avc)
goto enomem_failure;
- anon_vma_chain_link(dst, avc, pavc->anon_vma);
+ anon_vma = pavc->anon_vma;
+ root = lock_anon_vma_root(root, anon_vma);
+ anon_vma_chain_link(dst, avc, anon_vma);
}
+ unlock_anon_vma_root(root);
return 0;
enomem_failure:
+ unlock_anon_vma_root(root);
unlink_anon_vmas(dst);
return -ENOMEM;
}
@@ -280,7 +311,9 @@ int anon_vma_fork(struct vm_area_struct
get_anon_vma(anon_vma->root);
/* Mark this anon_vma as the one where our new (COWed) pages go. */
vma->anon_vma = anon_vma;
+ anon_vma_lock(anon_vma);
anon_vma_chain_link(vma, avc, anon_vma);
+ anon_vma_unlock(anon_vma);
return 0;
@@ -291,36 +324,42 @@ int anon_vma_fork(struct vm_area_struct
return -ENOMEM;
}
-static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
-{
- struct anon_vma *anon_vma = anon_vma_chain->anon_vma;
- int empty;
-
- /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
- if (!anon_vma)
- return;
-
- anon_vma_lock(anon_vma);
- list_del(&anon_vma_chain->same_anon_vma);
-
- /* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
- anon_vma_unlock(anon_vma);
-
- if (empty)
- put_anon_vma(anon_vma);
-}
-
void unlink_anon_vmas(struct vm_area_struct *vma)
{
struct anon_vma_chain *avc, *next;
+ struct anon_vma *root = NULL;
/*
* Unlink each anon_vma chained to the VMA. This list is ordered
* from newest to oldest, ensuring the root anon_vma gets freed last.
*/
list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
- anon_vma_unlink(avc);
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ /* If anon_vma_fork fails, we can get an empty anon_vma_chain. */
+ if (anon_vma) {
+ root = lock_anon_vma_root(root, anon_vma);
+ list_del(&avc->same_anon_vma);
+ /* Leave empty anon_vmas on the list. */
+ if (list_empty(&anon_vma->head))
+ continue;
+ }
+ list_del(&avc->same_vma);
+ anon_vma_chain_free(avc);
+ }
+ unlock_anon_vma_root(root);
+
+ /*
+ * Iterate the list once more, it now only contains empty and unlinked
+ * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
+ * needing to acquire the anon_vma->root->mutex.
+ */
+ list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
+ struct anon_vma *anon_vma = avc->anon_vma;
+
+ if (anon_vma)
+ put_anon_vma(anon_vma);
+
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}
--