[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20100105163939.a3f146fb.kamezawa.hiroyu@jp.fujitsu.com>
Date: Tue, 5 Jan 2010 16:39:39 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Minchan Kim <minchan.kim@...il.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, cl@...ux-foundation.org,
"hugh.dickins" <hugh.dickins@...cali.co.uk>,
Nick Piggin <nickpiggin@...oo.com.au>,
Ingo Molnar <mingo@...e.hu>
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
On Tue, 5 Jan 2010 14:30:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
> ==
> Above codes are a bit heavy(and buggy). I have some fixes.
>
Here is my latest one. But I myself think this should be improved more
and don't think this patch is bug-free. I have some concern
s around fork() etc..and still considering how to avoid atomic ops.
I post this just as a reference rather than cut-out from old e-mails.
Here is a comarison before after patch. A test program does following in
multi-thread. (attached)
while (1) {
touch memory (with write)
pthread_barrier_wait().
fork() if I'm the first thread.
-> exit() immediately
wait()
pthread_barrier_wait()
}
And see overheads by perf. This is result on 8core/2socket hosts.
This is highly affected by the number of sockets.
# Samples: 1164449628090
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
43.23% multi-fault-all [kernel] [k] smp_invalidate_interrupt
16.27% multi-fault-all [kernel] [k] flush_tlb_others_ipi
11.55% multi-fault-all [kernel] [k] _raw_spin_lock_irqsave <========(*)
6.23% multi-fault-all [kernel] [k] intel_pmu_enable_all
2.17% multi-fault-all [kernel] [k] _raw_spin_unlock_irqrestore
1.59% multi-fault-all [kernel] [k] page_fault
1.45% multi-fault-all [kernel] [k] do_wp_page
1.35% multi-fault-all ./multi-fault-all-fork [.] worker
1.26% multi-fault-all [kernel] [k] handle_mm_fault
1.19% multi-fault-all [kernel] [k] _raw_spin_lock
1.03% multi-fault-all [kernel] [k] invalidate_interrupt7
1.00% multi-fault-all [kernel] [k] invalidate_interrupt6
0.99% multi-fault-all [kernel] [k] invalidate_interrupt3
# Samples: 181505050964
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................ ......
#
45.08% multi-fault-all [kernel] [k] smp_invalidate_interrupt
19.45% multi-fault-all [kernel] [k] intel_pmu_enable_all
14.17% multi-fault-all [kernel] [k] flush_tlb_others_ipi
1.89% multi-fault-all [kernel] [k] do_wp_page
1.58% multi-fault-all [kernel] [k] page_fault
1.46% multi-fault-all ./multi-fault-all-fork [.] worker
1.26% multi-fault-all [kernel] [k] do_page_fault
1.14% multi-fault-all [kernel] [k] _raw_spin_lock
1.10% multi-fault-all [kernel] [k] flush_tlb_page
1.09% multi-fault-all [kernel] [k] vma_put <========(**)
0.99% multi-fault-all [kernel] [k] invalidate_interrupt0
0.98% multi-fault-all [kernel] [k] find_vma_speculative <=====(**)
0.81% multi-fault-all [kernel] [k] invalidate_interrupt3
0.81% multi-fault-all [kernel] [k] native_apic_mem_write
0.79% multi-fault-all [kernel] [k] invalidate_interrupt7
0.76% multi-fault-all [kernel] [k] invalidate_interrupt5
(*) is removed overhead of mmap_sem and (**) is added overhead of atomic ops.
And yes, 1% seems still not ideally small. And it's unknown how this false-sharing
of atomic ops can affect to very big SMP.....maybe very big.
BTW, I'm not sure why intel_pmu_enable_all() is very big...because of fork() ?
My patch is below, just as a reference of my work in the last year, not
very clean yet. I need more time for improvements (and does not adhere to
this implementation.)
When you test this, please turn off SPINLOCK_DEBUG to enable split-page-table-lock.
Regards,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Asynchronous page fault.
This patch is for avoidng mmap_sem in usual page fault. At running highly
multi-threaded programs, mm->mmap_sem can use much CPU because of false
sharing when it causes page fault in parallel. (Run after fork() is a typical
case, I think.)
This patch uses a speculative vma lookup to reduce that cost.
Considering vma lookup, rb-tree lookup, the only operation we do is checking
node->rb_left,rb_right. And there are no complicated operation.
At page fault, there are no demands for accessing sorted-vma-list or access
prev or next in many case. Except for stack-expansion, we always need a vma
which contains page-fault address. Then, we can access vma's RB-tree in
speculative way.
Even if RB-tree rotation occurs while we walk tree for look-up, we just
miss vma without oops. In other words, we can _try_ to find vma in lockless
manner. If failed, retry is ok.... we take lock and access vma.
For lockess walking, this uses RCU and adds find_vma_speculative(). And
per-vma wait-queue and reference count. This refcnt+wait_queue guarantees that
there are no thread which access the vma when we call subsystem's unmap
functions.
Changelog:
- removed rcu_xxx macros.
- fixed reference count handling bug at vma_put().
- removed atomic_add_unless() etc. But use Big number for handling race.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
---
arch/x86/mm/fault.c | 14 ++++++
include/linux/mm.h | 14 ++++++
include/linux/mm_types.h | 5 ++
mm/mmap.c | 98 +++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 127 insertions(+), 4 deletions(-)
Index: linux-2.6.33-rc2/include/linux/mm_types.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm_types.h
+++ linux-2.6.33-rc2/include/linux/mm_types.h
@@ -11,6 +11,7 @@
#include <linux/rwsem.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
+#include <linux/rcupdate.h>
#include <linux/page-debug-flags.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -180,6 +181,10 @@ struct vm_area_struct {
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
+ atomic_t refcnt;
+ wait_queue_head_t wait_queue;
+ struct rcu_head rcuhead;
+
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
Index: linux-2.6.33-rc2/mm/mmap.c
===================================================================
--- linux-2.6.33-rc2.orig/mm/mmap.c
+++ linux-2.6.33-rc2/mm/mmap.c
@@ -187,6 +187,26 @@ error:
return -ENOMEM;
}
+static void __free_vma_rcu_cb(struct rcu_head *head)
+{
+ struct vm_area_struct *vma;
+ vma = container_of(head, struct vm_area_struct, rcuhead);
+ kmem_cache_free(vm_area_cachep, vma);
+}
+/* Call this if vma was linked to rb-tree */
+static void free_vma_rcu(struct vm_area_struct *vma)
+{
+ call_rcu(&vma->rcuhead, __free_vma_rcu_cb);
+}
+#define VMA_FREE_MAGIC (10000000)
+/* called when vma is unlinked and wait for all racy access.*/
+static void invalidate_vma_before_free(struct vm_area_struct *vma)
+{
+ atomic_sub(VMA_FREE_MAGIC, &vma->refcnt);
+ wait_event(vma->wait_queue,
+ (atomic_read(&vma->refcnt) == -VMA_FREE_MAGIC));
+}
+
/*
* Requires inode->i_mapping->i_mmap_lock
*/
@@ -238,7 +258,7 @@ static struct vm_area_struct *remove_vma
removed_exe_file_vma(vma->vm_mm);
}
mpol_put(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ free_vma_rcu(vma);
return next;
}
@@ -404,8 +424,12 @@ __vma_link_list(struct mm_struct *mm, st
void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
struct rb_node **rb_link, struct rb_node *rb_parent)
{
+ atomic_set(&vma->refcnt, 0);
+ init_waitqueue_head(&vma->wait_queue);
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
rb_insert_color(&vma->vm_rb, &mm->mm_rb);
+ /* For speculative vma lookup */
+ smp_wmb();
}
static void __vma_link_file(struct vm_area_struct *vma)
@@ -490,6 +514,7 @@ __vma_unlink(struct mm_struct *mm, struc
rb_erase(&vma->vm_rb, &mm->mm_rb);
if (mm->mmap_cache == vma)
mm->mmap_cache = prev;
+ smp_wmb();
}
/*
@@ -614,6 +639,7 @@ again: remove_next = 1 + (end > next->
* us to remove next before dropping the locks.
*/
__vma_unlink(mm, next, vma);
+ invalidate_vma_before_free(next);
if (file)
__remove_shared_vm_struct(next, file, mapping);
if (next->anon_vma)
@@ -640,7 +666,7 @@ again: remove_next = 1 + (end > next->
}
mm->map_count--;
mpol_put(vma_policy(next));
- kmem_cache_free(vm_area_cachep, next);
+ free_vma_rcu(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
@@ -1544,6 +1570,71 @@ out:
}
/*
+ * Returns vma which contains given address. This scans rb-tree in speculative
+ * way and increment a reference count if found. Even if vma exists in rb-tree,
+ * this function may return NULL in racy case. So, this function cannot be used
+ * for checking whether given address is valid or not.
+ */
+
+struct vm_area_struct *
+find_vma_speculative(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma = NULL;
+ struct vm_area_struct *vma_tmp;
+ struct rb_node *rb_node;
+
+ if (unlikely(!mm))
+ return NULL;;
+
+ rcu_read_lock();
+ /*
+ * Barreir against modification of rb-tree
+ * rb-tree update is not an atomic ops and no barreir is used while
+ * modification. Then, modification to rb-tree can be reordered. This
+ * memory barrier is against vma_(un)link_rb() for avoiding to read
+ * too old data to catch all changes we get rcu_read_lock.
+ *
+ * We may see broken RB-tree and can't find existing vma. But it's ok.
+ * We allowed to return NULL even if valid one exists. The caller will
+ * use find_vma() with read-semaphore.
+ */
+
+ smp_read_barrier_depends();
+ rb_node = mm->mm_rb.rb_node;
+ vma = NULL;
+ while (rb_node) {
+ vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+
+ if (vma_tmp->vm_end > addr) {
+ vma = vma_tmp;
+ if (vma_tmp->vm_start <= addr)
+ break;
+ rb_node = rb_node->rb_left;
+ } else
+ rb_node = rb_node->rb_right;
+ }
+ if (vma) {
+ if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
+ if (atomic_inc_return(&vma->refcnt) < 0) {
+ vma_put(vma);
+ vma = NULL;
+ }
+ } else
+ vma = NULL;
+ }
+ rcu_read_unlock();
+ return vma;
+}
+
+void vma_put(struct vm_area_struct *vma)
+{
+ if ((atomic_dec_return(&vma->refcnt) == -VMA_FREE_MAGIC) &&
+ waitqueue_active(&vma->wait_queue))
+ wake_up(&vma->wait_queue);
+ return;
+}
+
+/*
* Verify that the stack growth is acceptable and
* update accounting. This is shared with both the
* grow-up and grow-down cases.
@@ -1808,6 +1899,7 @@ detach_vmas_to_be_unmapped(struct mm_str
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
do {
rb_erase(&vma->vm_rb, &mm->mm_rb);
+ invalidate_vma_before_free(vma);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
Index: linux-2.6.33-rc2/include/linux/mm.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm.h
+++ linux-2.6.33-rc2/include/linux/mm.h
@@ -1235,6 +1235,20 @@ static inline struct vm_area_struct * fi
return vma;
}
+/*
+ * Look up vma for given address in speculative way. This allows lockless lookup
+ * of vmas but in racy case, vma may no be found. You have to call find_vma()
+ * under read lock of mm->mmap_sem even if this function returns NULL.
+ * And, returned vma's reference count is incremented to show vma is accessed
+ * asynchronously, the caller has to call vma_put().
+ *
+ * Unlike find_vma(), this returns a vma which contains specified address.
+ * This doesn't return the nearest vma.
+ */
+extern struct vm_area_struct *find_vma_speculative(struct mm_struct *mm,
+ unsigned long addr);
+void vma_put(struct vm_area_struct *vma);
+
static inline unsigned long vma_pages(struct vm_area_struct *vma)
{
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
Index: linux-2.6.33-rc2/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.33-rc2.orig/arch/x86/mm/fault.c
+++ linux-2.6.33-rc2/arch/x86/mm/fault.c
@@ -952,6 +952,7 @@ do_page_fault(struct pt_regs *regs, unsi
struct mm_struct *mm;
int write;
int fault;
+ int speculative = 0;
tsk = current;
mm = tsk->mm;
@@ -1040,6 +1041,14 @@ do_page_fault(struct pt_regs *regs, unsi
return;
}
+ if ((error_code & PF_USER)) {
+ vma = find_vma_speculative(mm, address);
+ if (vma) {
+ speculative = 1;
+ goto good_area;
+ }
+ }
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
@@ -1136,5 +1145,8 @@ good_area:
check_v8086_mode(regs, address, tsk);
- up_read(&mm->mmap_sem);
+ if (speculative)
+ vma_put(vma);
+ else
+ up_read(&mm->mmap_sem);
}
View attachment "multi-fault-all-fork.c" of type "text/x-csrc" (1862 bytes)
Powered by blists - more mailing lists