Message-ID: <20121120090637.GA14873@gmail.com>
Date: Tue, 20 Nov 2012 10:06:37 +0100
From: Ingo Molnar <mingo@...nel.org>
To: David Rientjes <rientjes@...gle.com>
Cc: Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Paul Turner <pjt@...gle.com>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Christoph Lameter <cl@...ux.com>,
Rik van Riel <riel@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Johannes Weiner <hannes@...xchg.org>,
Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH 00/27] Latest numa/core release, v16
* David Rientjes <rientjes@...gle.com> wrote:
> On Tue, 20 Nov 2012, Ingo Molnar wrote:
>
> > > This happened to be an Opteron (but not 83xx series), 2.4Ghz.
> >
> > Ok - roughly which family/model from /proc/cpuinfo?
>
> It's close enough, it's 23xx.
Ok - which family/model number in /proc/cpuinfo?
I'm asking because the family/model matters most to page fault
micro-characteristics: the 23xx series existed in Barcelona form
(family/model 16/2) and still exists in its current Shanghai
form.
My guess is Barcelona 16/2?
If that is correct then the closest I can get to your topology
is a 4-socket 32-way Opteron system with 32 GB of RAM - which
seems close enough for testing purposes.
But testing numa/core on such a system still leaves me
absolutely puzzled, as I get the following with a similar
16-warehouse SPECjbb 2005 test, using java -Xms8192m -Xmx8192m
-Xss256k sizing, THP enabled, 2x 240 seconds runs (I tried to
configure it all very close to yours), using -tip-a07005cbd847:
  kernel        warehouses    transactions/sec
  ----------    ----------    ----------------
  v3.7-rc6:         16             197802
                    16             197997
  numa/core:        16             203086
                    16             203967
So sadly numa/core is about 2%-3% faster on this 4x4 system too!
:-/
But I have to say, your SPECjbb score is uncharacteristically
low even for an oddball-topology Barcelona system - which is the
oldest/slowest system I can think of. So there might be more to
this.
To further characterise a "good" SPECjbb run, note that there's
no page_fault overhead visible in perf top:
Mainline profile:
    94.99%  perf-1244.map   [.] 0x00007f04cd1aa523
     2.52%  libjvm.so       [.] 0x00000000007004a1
     0.62%  [vdso]          [.] 0x0000000000000972
     0.31%  [kernel]        [k] clear_page_c
     0.17%  [kernel]        [k] timekeeping_get_ns.constprop.7
     0.11%  [kernel]        [k] rep_nop
     0.09%  [kernel]        [k] ktime_get
     0.08%  [kernel]        [k] get_cycles
     0.06%  [kernel]        [k] read_tsc
     0.05%  libc-2.15.so    [.] __strcmp_sse2
numa/core profile:
    95.66%  perf-1201.map   [.] 0x00007fe4ad1c8fc7
     1.70%  libjvm.so       [.] 0x0000000000381581
     0.59%  [vdso]          [.] 0x0000000000000607
     0.19%  [kernel]        [k] do_raw_spin_lock
     0.11%  [kernel]        [k] generic_smp_call_function_interrupt
     0.11%  [kernel]        [k] timekeeping_get_ns.constprop.7
     0.08%  [kernel]        [k] ktime_get
     0.06%  [kernel]        [k] get_cycles
     0.05%  [kernel]        [k] __native_flush_tlb
     0.05%  [kernel]        [k] rep_nop
     0.04%  perf            [.] add_hist_entry.isra.9
     0.04%  [kernel]        [k] rcu_check_callbacks
     0.04%  [kernel]        [k] ktime_get_update_offsets
     0.04%  libc-2.15.so    [.] __strcmp_sse2
No page fault overhead (see the page fault rate further below) -
the NUMA scanning overhead shows up only through some mild TLB
flush activity (which I'll fix btw).
[ Stupid question: cpufreq is configured to always-2.4GHz,
right? If you could send me your kernel config (you can do
that privately as well) then I can try to boot it and see. ]
> > > It's perf top -U, the benchmark itself was unchanged so I
> > > didn't think it was interesting to gather the user
> > > symbols. If that would be helpful, let me know!
> >
> > Yeah, regular perf top output would be very helpful to get a
> > general sense of proportion. Thanks!
>
> Ok, here it is:
>
>    91.24%  perf-10971.map  [.] 0x00007f116a6c6fb8
>     1.19%  libjvm.so       [.] instanceKlass::oop_push_contents(PSPromotionMa
>     1.04%  libjvm.so       [.] PSPromotionManager::drain_stacks_depth(bool)
>     0.79%  libjvm.so       [.] PSPromotionManager::copy_to_survivor_space(oop
>     0.60%  libjvm.so       [.] PSPromotionManager::claim_or_forward_internal_
>     0.58%  [kernel]        [k] page_fault
>     0.28%  libc-2.3.6.so   [.] __gettimeofday
>     0.26%  libjvm.so       [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned
>     0.22%  [kernel]        [k] getnstimeofday
>     0.18%  libjvm.so       [.] CardTableExtension::scavenge_contents_parallel(ObjectS
>     0.15%  [kernel]        [k] _raw_spin_lock
>     0.12%  [kernel]        [k] ktime_get_update_offsets
>     0.11%  [kernel]        [k] ktime_get
>     0.11%  [kernel]        [k] rcu_check_callbacks
>     0.10%  [kernel]        [k] generic_smp_call_function_interrupt
>     0.10%  [kernel]        [k] read_tsc
>     0.10%  [kernel]        [k] clear_page_c
>     0.10%  [kernel]        [k] __do_page_fault
>     0.08%  [kernel]        [k] handle_mm_fault
>     0.08%  libjvm.so       [.] os::javaTimeMillis()
>     0.08%  [kernel]        [k] emulate_vsyscall
Oh, finally a clue: you seem to have vsyscall emulation
overhead!
Vsyscall emulation is fundamentally page fault driven - which
might explain why you are seeing page fault overhead. It might
also interact with other sources of faults - such as numa/core's
working set probing ...
Many JVMs try to be smart with the vsyscall. As a test, does the
vsyscall=native boot option change the results/behavior in any
way?
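
For reference, here's a minimal test sketch (just an
illustration I made up - it's not what the JVM does internally)
that hammers the legacy vsyscall gettimeofday entry at its
fixed address. Under vsyscall=emulate every call should show up
as a page fault handled by emulate_vsyscall(); under
vsyscall=native none should:

/* vsyscall-test.c: call the legacy vsyscall gettimeofday entry.
 *
 * The legacy vsyscall page sits at a fixed address and
 * gettimeofday is its first entry. With vsyscall=emulate each
 * call traps into the kernel via a page fault; with
 * vsyscall=native it executes directly.
 */
#include <stdio.h>
#include <sys/time.h>

#define VSYSCALL_GTOD	0xffffffffff600000UL

typedef int (*vgtod_t)(struct timeval *tv, struct timezone *tz);

int main(void)
{
	vgtod_t vgtod = (vgtod_t)VSYSCALL_GTOD;
	struct timeval tv;
	int i;

	/* ~1 million calls - i.e. ~1 million faults if emulated: */
	for (i = 0; i < 1000000; i++)
		vgtod(&tv, NULL);

	printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	return 0;
}

Running that under 'perf stat -e faults ./a.out' should make
the difference between the two vsyscall modes obvious.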
Stupid question: if you apply the patch attached below and do
page fault profiling while the run is in steady state:

  perf record -e faults -g -a sleep 10

do the faults often come from the vsyscall page?
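
[ In the resulting perf report call-graphs, emulated vsyscalls
  should show up as faults with the user RIP inside the
  vsyscall page, hitting emulate_vsyscall() on the kernel
  side. ]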
Also, this:

  perf stat -e faults -a --repeat 10 sleep 1

should normally report something like this during SPECjbb
steady state, with numa/core:
  warmup:        3,895 faults/sec  ( +- 12.11% )
  steady state:  3,910 faults/sec  ( +-  6.72% )
That is about 250 faults/sec per CPU - i.e. it should be barely
visible in profiles, let alone as prominent as in yours.
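
[ Note: the patch below moves the PERF_COUNT_SW_PAGE_FAULTS
  event to the very top of __do_page_fault() - vsyscall
  emulation faults are handled and return before the current
  instrumentation point, so without the patch they don't show
  up under 'perf record -e faults'. ]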
Thanks,
Ingo
---
arch/x86/mm/fault.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c
+++ linux/arch/x86/mm/fault.c
@@ -1030,6 +1030,9 @@ __do_page_fault(struct pt_regs *regs, un
 	/* Get the faulting address: */
 	address = read_cr2();
 
+	/* Instrument as early as possible: */
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
 	/*
 	 * Detect and handle instructions that would cause a page fault for
 	 * both a tracked kernel page and a userspace page.
@@ -1107,8 +1110,6 @@ __do_page_fault(struct pt_regs *regs, un
 		}
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	/*
	 * If we're in an interrupt, have no user context or are running
	 * in an atomic region then we must not take the fault:
--