Date:	Mon, 15 Jun 2009 22:47:15 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>, mingo@...hat.com,
	hpa@...or.com, paulus@...ba.org, acme@...hat.com,
	linux-kernel@...r.kernel.org, a.p.zijlstra@...llo.nl,
	penberg@...helsinki.fi, vegard.nossum@...il.com, efault@....de,
	jeremy@...p.org, npiggin@...e.de, tglx@...utronix.de,
	linux-tip-commits@...r.kernel.org
Subject: Re: [tip:perfcounters/core] perf_counter: x86: Fix call-chain
	support to use NMI-safe methods


* Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:

> In the category "crazy ideas one should never express out loud", I 
> could add the following. We could choose to save/restore the cr2 
> register on the local stack at every interrupt entry/exit, and 
> therefore allow the page fault handler to execute with interrupts 
> enabled.
> 
> I have not benchmarked the interrupt disabling overhead of the 
> page fault handler handled by starting an interrupt-gated handler 
> rather than trap-gated handler, but cli/sti instructions are known 
> to take quite a few cycles on some architectures. e.g. 131 cycles 
> for the pair on P4, 23 cycles on AMD Athlon X2 64, 43 cycles on 
> Intel Core2.

The cost on Nehalem (1 billion local_irq_save()+restore() pairs):

 aldebaran:~> perf stat --repeat 5 ./prctl 0 0

 Performance counter stats for './prctl 0 0' (5 runs):

   10950.813461  task-clock-msecs     #      0.997 CPUs    ( +-   1.594% )
              3  context-switches     #      0.000 M/sec   ( +-   0.000% )
              1  CPU-migrations       #      0.000 M/sec   ( +-   0.000% )
            145  page-faults          #      0.000 M/sec   ( +-   0.000% )
    33946294720  cycles               #   3099.888 M/sec   ( +-   1.132% )
     8030365827  instructions         #      0.237 IPC     ( +-   0.006% )
         100933  cache-references     #      0.009 M/sec   ( +-  12.568% )
          27250  cache-misses         #      0.002 M/sec   ( +-   3.897% )

   10.985768499  seconds time elapsed.

That's 33.9 cycles per iteration, with about 1.1% run-to-run noise.
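
For anyone who wants to re-derive the per-iteration figure, here is a minimal sketch of the arithmetic. The counter values are copied from the perf stat output above; the iteration count assumes exactly the 1 billion pairs stated, and the variable names are only illustrative.

```python
# Counter values copied from the perf stat output above.
cycles = 33_946_294_720
instructions = 8_030_365_827

# Assumption: the test loop ran exactly 1 billion save+restore pairs.
iterations = 1_000_000_000

cycles_per_iter = cycles / iterations
insns_per_iter = instructions / iterations
ipc = instructions / cycles

print(f"{cycles_per_iter:.1f} cycles per iteration")
print(f"{insns_per_iter:.2f} instructions per iteration")
print(f"{ipc:.3f} IPC")
```

The instructions-per-iteration figure comes out at roughly 8, which is consistent with an 8-instruction loop body dominating the run.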

Annotation gives this result:

    2.24 :      ffffffff810535e5:       9c                      pushfq 
    8.58 :      ffffffff810535e6:       58                      pop    %rax
   10.99 :      ffffffff810535e7:       fa                      cli    
   20.38 :      ffffffff810535e8:       50                      push   %rax
    0.00 :      ffffffff810535e9:       9d                      popfq  
   46.71 :      ffffffff810535ea:       ff c6                   inc    %esi
    0.42 :      ffffffff810535ec:       3b 35 72 31 76 00       cmp    0x763172(%rip),%e
   10.69 :      ffffffff810535f2:       7c f1                   jl     ffffffff810535e5 
    0.00 :      ffffffff810535f4:       e9 7c 01 00 00          jmpq   ffffffff81053775 

i.e. pushfq+cli is roughly 42.19% or 14 cycles, the popfq is 46.71% 
or 16 cycles (its samples land on the following inc). So the combo 
cost is 30 cycles, +- 1 cycle.
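
The split can be reproduced from the annotation percentages, as sketched below. It assumes the 33.9 cycles/iteration figure and that the samples shown on the inc belong to the popfq just before it; the variable names are illustrative.

```python
# Cycles per loop iteration, from the perf stat run.
cycles_per_iter = 33.9

# Annotation percentages for pushfq, pop, cli and push (the save+disable half).
save_and_cli_pct = 2.24 + 8.58 + 10.99 + 20.38

# Assumption: the 46.71% shown on the inc is really the cost of the popfq.
popfq_pct = 46.71

save_and_cli_cycles = cycles_per_iter * save_and_cli_pct / 100
popfq_cycles = cycles_per_iter * popfq_pct / 100
combo_cycles = save_and_cli_cycles + popfq_cycles

print(f"save+cli: {save_and_cli_pct:.2f}% = {save_and_cli_cycles:.1f} cycles")
print(f"popfq:    {popfq_pct:.2f}% = {popfq_cycles:.1f} cycles")
print(f"combo:    {combo_cycles:.1f} cycles")
```

The remaining ~11% of the samples sit on the loop's cmp/jl, which is why the combo comes out below the full 33.9 cycles.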

(Actual effective cost in a real critical section can be better than 
this, dependent on surrounding instructions.)

Nehalem got quite a bit faster than Core2 (30 vs. 43 cycles) - but 
it is still not as fast as the AMD Athlon X2 64 (23 cycles).

	Ingo