Message-ID: <20090615211209.GA27100@elte.hu>
Date: Mon, 15 Jun 2009 23:12:09 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, mingo@...hat.com,
hpa@...or.com, paulus@...ba.org, acme@...hat.com,
linux-kernel@...r.kernel.org, a.p.zijlstra@...llo.nl,
penberg@...helsinki.fi, vegard.nossum@...il.com, efault@....de,
jeremy@...p.org, npiggin@...e.de, tglx@...utronix.de,
linux-tip-commits@...r.kernel.org
Subject: Re: [tip:perfcounters/core] perf_counter: x86: Fix call-chain
support to use NMI-safe methods

* Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:
> * Ingo Molnar (mingo@...e.hu) wrote:
> >
> > * Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:
> >
> > > In the category "crazy ideas one should never express out loud", I
> > > could add the following. We could choose to save/restore the cr2
> > > register on the local stack at every interrupt entry/exit, and
> > > therefore allow the page fault handler to execute with interrupts
> > > enabled.
> > >
> > > I have not benchmarked the interrupt-disabling overhead that the
> > > page fault handler would incur from being entered through an
> > > interrupt gate rather than a trap gate, but cli/sti instructions
> > > are known to take quite a few cycles on some architectures: e.g.
> > > 131 cycles for the pair on P4, 23 cycles on AMD Athlon X2 64, 43
> > > cycles on Intel Core2.
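
[ For concreteness, a minimal hypothetical sketch of the cr2
  save/restore idea above - illustration only, not actual kernel code;
  the sketch_*() helpers and handle_irq() are made-up names standing
  in for the real entry-code and IRQ work: ]

	/*
	 * Illustrative names only, not real kernel APIs. Save %cr2
	 * across an interrupt so that a page fault handler interrupted
	 * by this IRQ - which may itself fault and thereby clobber
	 * %cr2 - still sees its original fault address. With that
	 * guarantee the page fault handler could run with interrupts
	 * enabled. Kernel (CPL0) context assumed.
	 */
	static inline unsigned long sketch_read_cr2(void)
	{
		unsigned long val;

		asm volatile("mov %%cr2, %0" : "=r" (val));
		return val;
	}

	static inline void sketch_write_cr2(unsigned long val)
	{
		asm volatile("mov %0, %%cr2" : : "r" (val) : "memory");
	}

	extern void handle_irq(void);	/* stand-in for the real IRQ work */

	void irq_entry_exit_sketch(void)
	{
		/* saved on the local (kernel) stack: */
		unsigned long saved_cr2 = sketch_read_cr2();

		handle_irq();			/* may page-fault internally */

		sketch_write_cr2(saved_cr2);	/* restore before returning */
	}
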
> >
> > The cost on Nehalem (1 billion local_irq_save()+restore() pairs):
> >
> > aldebaran:~> perf stat --repeat 5 ./prctl 0 0
> >
> > Performance counter stats for './prctl 0 0' (5 runs):
> >
> >    10950.813461  task-clock-msecs   #      0.997 CPUs    ( +-   1.594% )
> >               3  context-switches   #      0.000 M/sec   ( +-   0.000% )
> >               1  CPU-migrations     #      0.000 M/sec   ( +-   0.000% )
> >             145  page-faults        #      0.000 M/sec   ( +-   0.000% )
> >     33946294720  cycles             #   3099.888 M/sec   ( +-   1.132% )
> >      8030365827  instructions       #      0.237 IPC     ( +-   0.006% )
> >          100933  cache-references   #      0.009 M/sec   ( +-  12.568% )
> >           27250  cache-misses       #      0.002 M/sec   ( +-   3.897% )
> >
> >    10.985768499  seconds time elapsed.
> >
> > That's 33.9 cycles per iteration, with a 1.1% confidence factor.
> >
> > Annotation gives this result:
> >
> >     2.24 :  ffffffff810535e5:   9c                      pushfq
> >     8.58 :  ffffffff810535e6:   58                      pop    %rax
> >    10.99 :  ffffffff810535e7:   fa                      cli
> >    20.38 :  ffffffff810535e8:   50                      push   %rax
> >     0.00 :  ffffffff810535e9:   9d                      popfq
> >    46.71 :  ffffffff810535ea:   ff c6                   inc    %esi
> >     0.42 :  ffffffff810535ec:   3b 35 72 31 76 00       cmp    0x763172(%rip),%esi
> >    10.69 :  ffffffff810535f2:   7c f1                   jl     ffffffff810535e5
> >     0.00 :  ffffffff810535f4:   e9 7c 01 00 00          jmpq   ffffffff81053775
> >
> > i.e. the pushfq+cli sequence is roughly 42.19%, or ~14 cycles, and
> > the popfq is 46.71%, or ~16 cycles. So the combo cost is ~30 cycles,
> > +- 1 cycle.
> >
> > (Actual effective cost in a real critical section can be better than
> > this, dependent on surrounding instructions.)
> >
> > It got quite a bit faster than Core2 - but still not as fast as AMD.
> >
> > Ingo
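
[ Presumably the loop being measured above looked something like the
  following - a sketch only, not Ingo's actual prctl test code; kernel
  context assumed, and irq_save_restore_bench() is a made-up name: ]

	#include <linux/irqflags.h>

	/*
	 * 1 billion local_irq_save()+local_irq_restore() pairs - this
	 * compiles down to roughly the pushfq/pop/cli ... push/popfq
	 * sequence shown in the annotation above.
	 */
	static void irq_save_restore_bench(void)
	{
		unsigned long flags;
		int i;

		for (i = 0; i < 1000000000; i++) {
			local_irq_save(flags);
			local_irq_restore(flags);
		}
	}
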
>
> Interesting, but in our specific case, what would be even more
> interesting to know is how many trap-gate vs. interrupt-gate entries
> per second can be serviced. This would allow us to see whether it's
> worth trying to make the page fault handler interrupt-safe by means
> of atomicity and of context save/restore in the interrupt handlers
> (which would let us run the PF handler with interrupts enabled).

See the numbers in the other mail: about 33 million pagefaults
happen in a typical kernel build - that's ~400K/sec - and a kernel
build is not a particularly pagefault-heavy workload.

OTOH, interrupt-gate (IRQ) rates above 10K/sec do get noticed and
get reduced. Above 100K/sec combined they are really painful. In
practice, a combined limit of around 10K/sec is healthy.

So pagefaults are roughly an order of magnitude more frequent than
IRQs.

In the worst case we have 10K irqs/sec and almost zero pagefaults -
there, every 10 cycles of overhead added to irq entry+exit causes
only a 0.003% total slowdown.
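
( To spell that arithmetic out, using the ~3.1 GHz cycle rate from
  the perf stat output above: 10,000 irqs/sec x 10 extra cycles =
  100,000 cycles/sec, and 100,000 / ~3.1e9 cycles/sec ~= 0.003% of
  total CPU time. )
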
So I'd say it's pretty safe to conclude that shuffling overhead from
the pagefault path into the irq path, even if it's a zero-sum game
in terms of cycles, is an overall win - or, in the worst case, a
negligible overhead.

Syscalls are even more critical: it's easy to have a 'good' workload
with millions of syscalls per second - so getting even a single
cycle off the syscall entry+exit path is worth the pain.
Ingo