Date:	Tue, 19 Oct 2010 18:41:27 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	Steven Rostedt <rostedt@...dmis.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Koki Sanagi <sanagi.koki@...fujitsu.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...e.hu>,
	Frederic Weisbecker <fweisbec@...il.com>,
	nhorman@...driver.com, scott.a.mcmillan@...el.com,
	laijs@...fujitsu.com, LKML <linux-kernel@...r.kernel.org>,
	eric.dumazet@...il.com, kaneshige.kenji@...fujitsu.com,
	David Miller <davem@...emloft.net>, izumi.taku@...fujitsu.com,
	kosaki.motohiro@...fujitsu.com,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	"Luck, Tony" <tony.luck@...el.com>, Jason Baron <jbaron@...hat.com>
Subject: Re: [PATCH] tracing: Cleanup the convoluted softirq tracepoints

* H. Peter Anvin (hpa@...or.com) wrote:
> On 10/19/2010 02:23 PM, Steven Rostedt wrote:
> > 
> > But it seemed that gcc for you inlined the code in the wrong spot.
> > Perhaps it's not a good idea to have something like h - softirq_vec
> > in the parameter of the tracepoint. Not saying that your change is not
> > worth it. It is, because h - softirq_vec is used by others now too.
> > 
> 
> OK, first of all, there are some serious WTFs here:
> 
> # define JUMP_LABEL_INITIAL_NOP ".byte 0xe9 \n\t .long 0\n\t"
> 
> A jump instruction is one of the worst possible NOPs.  Why are we doing
> this?

This code is dynamically patched at boot time (and module load time) with a
better nop, just like the function tracer does.
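
As a rough illustration of what "patched with a better nop" means at the byte
level, here is a minimal userspace sketch that works on a plain buffer rather
than on live kernel text. The helper name patch_site() and the choice of this
particular 5-byte NOP encoding are assumptions for the example; the kernel
selects its NOP per CPU and uses its own text-patching machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SITE_LEN 5

/* Bytes emitted by JUMP_LABEL_INITIAL_NOP: jmp rel32 with a zero offset. */
static const unsigned char initial_jmp[SITE_LEN] = {
	0xe9, 0x00, 0x00, 0x00, 0x00
};

/*
 * One 5-byte NOP encoding listed by Intel (nopl 0x0(%rax,%rax,1) in
 * 64-bit mode). Assumed here for illustration only.
 */
static const unsigned char better_nop[SITE_LEN] = {
	0x0f, 0x1f, 0x44, 0x00, 0x00
};

/*
 * Hypothetical helper: replace the placeholder jmp with the NOP.
 * Returns true if the site held the expected placeholder and was patched,
 * false if it was left untouched.
 */
static bool patch_site(unsigned char *site)
{
	assert(site != NULL);

	if (memcmp(site, initial_jmp, SITE_LEN) != 0)
		return false;

	memcpy(site, better_nop, SITE_LEN);
	return true;
}
```

Checking for the expected placeholder before writing means a site that was
already patched is left alone on a second pass.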

> 
> The second thing that I found when implementing static_cpu_has() was
> that it is actually better to encapsulate the asm goto in a small inline
> which returns bool (true/false) -- gcc will happily optimize out the
> variable and only see it as a flow of control thing.  I would be very
> curious if that wouldn't make gcc generate better code in cases like that.
> 
> gcc 4.5.0 has a bug in that there must be a flowthrough case in the asm
> goto (you can't have it unconditionally branch one way or the other), so
> that should be the likely case and accordingly it should be annotated
> likely() so that gcc doesn't reorder.  I suspect in the end one ends up
> with code like this:
> 
> static __always_inline __pure bool __switch_point(...)
> {
> 	asm goto("1: " JUMP_LABEL_INITIAL_NOP
> 		 /* ... patching stuff */
> 		: : : : t_jump);
> 	return false;
> t_jump:
> 	return true;
> }
> 
> #define SWITCH_POINT(x) unlikely(__switch_point(x))
> 
> I *suspect* this will resolve the need for hot/cold labels just fine.

Thanks for the hint! We'll make sure to try it out. The ability to force gcc
to put the tracepoint in an unlikely branch is badly needed here.
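
For anyone wanting to experiment outside the kernel, the pattern suggested
above can be sketched in userspace roughly as follows. This is only a sketch
under assumptions: the asm template is left empty in place of the patchable
NOP (so the fall-through path is always taken), and the names switch_point(),
SWITCH_POINT(), do_work() and trace_hits are made up for the example. It needs
a compiler with asm goto support.

```c
#include <assert.h>
#include <stdbool.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Wrap the asm goto in a small always-inline function returning bool, so
 * the compiler sees it purely as control flow. The empty template stands
 * in for the patchable NOP; nothing ever jumps to t_jump here.
 */
static inline __attribute__((always_inline)) bool switch_point(void)
{
	__asm__ goto("" : : : : t_jump);
	return false;	/* fall-through case: switch point disabled */
t_jump:
	return true;	/* reached only if the site were patched to jump */
}

#define SWITCH_POINT() unlikely(switch_point())

static int trace_hits;

static int do_work(int x)
{
	if (SWITCH_POINT())
		trace_hits++;	/* cold path, stands in for the tracepoint call */
	return x + 1;
}

/* Small self-check: the cold path is never taken with the empty template. */
static void demo(void)
{
	assert(do_work(1) == 2);
	assert(trace_hits == 0);
}
```

Comparing the generated code with and without the unlikely() annotation (for
example with gcc -O2 -S) is one way to see where the compiler places the cold
block.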

I'm a bit curious about the nop vs jump overhead comparison you are referring
to. Is it an instruction latency benchmark or a throughput benchmark?

Intel's manual "Intel 64 and IA-32 Architectures Optimization Reference Manual"

http://www.intel.com/Assets/PDF/manual/248966.pdf

Page C-33 (or 577 in the pdf)

"7. Selection of conditional jump instructions should be based on the
    recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to
    improve the predictability of branches. When branches are predicted
    successfully, the latency of jcc is effectively zero."

So it mentions "jcc", but not jmp. Is there any reason for jmp to have a higher
latency than jcc?

In this manual, the latency of a predicted jcc is therefore 0 cycles, and its
throughput is 0.5 cycle/insn.

NOP (page C-29) is stated to have a latency of 0.5 to 1 cycle (depending on
the exact HW), and a throughput of 0.5 cycle/insn.

However, I have not found "jmp" explicitly in this listing.

So if we were executing tracepoints in a maze of jumps, we could argue that
instruction throughput is the most important there. However, if we expect the
common case to be surrounded by some non-ALU instructions, latency tends to
become the most important criterion.

But I feel I might be missing something important that distinguishes "jcc" from
"jmp".

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com