Message-ID: <alpine.LFD.2.00.0811171325260.18283@nehalem.linux-foundation.org>
Date:	Mon, 17 Nov 2008 13:34:33 -0800 (PST)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Ingo Molnar <mingo@...e.hu>
cc:	Eric Dumazet <dada1@...mosbay.com>,
	David Miller <davem@...emloft.net>, rjw@...k.pl,
	linux-kernel@...r.kernel.org, kernel-testers@...r.kernel.org,
	cl@...ux-foundation.org, efault@....de, a.p.zijlstra@...llo.nl,
	Stephen Hemminger <shemminger@...tta.com>
Subject: Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on
 each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, Ingo Molnar wrote:
> 
> this function _really_ hurts from a 16-bit op:
> 
> ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
> ffffffff80489445:        0 	00 00 
> ffffffff80489447:   174101 	5b                   	pop    %rbx

I don't think that is it, actually. The 16-bit store just before it had a 
zero count, even though anything that executes the second one will always 
execute the first one too.

The fact is, x86 profiles are subtle at an instruction level, and you tend 
to get profile hits _after_ the instruction that caused the cost because 
an interrupt (even an NMI) is always delayed to the next instruction (the 
one that didn't complete). And since the core will execute out-of-order, 
you don't even know what that one is, since there could easily be 
branches, but even in the absence of branches you have many instructions 
executing together.
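
A toy model of that skid, as a sketch only: it assumes a straight-line 
instruction stream, strictly in-order completion, and a fixed skid of one 
instruction, none of which holds on a real out-of-order core. The function 
name, the cycle costs and the sampling period are all made up for 
illustration. It only shows how samples taken while one slow instruction 
is running end up charged to the instruction after it.

```python
from collections import Counter


def attribute_samples(costs, period, skid=1):
    """Charge periodic samples to instructions in a toy in-order model.

    costs  -- hypothetical cycle cost of each instruction, in program order
    period -- cycles between sampling interrupts
    skid   -- how many instructions later the sample gets reported
    """
    hits = Counter()
    clock = 0
    next_sample = period
    for index, cost in enumerate(costs):
        clock += cost
        # Every sample that fired while this instruction was running is
        # reported against a later instruction (clamped to the last one).
        while next_sample <= clock:
            reported = min(index + skid, len(costs) - 1)
            hits[reported] += 1
            next_sample += period
    return hits


# Hypothetical stream: a cheap op, a 100-cycle stall, then two cheap ops.
costs = [1, 100, 1, 1]
skewed = attribute_samples(costs, period=10)            # with one-slot skid
ideal = attribute_samples(costs, period=10, skid=0)     # no skid at all
print(dict(skewed), dict(ideal))
```

With these made-up numbers, all ten samples land on index 2 (the cheap 
instruction after the stall) rather than on index 1, which is the one that 
actually cost the cycles.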

For example, in many situations the two 16-bit stores will happily execute 
together, and what you see may simply be a cache miss on the line that was 
stored to. The store buffer needs to resolve the read of the "pop" in 
order to complete, so having a big count in between stores and a 
subsequent load is not all that unlikely.

So doing per-instruction profiling is not useful unless you start looking 
at what preceded the instruction, and because of the out-of-order nature, 
you really almost have to look for cache misses or branch mispredicts.

One common reason for such a big count on an instruction that looks 
perfectly simple is often that there is a branch to that instruction that 
was mispredicted. Or that there was an instruction that was costly _long_ 
before, and that other instructions were in the shadow of that one 
completing (ie they had actually completed first, but didn't retire until 
the earlier instruction did).

So you really should never just look at the previous instruction or 
anything as simplistic as that. The time of in-order execution is long 
past.

		Linus