linux-kernel - Re: [PATCH 0/3] TLB flush multiple pages per IPI v5


Message-ID: <5576042E.9030001@intel.com>
Date:	Mon, 08 Jun 2015 14:07:58 -0700
From:	Dave Hansen <dave.hansen@...el.com>
To:	Ingo Molnar <mingo@...nel.org>
CC:	Mel Gorman <mgorman@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>,
	Hugh Dickins <hughd@...gle.com>,
	Minchan Kim <minchan@...nel.org>,
	Andi Kleen <andi@...stfloor.org>,
	H Peter Anvin <hpa@...or.com>, Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH 0/3] TLB flush multiple pages per IPI v5

On 06/08/2015 12:52 PM, Ingo Molnar wrote:
> A CR3 driven TLB flush takes less time than a single INVLPG (!):
> 
>    [    0.389028] x86/fpu: Cost of: __flush_tlb()               fn            :    96 cycles
>    [    0.405885] x86/fpu: Cost of: __flush_tlb_one()           fn            :   260 cycles
>    [    0.414302] x86/fpu: Cost of: __flush_tlb_range()         fn            :   404 cycles

How was that measured, btw?  Are these instructions running in a loop?
Does __flush_tlb_one() include the tracepoint?
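
For reference, a loop-based measurement of the kind being asked about
could look roughly like the sketch below. It is a user-space
illustration only, not the method used for either set of numbers in
this thread: read_counter(), op_under_test() and avg_cost() are
hypothetical names, and op_under_test() is a trivial stand-in because
the kernel flush functions cannot be called from user space. A tight
loop amortizes the timer overhead but reports a warm, best-case cost,
which is one reason it can disagree with per-event tracing.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the x86 time-stamp counter. */
static inline uint64_t read_counter(void)
{
	return __rdtsc();
}
#else
/* Fallback for non-x86 builds: returns nanoseconds, not cycles. */
static inline uint64_t read_counter(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical stand-in for the operation being measured. */
static volatile uint64_t sink;

static void op_under_test(void)
{
	sink++;
}

/*
 * Average cost of one call to fn(), in counter ticks, over 'iters'
 * back-to-back calls.  Loop overhead is included in the result.
 */
static uint64_t avg_cost(void (*fn)(void), uint64_t iters)
{
	uint64_t start, end, i;

	if (iters == 0)
		return 0;

	start = read_counter();
	for (i = 0; i < iters; i++)
		fn();
	end = read_counter();

	return (end - start) / iters;
}
```

Calling avg_cost(op_under_test, 1000000) gives an average in ticks per
call. Whether a tracepoint sits inside the measured function changes
the result, which is the point of the second question above.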

The numbers below come from the commit I referenced.  They were
(probably) collected with a different method than yours: "FULL" is
__flush_tlb() and "1" is __flush_tlb_one().  The "cycles" figures
include some overhead from the tracing:

>       FULL:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
>          1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895

So it looks like we've got some discrepancy, either from the test
methodology or the CPU.  All of the code and my methodology are in the
commit.  Could you share yours?
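
To put the discrepancy in one number, the ratio of a full flush to a
single-page flush can be computed from each data set.  The helper name
full_to_single_ratio() is hypothetical; the inputs are the figures
quoted above.  This compares only the direct cost of the flush itself
and says nothing about the TLB refill cost afterwards.

```c
/*
 * Cost of one full TLB flush expressed in units of one single-page
 * flush.  A value below 1.0 means the full flush is cheaper than even
 * one single-page flush.
 */
static double full_to_single_ratio(double full_cycles, double single_cycles)
{
	return full_cycles / single_cycles;
}

/* Figures quoted in this message. */
static const double ingo_full = 96.0,   ingo_single = 260.0;   /* boot-time numbers */
static const double dave_full = 2283.0, dave_single = 1276.0;  /* tracing numbers */
```

With the boot-time numbers the ratio is about 0.37 (96 / 260), and with
the tracing numbers it is about 1.79 (2283 / 1276), so the two
measurements disagree even on which operation is cheaper.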

> it's true that a full flush has hidden costs not measured above, because it has 
> knock-on effects (because it drops non-global TLB entries), but it's not _that_ 
> bad due to:
> 
>   - there almost always being a L1 or L2 cache miss when a TLB miss occurs,
>     which latency can be overlaid
> 
>   - global bit being held for kernel entries
> 
>   - user-space with high memory pressure trashing through TLBs typically
> 
> ... and especially with caches and Intel's historically phenomenally low TLB 
> refill latency it's difficult to measure the effects of local TLB refills, let 
> alone measure it in any macro benchmark.

All that you're saying there is that you need to consider how TLB misses
act in _practice_ and not just measure worst-case or theoretical TLB
miss cost.  I completely agree with that.

...
> INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as 
> clear-cut as it might look.

It's clear as mud!

I'd be very interested to see any numbers for how this affects real
workloads.  I've been unable to find anything that was measurably
affected by invlpg vs a full flush.