Message-ID: <53B44C9A.9070808@nellans.org>
Date: Wed, 02 Jul 2014 13:16:58 -0500
From: David Nellans <david@...lans.org>
To: Dave Hansen <dave@...1.net>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "linux-mm@...ck.org" <linux-mm@...ck.org>,
"hpa@...or.com" <hpa@...or.com>,
"mingo@...hat.com" <mingo@...hat.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"x86@...nel.org" <x86@...nel.org>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"riel@...hat.com" <riel@...hat.com>,
"mgorman@...e.de" <mgorman@...e.de>
Subject: Re: [PATCH 7/7] x86: mm: set TLB flush tunable to sane value (33)
On 07/01/2014 11:48 AM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@...ux.intel.com>
>
> This has been run through Intel's LKP tests across a wide range
> of modern systems and workloads, and it wasn't shown to make a
> measurable performance difference, positive or negative.
>
> Now that we have some shiny new tracepoints, we can actually
> figure out what the heck is going on.
>
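> For reference, assuming the new event shows up as tlb/tlb_flush
> under the usual tracing directory (paths are the standard tracefs
> locations, not something this patch adds), the raw flush sizes can
> be collected with:
>
>	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
>	cat /sys/kernel/debug/tracing/trace_pipe
>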
> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> are for a single page. It breaks down like this:
>
> size percent percent<=
> V V V
> GLOBAL: 2.20% 2.20% avg cycles: 2283
> 1: 56.92% 59.12% avg cycles: 1276
> 2: 13.78% 72.90% avg cycles: 1505
> 3: 8.26% 81.16% avg cycles: 1880
> 4: 7.41% 88.58% avg cycles: 2447
> 5: 1.73% 90.31% avg cycles: 2358
> 6: 1.32% 91.63% avg cycles: 2563
> 7: 1.14% 92.77% avg cycles: 2862
> 8: 0.62% 93.39% avg cycles: 3542
> 9: 0.08% 93.47% avg cycles: 3289
> 10: 0.43% 93.90% avg cycles: 3570
> 11: 0.20% 94.10% avg cycles: 3767
> 12: 0.08% 94.18% avg cycles: 3996
> 13: 0.03% 94.20% avg cycles: 4077
> 14: 0.02% 94.23% avg cycles: 4836
> 15: 0.04% 94.26% avg cycles: 5699
> 16: 0.06% 94.32% avg cycles: 5041
> 17: 0.57% 94.89% avg cycles: 5473
> 18: 0.02% 94.91% avg cycles: 5396
> 19: 0.03% 94.95% avg cycles: 5296
> 20: 0.02% 94.96% avg cycles: 6749
> 21: 0.18% 95.14% avg cycles: 6225
> 22: 0.01% 95.15% avg cycles: 6393
> 23: 0.01% 95.16% avg cycles: 6861
> 24: 0.12% 95.28% avg cycles: 6912
> 25: 0.05% 95.32% avg cycles: 7190
> 26: 0.01% 95.33% avg cycles: 7793
> 27: 0.01% 95.34% avg cycles: 7833
> 28: 0.01% 95.35% avg cycles: 8253
> 29: 0.08% 95.42% avg cycles: 8024
> 30: 0.03% 95.45% avg cycles: 9670
> 31: 0.01% 95.46% avg cycles: 8949
> 32: 0.01% 95.46% avg cycles: 9350
> 33: 3.11% 98.57% avg cycles: 8534
> 34: 0.02% 98.60% avg cycles: 10977
> 35: 0.02% 98.62% avg cycles: 11400
>
> We get into diminishing returns pretty quickly. On pre-IvyBridge
> CPUs, we used to set the limit at 8 pages, and it was set at 128
> on IvyBridge. That 128 number looks pretty silly considering that
> less than 0.5% of the flushes are that large.
>
> The previous code tried to size this number based on the size of
> the TLB. Good idea, but it's error-prone, needs maintenance
> (which it didn't get up to now), and probably would not matter
> much in practice.
>
> Setting it to 33 means that we cover the mallopt
> M_TRIM_THRESHOLD, which is the most universally common size to do
> flushes.
>
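> In code terms, the tunable is just a cap on how many single-page
> invalidations we attempt before falling back to a full flush. A
> minimal sketch of the check in flush_tlb_mm_range() (simplified
> from arch/x86/mm/tlb.c; accounting and remote flushes omitted):
>
>	unsigned long tlb_single_page_flush_ceiling = 33;
>
>	unsigned long addr;
>	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
>
>	if (nr_pages > tlb_single_page_flush_ceiling) {
>		/* big range: one full TLB flush is cheaper */
>		local_flush_tlb();
>	} else {
>		/* small range: invalidate page by page */
>		for (addr = start; addr < end; addr += PAGE_SIZE)
>			__flush_tlb_single(addr);
>	}
>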
> That's the short version. Here's the long one for why I chose 33:
>
> 1. These numbers have a constant bias from the tracing
> timestamps. It probably accounts for a couple hundred cycles in
> each of these tests, but it should be fairly _even_ across all of
> them. The smallest delta between the tracepoints I have ever seen
> is 335 cycles. This is one reason the cycles/page cost goes down
> in general as the flushes get larger. The true per-page cost is
> nearer to 100 cycles.
> 2. A full flush is more expensive than a single invlpg, but not
> by much (single-digit percentages).
> 3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
> (~34 cycles). At those rates, refilling the 512-entry dTLB takes
> 22,000 cycles.
> 4. 22,000 cycles is approximately the equivalent of doing 85
> invlpg operations. But the odds are that the TLB can actually
> be refilled faster than that, because TLB misses that are close
> in time also tend to leverage the same caches (rough arithmetic
> after this list).
> 5. ~98% of flushes are <=33 pages. There are a lot of flushes of
> exactly 33 pages, probably because libc's M_TRIM_THRESHOLD is set
> to 128k (32 pages).
> 6. I've found no consistent data to support changing the
> IvyBridge vs. SandyBridge tunable by a factor of 16.
>
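> To put rough arithmetic behind points 3 and 4 (the ~260
> cycles/invlpg figure below is implied by the 22,000/85 ratio, not
> measured directly):
>
>	512 entries * ~43 cycles/dtlb miss ~= 22,000 cycles to refill
>	22,000 cycles / 85 invlpgs         ~= 260 cycles per invlpg
>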
> I used the performance counters on this hardware (IvyBridge i5-3320M)
> to figure out the tlb miss costs:
>
> ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush
>
> 7,720,030,970 dtlb_load_misses_walk_duration [57.13%]
> 169,856,353 dtlb_load_misses_walk_completed [57.15%]
> 708,832,859 dtlb_store_misses_walk_duration [57.17%]
> 19,346,823 dtlb_store_misses_walk_completed [57.17%]
> 2,779,687,402 itlb_misses_walk_duration [57.15%]
> 82,241,148 itlb_misses_walk_completed [57.13%]
> 770,717 itlb_itlb_flush [57.11%]
>
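> Dividing walk duration (in cycles) by walks completed gives the
> average cost of a miss:
>
>	7,720,030,970 / 169,856,353 ~= 45 cycles per dtlb load miss
>	2,779,687,402 /  82,241,148 ~= 34 cycles per itlb miss
>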
> These show that a dtlb miss costs 17.1ns (~45 cycles) and an itlb
> miss costs 13.0ns (~34 cycles). At those rates, refilling the
> 512-entry dTLB takes 22,000 cycles. On a SandyBridge system with
> more cores and larger caches, the costs are dtlb=13.4ns and
> itlb=9.5ns.
Is the intuition here that invalidate-caused refills will almost always be
serviced from the L2 or better, since we've recently walked the page tables
to modify the pages needing the flush and thus pre-warmed the caches for any
refill? Or is this an artifact of the flush/refill test setup? Main memory
latency even on IvyBridge is ~100 clocks, and worse in previous generations,
so to get down to an average ~30-cycle refill you basically can never miss
in the L1, or maybe the L2, which seems optimistic.