Message-ID: <20131018064321.GG14264@gmail.com>
Date: Fri, 18 Oct 2013 08:43:21 +0200
From: Ingo Molnar <mingo@...nel.org>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: Neil Horman <nhorman@...driver.com>,
Eric Dumazet <eric.dumazet@...il.com>,
linux-kernel@...r.kernel.org, sebastien.dugue@...l.net,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, x86@...nel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

* H. Peter Anvin <hpa@...or.com> wrote:

> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> >
> > To correctly simulate the workload you'd have to:
> >
> > - allocate a buffer larger than your L2 cache.
> >
> > - to measure the effects of the prefetches you'd also have to randomize
> > the individual buffer positions. See how 'perf bench numa' implements a
> > random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> > Otherwise the CPU might learn your simplistic stream direction and the
> > L2 cache might hw-prefetch your data, interfering with any explicit
> > prefetches the code does. In many real-life usecases packet buffers are
> > scattered.
> >
> > Also, it would be nice to see standard deviation noise numbers when two
> > averages are close to each other, to be able to tell whether differences
> > are statistically significant or not.
>
>
> Seriously, though, how much does it matter? All the above seems likely
> to do is to drown the signal by adding noise.

I think it matters a lot, and I don't think it 'adds' noise - it measures
something else (cache-cold behavior, which is the common case for
first-time csum_partial() use on network packets), which was not measured
before and which by its nature has different noise patterns.

I've done many cache-cold measurements myself and had no trouble achieving
statistically significant results and high precision.
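
As a rough illustration of the measurement setup discussed above (randomized
buffer positions, plus a noise figure next to each average), here is a
minimal userspace sketch. It is not the code from tools/perf/bench/numa.c;
make_random_walk(), summarize_runs() and struct run_stats are hypothetical
names invented for this example. The caller is assumed to allocate a buffer
larger than L2, split it into packet-sized chunks, visit the chunks in the
shuffled order and time each pass.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: fill idx[0..n-1] with a shuffled permutation of
 * 0..n-1 (Fisher-Yates), so that chunk visits do not form a linear
 * stream the hardware prefetcher could learn. rand() is good enough for
 * a sketch; a real benchmark would want a better generator.
 */
void make_random_walk(size_t *idx, size_t n, unsigned int seed)
{
	size_t i;

	srand(seed);
	for (i = 0; i < n; i++)
		idx[i] = i;
	for (i = n; i > 1; i--) {
		size_t j = (size_t)rand() % i;
		size_t tmp = idx[i - 1];

		idx[i - 1] = idx[j];
		idx[j] = tmp;
	}
}

struct run_stats {
	double mean;
	double variance;	/* sample variance, n - 1 denominator */
};

/* Welford's single-pass mean and variance over n timing samples. */
struct run_stats summarize_runs(const double *samples, size_t n)
{
	struct run_stats st = { 0.0, 0.0 };
	double m2 = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		double delta = samples[i] - st.mean;

		st.mean += delta / (double)(i + 1);
		m2 += delta * (samples[i] - st.mean);
	}
	if (n > 1)
		st.variance = m2 / (double)(n - 1);
	return st;
}
```

The standard deviation is the square root of the reported variance. If two
averages differ by less than roughly one standard deviation, the difference
is probably not worth reading much into.
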
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?

Microbenchmarks can be totally blind to things like the ideal prefetch
window size (or whether a prefetch should be done at all: some CPUs will
throw away prefetches if enough regular fetches arrive).
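
To make the "prefetch window size" knob concrete, here is a hedged userspace
sketch of a 16-bit ones-complement sum with a tunable prefetch distance. It
is not csum_partial() and not the code from the patch under discussion;
csum_with_prefetch() and prefetch_dist are hypothetical names, the 64-byte
cache line size is an assumption, and __builtin_prefetch() is the GCC/Clang
builtin rather than the kernel's prefetch() helper.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum buf as big-endian 16-bit words with end-around carry (the sum is
 * returned uncomplemented). prefetch_dist is how many bytes ahead of the
 * current position to prefetch; 0 disables prefetching entirely.
 */
uint16_t csum_with_prefetch(const unsigned char *buf, size_t len,
			    size_t prefetch_dist)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 2 <= len; i += 2) {
		/* Issue one prefetch per (assumed) 64-byte cache line. */
		if (prefetch_dist && (i % 64) == 0 && i + prefetch_dist < len)
			__builtin_prefetch(buf + i + prefetch_dist, 0, 0);
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	}
	if (i < len)		/* odd trailing byte, padded with zero */
		sum += (uint32_t)buf[i] << 8;

	while (sum >> 16)	/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The result is the same for every prefetch_dist; only the timing can change.
A measurement would sweep prefetch_dist (including 0) over cache-cold
buffers and compare the averages together with their noise.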

Also, 'naive' single-threaded algorithms can occasionally be better in the
cache-cold case, because a linear, predictable stream of memory accesses
might saturate the memory bus better than a somewhat random-looking,
interleaved web of accesses that might not harmonize with buffer depths.

I _think_ that, if correctly tuned, the parallel algorithm should be better
in the cache-cold case. I just don't know with what parameters (the
algorithm has at least one free parameter: the prefetch window size), and
I don't know how significant the effect is.
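
For readers who want to see what "parallel" means here, the following is a
simplified sketch, not the patch itself: a single dependency chain versus
two independent accumulators that a superscalar CPU can feed to separate
ALUs. The names fold16(), csum_one_chain() and csum_two_chains() are
hypothetical, and the two-way split is an arbitrary choice made for this
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide sum down to 16 bits with end-around carry. */
uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* 'Naive' variant: every add depends on the previous one. */
uint16_t csum_one_chain(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 2 <= len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	return fold16(sum);
}

/* Two accumulators: the adds into a and b do not depend on each other. */
uint16_t csum_two_chains(const unsigned char *buf, size_t len)
{
	uint64_t a = 0, b = 0;
	size_t i;

	for (i = 0; i + 4 <= len; i += 4) {
		a += ((uint32_t)buf[i] << 8) | buf[i + 1];
		b += ((uint32_t)buf[i + 2] << 8) | buf[i + 3];
	}
	for (; i + 2 <= len; i += 2)	/* leftover 16-bit word */
		a += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte */
		a += (uint32_t)buf[i] << 8;
	return fold16(a + b);
}
```

Both variants return the same value because the additions can be reordered
freely. Which one is faster on cache-cold data is exactly the open question,
and only a measurement can answer it.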

Also, more fundamentally, I absolutely detest doing no measurements or
measuring the wrong thing - IMHO there are too many 'blind' optimization
commits in the kernel with little to no observational data attached.

Thanks,
Ingo