Message-ID: <20131015073248.GA25493@gmail.com>
Date: Tue, 15 Oct 2013 09:32:48 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Neil Horman <nhorman@...driver.com>
Cc: linux-kernel@...r.kernel.org, sebastien.dugue@...l.net,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
* Neil Horman <nhorman@...driver.com> wrote:
> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> >
> > * Neil Horman <nhorman@...driver.com> wrote:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware) were spending a significant amount
> > > of time computing checksums. We found that by splitting the checksum
> > > computation into two separate streams, each skipping successive elements
> > > of the buffer being summed, we could parallelize the checksum operation
> > > across multiple ALUs. Since neither chain is dependent on the result of
> > > the other, we get a speedup in execution (on hardware that has multiple
> > > ALUs available, which is almost ubiquitous on x86), and only a
> > > negligible decrease on hardware that has only a single ALU (an extra
> > > addition is introduced). Since addition is commutative, the result is
> > > the same, only faster.
> >
> > This patch should really come with measurement numbers: what performance
> > increase (and drop) did you get on what CPUs.
> >
> > Thanks,
> >
> > Ingo
> >
>
>
> So, early testing results today. I wrote a test module that allocated
> a 4k buffer, initialized it with random data, and called csum_partial on
> it 100000 times, recording the time at the start and end of that loop.
It would be nice to stick that testcase into tools/perf/bench/. See how we
are able to benchmark the kernel's memcpy and memset implementations there:
  $ perf bench mem memcpy -r help
  # Running 'mem/memcpy' benchmark:
  Unknown routine:help
  Available routines...
          default ... Default memcpy() provided by glibc
          x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
          x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
          x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
In a similar fashion we could build the csum_partial() code as well and do
measurements. (We could change arch/x86/ code as well to make such
embedding/including easier, as long as it does not change performance.)
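To make the idea concrete, here is a rough user-space sketch of what such a measurement could look like. It is not the kernel's csum_partial() and not an existing perf bench routine: the names fold64(), csum_one_chain(), csum_two_chains(), time_loop() and csum_bench() are hypothetical, the sums are plain C over 32-bit words rather than adc chains in assembly, and the buffer is assumed small enough that the 64-bit accumulators cannot overflow. It mirrors the 4k-buffer, repeated-call loop described above and compares one dependency chain against two independent ones.

```c
#define _POSIX_C_SOURCE 199309L	/* for clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fold a 64-bit accumulator down to a 16-bit ones' complement sum. */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Baseline: every addition depends on the previous one. */
static uint16_t csum_one_chain(const uint32_t *buf, size_t nwords)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += buf[i];
	return fold64(sum);
}

/* Two independent chains over alternating words, merged at the end. */
static uint16_t csum_two_chains(const uint32_t *buf, size_t nwords)
{
	uint64_t even = 0, odd = 0;
	size_t i;

	for (i = 0; i + 1 < nwords; i += 2) {
		even += buf[i];
		odd += buf[i + 1];
	}
	if (i < nwords)		/* odd word count: one word left over */
		even += buf[i];
	return fold64(even + odd);
}

/* Call fn() iters times on buf and return the elapsed seconds. */
static double time_loop(uint16_t (*fn)(const uint32_t *, size_t),
			const uint32_t *buf, size_t nwords, int iters,
			uint16_t *result)
{
	struct timespec t0, t1;
	volatile uint16_t sink = 0;	/* keep the calls from being dropped */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
		sink = fn(buf, nwords);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	*result = sink;
	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Fill a buffer with pseudo-random data and time both variants. */
int csum_bench(size_t bytes, int iters)
{
	size_t nwords = bytes / sizeof(uint32_t);
	uint32_t *buf = malloc(nwords * sizeof(*buf));
	uint16_t r1, r2;
	double t1, t2;

	if (!buf)
		return -1;
	srand(1);	/* fixed seed so runs are comparable */
	for (size_t i = 0; i < nwords; i++)
		buf[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

	t1 = time_loop(csum_one_chain, buf, nwords, iters, &r1);
	t2 = time_loop(csum_two_chains, buf, nwords, iters, &r2);
	assert(r1 == r2);	/* both orderings must give the same sum */
	printf("one chain:  %f s (csum 0x%04x)\n", t1, (unsigned)r1);
	printf("two chains: %f s (csum 0x%04x)\n", t2, (unsigned)r2);
	free(buf);
	return 0;
}
```

Calling csum_bench(4096, 100000) from a small main() would reproduce the shape of the test described above. Note that an optimizing compiler may unroll or vectorize both loops, so the numbers from this C version say little about the assembly implementation; embedding the real routine is what would give meaningful results.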
Thanks,
Ingo