netdev - Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20131101140655.GA8467@hmsreliant.think-freely.org>
Date:	Fri, 1 Nov 2013 10:06:55 -0400
From:	Neil Horman <nhorman@...driver.com>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Eric Dumazet <eric.dumazet@...il.com>,
	linux-kernel@...r.kernel.org, sebastien.dugue@...l.net,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
	netdev@...r.kernel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@...driver.com> wrote:
> 
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@...driver.com> wrote:
> > > 
> > > > > etc. For such short runtimes make sure the last column displays 
> > > > > close to 100%, so that the PMU results become trustable.
> > > > > 
> > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > > 
> > > > > The last colum tells you what percentage of the runtime that 
> > > > > particular event was actually active. 100% (or empty last column) 
> > > > > means it was active all the time.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > 	Ingo
> > > > > 
> > > > 
> > > > Hmm, 
> > > > 
> > > > I ran this test:
> > > > 
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > > done
> > > 
> > > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > > events, but here you want to use fewer events, to increase the 
> > > precision of the measurement.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Thank you ingo, that fixed it.  I'm trying some other variants of 
> > the csum algorithm that Doug and I discussed last night, but FWIW, 
> > the relative performance of the 4 test cases 
> > (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> > feel like at this point, theres very little point in doing 
> > parallel alu operations (unless we can find a way to break the 
> > dependency on the carry flag, which is what I'm tinkering with 
> > now).
> 
> I would still like to encourage you to pick up the improvements that 
> Doug measured (mostly via prefetch tweaking?) - that looked like 
> some significant speedups that we don't want to lose!
> 
Well, yes, I made a line item of that in my subsequent note below.  I'm going to
repost that shortly, and I suggested that we revisit this when the AVX
instruction extensions are available.

> Also, trying to stick the in-kernel implementation into 'perf bench' 
> would be a useful first step as well, for this and future efforts.
> 
> See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
> up the in-kernel assembly memcpy implementations:
> 
Yes, I'll look into adding this as well
Regards
Neil


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html