linux-kernel - Re: [PATCH v2 2/2] x86: add prefetching to do

Message-ID: <20131106181108.GA11233@neilslaptop.think-freely.org>
Date:	Wed, 6 Nov 2013 13:11:08 -0500
From:	Neil Horman <nhorman@...driver.com>
To:	Joe Perches <joe@...ches.com>
Cc:	Dave Jones <davej@...hat.com>, linux-kernel@...r.kernel.org,
	sebastien.dugue@...l.net, Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH v2 2/2] x86: add prefetching to do_csum

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on IP over InfiniBand workloads.  After a lot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 
It was not.  I did not bother to try it, since with accurate branch prediction
the loop and the prefetch issue should both be very fast.  I'd also be worried
that capping the prefetch would be relatively hardware specific, so what worked
well on some hardware wouldn't be enough on other hardware.  I'd rather just
issue the prefetch for the whole buffer, as that should produce consistent
results.
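
To make the two policies concrete, here is a rough user-space sketch (not the
code from the patch).  LINE_SIZE, the return value and prefetch_lines_capped()
are assumptions made for illustration; the kernel would use cache_line_size()
and prefetch() rather than a fixed 64 and the GCC/Clang __builtin_prefetch().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed line size for this sketch; the kernel would ask cache_line_size(). */
#define LINE_SIZE 64u

/*
 * Issue one prefetch hint per LINE_SIZE step across [addr, addr + len).
 * Returns the number of hints issued so the two policies can be compared.
 * Stepping starts at addr, so for an unaligned buffer the last partial
 * line may not be touched; good enough for an illustration.
 */
static inline size_t prefetch_lines(const void *addr, size_t len)
{
	const char *p = addr;
	size_t issued = 0;

	for (size_t off = 0; off < len; off += LINE_SIZE) {
		__builtin_prefetch(p + off);	/* hint only, never faults */
		issued++;
	}
	return issued;
}

/*
 * Hypothetical capped variant along the lines Joe suggested: prefetch at
 * most max_lines lines and leave the rest to the hardware prefetcher.
 */
static inline size_t prefetch_lines_capped(const void *addr, size_t len,
					   size_t max_lines)
{
	size_t cap = max_lines * LINE_SIZE;

	assert(max_lines > 0);
	return prefetch_lines(addr, len < cap ? len : cap);
}
```

With a 4096 byte buffer the whole-buffer version issues 64 hints and the
capped version with max_lines = 8 issues 8.  Whether 8 is the right cap is
exactly the hardware-specific question raised above.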

> Also pedantry/trivial comments:
> 
> __always_inline instead of inline
> static __always_inline void prefetch_lines(const void *addr, size_t len)
> {
> 	const void *end = addr + len;
> ...
> 
> buff doesn't need a void * cast in prefetch_lines
> 
ACK
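
For reference, a minimal user-space sketch of what those two comments amount
to.  The loop body is my guess, not the elided part of Joe's snippet, and
always_inline_sketch, LINE_SIZE and sum_bytes() are hypothetical names that
stand in for the kernel's __always_inline, cache_line_size() and do_csum.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64u	/* assumed; not taken from the patch */

/* User-space stand-in for the kernel's __always_inline. */
#define always_inline_sketch inline __attribute__((__always_inline__))

static always_inline_sketch void prefetch_lines(const void *addr, size_t len)
{
	const char *p = addr;	/* void * converts implicitly in C */

	for (size_t off = 0; off < len; off += LINE_SIZE)
		__builtin_prefetch(p + off);
}

/* Hypothetical caller: a plain byte sum stands in for the real checksum. */
static unsigned int sum_bytes(const unsigned char *buff, size_t len)
{
	unsigned int sum = 0;

	assert(buff != NULL || len == 0);
	prefetch_lines(buff, len);	/* no (void *) cast required here */
	for (size_t i = 0; i < len; i++)
		sum += buff[i];
	return sum;
}
```

Because the parameter is a const void *, any object pointer such as buff is
accepted at the call site without a cast.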

> Beside the commit message, the comment above prefetch_lines
> also needs updating to remove the "Manual Prefetching" line.
> 
Yup, Dave noted the Manual Prefetch issue, and I'll move the whole comment as
part of that.

> /*
>  * Do a 64-bit checksum on an arbitrary memory area.
>  * Returns a 32bit checksum.
>  *
>  * This isn't as time critical as it used to be because many NICs
>  * do hardware checksumming these days.
>  * 
>  * Things tried and found to not make it faster:
>  * Manual Prefetching
>  * Unrolling to an 128 bytes inner loop.
>  */
> 
> 
> 

Regards
Neil
