Message-ID: <20131112205916.GE19780@hmsreliant.think-freely.org>
Date:	Tue, 12 Nov 2013 15:59:16 -0500
From:	Neil Horman <nhorman@...driver.com>
To:	Joe Perches <joe@...ches.com>
Cc:	netdev <netdev@...r.kernel.org>, Dave Jones <davej@...hat.com>,
	linux-kernel@...r.kernel.org, sebastien.dugue@...l.net,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
	Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]

On Tue, Nov 12, 2013 at 12:38:01PM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> > On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> []
> > > > So, the numbers are correct now that I returned my hardware to its previous
> > > > interrupt affinity state, but the trend seems to be the same (namely that there
> > > > isn't a clear one).  We seem to find peak performance around a readahead of 2
> > > > cachelines, but it's very small (about 3%), and it's inconsistent (larger set
> > > > sizes fall to either side of that stride).  So I don't see it as a clear win.  I
> > > > still think we should probably scrap the readahead for now, just take the perf
> > > > bits, and revisit this when we can use the vector instructions or the
> > > > independent carry chain instructions to improve this more consistently.
> > > > 
> > > > Thoughts?
> > > 
> > > Perhaps a single prefetch, not of the first addr but of
> > > the addr after PREFETCH_STRIDE would work best but only
> > > if length is > PREFETCH_STRIDE.
> > > 
> > > I'd try:
> > > 
> > > 	if (len > PREFETCH_STRIDE)
> > > 		prefetch(buf + PREFETCH_STRIDE);
> > > 	while (count64) {
> > > 		etc...
> > > 	}
> > > 
> > > I still don't know how much that impacts very short lengths.
> > > Can you please add a 20 byte length to your tests?
> > Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> > only if the overall length of the input buffer is more than 2 cache lines.
> > Below are the results (all counts are the average of 1000000 iterations of the
> > csum operation, as previous tests were, I just omitted that column).
> > 
> > len	set	cycles/byte	cycles/byte	improvement
> > 		no prefetch	prefetch
> > ===========================================================
> > 20B	64MB	45.014989	44.402432	1.3%
> > 20B	128MB	44.900317	46.146447	-2.7%
> > 20B	256MB	45.303223	48.193623	-6.3%
> > 20B	512MB	45.615301	44.486872	2.2%
> []
> > I'm still left thinking we should just abandon the prefetch at this point and
> > keep the perf code until we have new instructions to help us with this further,
> > unless you see something I don't.
> 
> I tend to agree but perhaps the 3% performance
> increase with a prefetch for longer lengths is
> actually significant and desirable.
> 
Maybe, but I worry that it's not going to be consistent.  At least not once you
account for the cost of the extra comparison and jump.

> It doesn't seem you've done the test I suggested
> where prefetch is done only for
> "len > PREFETCH_STRIDE".
> 
No, that's exactly what I did.  I did this:

#define PREFETCH_STRIDE (cache_line_size() * 2)

...

if (len > PREFETCH_STRIDE)
	prefetch(buf + PREFETCH_STRIDE);

while (count64) {
	...
}

> Is it ever useful to do a prefetch of the
> address/data being accessed by the next
> instruction?
> 
Doubtful.  You need to prefetch the data far enough in advance that it's loaded by
the time you need to reference it; otherwise you wind up stalling the data
pipeline while the load completes.  So unless you have really fast memory, the
prefetch is effectively a no-op for the next access.  But the next cacheline
ahead is good, as it prevents the stall there.  Any more than that, though, seems
(from this testing) to again be a no-op, as modern hardware automatically issues
the prefetch once it notices the linear data access pattern.
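
For anyone who wants to poke at this outside the kernel, here is a rough
userspace sketch of the shape I tested: one prefetch, two cache lines ahead,
issued only when the buffer is longer than that.  It is not the x86 do_csum
code.  The names (csum_sketch, csum_reference, csum_sketch_self_check), the
hard-coded 64-byte cache line, and the use of the GCC/Clang
__builtin_prefetch() builtin in place of the kernel's prefetch() are all
assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 64-byte cache lines. The kernel snippet uses cache_line_size(). */
#define SKETCH_CACHE_LINE      64
#define SKETCH_PREFETCH_STRIDE (SKETCH_CACHE_LINE * 2)

/* Fold a 64-bit ones-complement accumulator down to 16 bits. */
static uint16_t sketch_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical helper: ones-complement sum over buf in native byte order. */
uint16_t csum_sketch(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0, word;
	size_t count64 = len / 8;

	/* One prefetch, two cache lines ahead, skipped for short buffers. */
	if (len > SKETCH_PREFETCH_STRIDE)
		__builtin_prefetch(buf + SKETCH_PREFETCH_STRIDE);

	while (count64) {
		memcpy(&word, buf, 8);  /* alignment-safe 64-bit load */
		sum += word;
		if (sum < word)         /* carry out: wrap it back in */
			sum++;
		buf += 8;
		count64--;
	}

	if (len % 8) {                  /* zero-padded tail, 1..7 bytes */
		word = 0;
		memcpy(&word, buf, len % 8);
		sum += word;
		if (sum < word)
			sum++;
	}
	return sketch_fold64(sum);
}

/* Plain 16-bit-at-a-time reference for small buffers, used as a cross-check. */
uint16_t csum_reference(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	uint16_t w;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		memcpy(&w, buf + i, 2);
		sum += w;
	}
	if (len & 1) {                  /* odd trailing byte, zero-padded */
		w = 0;
		memcpy(&w, buf + len - 1, 1);
		sum += w;
	}
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Returns 1 if the sketch and the reference agree on a few lengths. */
int csum_sketch_self_check(void)
{
	static const size_t lens[] = { 0, 1, 20, 129, 1000 };
	unsigned char data[1000];
	size_t i;

	for (i = 0; i < sizeof(data); i++)
		data[i] = (unsigned char)(i * 7 + 3);
	for (i = 0; i < sizeof(lens) / sizeof(lens[0]); i++)
		if (csum_sketch(data, lens[i]) != csum_reference(data, lens[i]))
			return 0;
	return 1;
}
```

The self-check only confirms that the prefetch path leaves the result
unchanged; it says nothing about speed.  Any timing would have to be done
separately, and a 20 byte buffer never reaches the prefetch in this sketch, so
all it pays there is the length comparison.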

> Anyway, thanks for doing all the work.
> 
No worries, glad to do it.  Thanks for the review.

Ingo, what do you think, shall I submit the perf bits as a separate thread, or
do you not want those any more?

Regards
Neil

> Joe
> 
> 