linux-kernel - RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

Message-ID: <C10D3FB0CD45994C8A51FEC1227CE22F27124E99DE@shsmsx502.ccr.corp.intel.com>
Date:	Fri, 1 Jul 2011 18:26:31 +0800
From:	"Ma, Ling" <ling.ma@...el.com>
To:	"Ma, Ling" <ling.ma@...el.com>, Ingo Molnar <mingo@...e.hu>,
	Andi Kleen <andi@...stfloor.org>
CC:	"hpa@...or.com" <hpa@...or.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW
 prefetch

Sorry, the earlier copy_page_c results were incorrect: they came from movsb, not movsq.

Updated results:
(the benchmark is not very accurate, but it can tell us which is faster)

1. We copy 4096 bytes 32 times on SNB and take the minimum execution time.

Hot-cache case:
   Routine                                                 Cycles
   copy_page                                               437
   copy_page_c                                             226
   copy_page_sse2 without prefetch (128-bit write/cycle)   183
   copy_page_sse2 with prefetch (128-bit write/cycle)      208
 

2. The same routine as in the hot-cache case, but before each execution
 we copy 512k of data to push the original data out of L1 & L2.

 Cold-cache case:
   Routine                                                 Cycles
   copy_page (with prefetch)                               688~713
   copy_page (without prefetch)                            847~860
   copy_page_c                                             636~648
   copy_page_sse2 without prefetch (128-bit write/cycle)   661~673
   copy_page_sse2 with prefetch (128-bit write/cycle)      609~615
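
The measurement procedure described in points 1 and 2 can be sketched roughly as below. This is only an illustrative user-space harness, not the benchmark that produced the numbers above: the names min_copy_ns, evict_caches and memcpy_page are hypothetical, plain memcpy stands in for the kernel routines, and it reports nanoseconds from clock_gettime rather than cycles.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BENCH_PAGE_BYTES  4096            /* one page, as in the mail */
#define BENCH_EVICT_BYTES (512 * 1024)    /* 512k copied before each cold run */

typedef void (*copy_fn)(void *dst, const void *src);

/* Buffers have external linkage so the compiler cannot drop the copies. */
_Alignas(64) unsigned char src_page[BENCH_PAGE_BYTES];
_Alignas(64) unsigned char dst_page[BENCH_PAGE_BYTES];
unsigned char evict_src[BENCH_EVICT_BYTES];
unsigned char evict_dst[BENCH_EVICT_BYTES];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Copy 512k of unrelated data to push the page buffers out of L1 and L2. */
static void evict_caches(void)
{
    memcpy(evict_dst, evict_src, BENCH_EVICT_BYTES);
}

/* Stand-in routine under test (assumption: not one of the kernel routines). */
void memcpy_page(void *dst, const void *src)
{
    memcpy(dst, src, BENCH_PAGE_BYTES);
}

/* Run fn `runs` times and return the minimum time of a single copy.
 * cold != 0 evicts the caches before every run (cold-cache case). */
uint64_t min_copy_ns(copy_fn fn, int runs, int cold)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        if (cold)
            evict_caches();
        uint64_t t0 = now_ns();
        fn(dst_page, src_page);
        uint64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling min_copy_ns(memcpy_page, 32, 0) corresponds to the hot-cache case and min_copy_ns(memcpy_page, 32, 1) to the cold-cache case. Taking the minimum rather than the mean filters out interrupts and other noise, which is why a rough benchmark can still rank the routines.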

To answer Ingo's question: copy_page_c is always faster at copying a page,
but copy_page_c doesn't use prefetch for cold-cache cases, and appends prefetch according to copy size.
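
For readers unfamiliar with the copy_page_sse2 variants in the tables, a 128-bit copy loop with optional software prefetch might look roughly like the sketch below. This is an assumption-laden user-space illustration with SSE2 intrinsics, not the code from the patch: the function name copy_page_sse2_sketch, the use_prefetch flag and the 256-byte prefetch distance are all hypothetical, and it falls back to memcpy when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_load_si128, _mm_store_si128, _mm_prefetch */
#endif

#define SKETCH_PAGE_BYTES     4096
#define SKETCH_PREFETCH_AHEAD 256   /* hypothetical distance, not from the patch */

/* Copy one 4096-byte page using 128-bit loads and stores.
 * dst and src must be 16-byte aligned, as page addresses are. */
void copy_page_sse2_sketch(void *dst, const void *src, int use_prefetch)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
#if defined(__SSE2__)
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* Each iteration moves 4 x 16 = 64 bytes, i.e. one cache line. */
    for (size_t i = 0; i < SKETCH_PAGE_BYTES / 16; i += 4) {
        if (use_prefetch) {
            /* Prefetch is only a hint; running past the page end does not fault. */
            _mm_prefetch((const char *)(s + i) + SKETCH_PREFETCH_AHEAD,
                         _MM_HINT_T0);
        }
        __m128i a = _mm_load_si128(s + i);
        __m128i b = _mm_load_si128(s + i + 1);
        __m128i c = _mm_load_si128(s + i + 2);
        __m128i e = _mm_load_si128(s + i + 3);
        _mm_store_si128(d + i, a);
        _mm_store_si128(d + i + 1, b);
        _mm_store_si128(d + i + 2, c);
        _mm_store_si128(d + i + 3, e);
    }
#else
    (void)use_prefetch;
    memcpy(dst, src, SKETCH_PAGE_BYTES);   /* portable fallback */
#endif
}
```

Passing use_prefetch = 0 or 1 mirrors the "without prefetch" and "with prefetch" columns above; whether prefetch helps depends on whether the data is already in cache, which is what the two tables show.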

Thanks
Ling
  



> -----Original Message-----
> From: Ma, Ling
> Sent: Friday, July 01, 2011 4:11 PM
> To: Ma, Ling; 'Ingo Molnar'; 'Andi Kleen'
> Cc: 'hpa@...or.com'; 'tglx@...utronix.de'; 'linux-kernel@...r.kernel.org'
> Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> from HW prefetch
> 
> Forgot to append experiment data:
> 
> 1. We copy 4096 bytes for 32 times on snb, and extract minimum
> execution time
> On hot cache case:
>   Copy_page          copy_page_c
>   482 cycles          350 cycles
> 
> 2. the same routine with hot-caches, but before each execution we copy
> 512k data to push original data out of L1 &L2.
> On cold cache case:
>   copy_page(with prefetch)    copy_page(without prefetch)    copy_page_c
>   853~873 cycles              1037~1051 cycles               959~976 cycles
> 
> Thanks
> Ling
> 
> > -----Original Message-----
> > From: Ma, Ling
> > Sent: Tuesday, June 28, 2011 11:24 PM
> > To: 'Ingo Molnar'; Andi Kleen
> > Cc: hpa@...or.com; tglx@...utronix.de; linux-kernel@...r.kernel.org
> > Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> > from HW prefetch
> >
> > Hi Ingo
> >
> > > Ling, mind double checking which one is the faster/better one on SNB,
> > > in cold-cache and hot-cache situations, copy_page or copy_page_c?
> > Copy_page_c
> > on hot-cache copy_page_c on SNB combines data to 128bit (processor
> > limit 128bit/cycle for write) after startup latency
> > so it is faster than copy_page which provides 64bit/cycle for write.
> >
> > on cold-cache copy_page_c doesn't use prefetch, which uses prefetch
> > according to copy size,
> > so copy_page function is better.
> >
> > Thanks
> > Ling


Download attachment "snb_info" of type "application/octet-stream" (6888 bytes)