Message-ID: <1381789127.2045.22.camel@edumazet-glaptop.roam.corp.google.com>
Date: Mon, 14 Oct 2013 15:18:47 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Neil Horman <nhorman@...driver.com>
Cc: Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
sebastien.dugue@...l.net, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel across multiple alu's
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>
> > So, early testing results today. I wrote a test module that allocated a 4k
> > buffer, initialized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop. Results on a 2.4
> > GHz Intel Xeon processor:
> >
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
>
> Impressive, but could you try again with data out of cache ?
So I tried your patch on a GRE tunnel and got the following results on a
single TCP flow (short version: no visible difference).
Using a prefetch 5*64(%[src]) helps more (see the patch at the end).
cpus : model name : Intel Xeon(R) CPU X5660 @ 2.80GHz
Before patch :
lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7651.61   2.51     5.45     0.645   1.399
After patch :
lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7239.78   2.09     5.19     0.569   1.408
Profile on receiver
PerfTop: 1358 irqs/sec kernel:98.5% exact: 0.0% [1000Hz cycles], (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------
19.99% [kernel] [k] csum_partial
7.04% [kernel] [k] copy_user_generic_string
4.92% [bnx2x] [k] bnx2x_rx_int
3.50% [kernel] [k] ipt_do_table
2.86% [kernel] [k] __netif_receive_skb_core
2.35% [kernel] [k] fib_table_lookup
2.19% [kernel] [k] netif_receive_skb
1.87% [kernel] [k] intel_idle
1.65% [kernel] [k] kmem_cache_alloc
1.64% [kernel] [k] ip_rcv
1.51% [kernel] [k] kmem_cache_free
And the attached patch brings much better results:
lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304
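The patch below issues a prefetch 5 cache lines (5*64 bytes) ahead of the data being summed, so the loads overlap with the add chain. The same idea can be sketched in portable C with GCC/Clang's __builtin_prefetch; the function name, the once-per-cache-line stride, and the plain 64-bit add (instead of the carry-chained adc sequence) are my own illustration, not kernel code:

```c
/* Sketch of the prefetch-ahead idea from the patch below: prefetch
 * 5 cache lines (5*64 bytes) past the word currently being summed.
 * Prefetching past the end of the buffer is harmless: prefetch hints
 * never fault. The sum itself is a plain 64-bit add, not the kernel's
 * ones-complement adc chain. */
#include <stddef.h>
#include <stdint.h>

uint64_t sum64_prefetch(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		if ((i & 7) == 0)	/* once per 64-byte cache line */
			__builtin_prefetch((const char *)&p[i] + 5 * 64);
		sum += p[i];
	}
	return sum;
}
```

The distance (5 lines here) is a tuning knob: too short and the data still isn't resident when the adds reach it, too long and the prefetched lines may be evicted before use.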
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
zero = 0;
count64 = count >> 3;
while (count64) {
- asm("addq 0*8(%[src]),%[res]\n\t"
+ asm("prefetch 5*64(%[src])\n\t"
+ "addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
--