Message-ID: <1381789127.2045.22.camel@edumazet-glaptop.roam.corp.google.com>
Date: Mon, 14 Oct 2013 15:18:47 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Neil Horman <nhorman@...driver.com>
Cc: Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
sebastien.dugue@...l.net, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel across multiple alu's
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>
> > So, early testing results today. I wrote a test module that allocated a 4k
> > buffer, initialized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop. Results on a 2.4
> > GHz Intel Xeon processor:
> >
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
>
> Impressive, but could you try again with data out of cache ?
So I tried your patch on a GRE tunnel and got the following results on a
single TCP flow (short version: no visible difference).
Using a prefetch 5*64(%[src]) helps more (see the patch at the end).
cpus : model name : Intel Xeon(R) CPU X5660 @ 2.80GHz
Before patch :
lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7651.61   2.51     5.45     0.645   1.399
After patch :
lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7239.78   2.09     5.19     0.569   1.408
Profile on receiver
PerfTop: 1358 irqs/sec kernel:98.5% exact: 0.0% [1000Hz cycles], (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------
19.99% [kernel] [k] csum_partial
7.04% [kernel] [k] copy_user_generic_string
4.92% [bnx2x] [k] bnx2x_rx_int
3.50% [kernel] [k] ipt_do_table
2.86% [kernel] [k] __netif_receive_skb_core
2.35% [kernel] [k] fib_table_lookup
2.19% [kernel] [k] netif_receive_skb
1.87% [kernel] [k] intel_idle
1.65% [kernel] [k] kmem_cache_alloc
1.64% [kernel] [k] ip_rcv
1.51% [kernel] [k] kmem_cache_free
And the attached patch brings much better results:
lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304
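The patch below issues a prefetch 5 cache lines (5*64 bytes) ahead of the data being summed, so the loads overlap with the add chain. The same idea can be sketched in portable C with GCC/Clang's __builtin_prefetch; the function name, the once-per-cache-line stride, and the plain 64-bit add (instead of the carry-chained adc sequence) are my own illustration, not kernel code:

```c
/* Sketch of the prefetch-ahead idea from the patch below: prefetch
 * 5 cache lines (5*64 bytes) past the word currently being summed.
 * Prefetching past the end of the buffer is harmless: prefetch hints
 * never fault. The sum itself is a plain 64-bit add, not the kernel's
 * ones-complement adc chain. */
#include <stddef.h>
#include <stdint.h>

uint64_t sum64_prefetch(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		if ((i & 7) == 0)	/* once per 64-byte cache line */
			__builtin_prefetch((const char *)&p[i] + 5 * 64);
		sum += p[i];
	}
	return sum;
}
```

The distance (5 lines here) is a tuning knob: too short and the data still isn't resident when the adds reach it, too long and the prefetched lines may be evicted before use.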
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
zero = 0;
count64 = count >> 3;
while (count64) {
- asm("addq 0*8(%[src]),%[res]\n\t"
+ asm("prefetch 5*64(%[src])\n\t"
+ "addq 0*8(%[src]),%[res]\n\t"
"adcq 1*8(%[src]),%[res]\n\t"
"adcq 2*8(%[src]),%[res]\n\t"
"adcq 3*8(%[src]),%[res]\n\t"
--