netdev - RE: >10% performance degradation since 2.6.18

Date:	Mon, 6 Jul 2009 10:36:11 -0700
From:	"Ma, Chinang" <chinang.ma@...el.com>
To:	Rick Jones <rick.jones2@...com>,
	Herbert Xu <herbert@...dor.apana.org.au>
CC:	Jeff Garzik <jeff@...zik.org>,
	"andi@...stfloor.org" <andi@...stfloor.org>,
	"arjan@...radead.org" <arjan@...radead.org>,
	"matthew@....cx" <matthew@....cx>,
	"jens.axboe@...cle.com" <jens.axboe@...cle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Styner, Douglas W" <douglas.w.styner@...el.com>,
	"Prickett, Terry O" <terry.o.prickett@...el.com>,
	"Wilcox, Matthew R" <matthew.r.wilcox@...el.com>,
	"Eric.Moore@....com" <Eric.Moore@....com>,
	"DL-MPTFusionLinux@....com" <DL-MPTFusionLinux@....com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: RE: >10% performance degradation since 2.6.18

For the OLTP workload we are not pushing much network throughput; lower network latency matters more for OLTP performance. For the original two-socket Nehalem OLTP result in this mail thread, we bound the two NIC interrupts to cpu1 and cpu9 (one NIC per socket). The database processes are divided into two groups, each pinned to one socket, and each process only receives requests from the NIC bound to its socket. This binding scheme gave us a >1% performance boost back in the pre-Nehalem days, and we also see a positive impact on this NHM system.
-Chinang
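
The binding scheme described above can be sketched roughly as follows. This is a minimal illustration, not the scripts used for the OLTP runs: the function names are hypothetical, and the IRQ numbers and the list of CPUs belonging to each socket are hardware-specific and have to be supplied by the caller. Only cpu1 and cpu9 come from the message. The sketch relies on the Linux `/proc/irq/<n>/smp_affinity` interface, which takes a hexadecimal CPU bitmask, and on Python's `os.sched_setaffinity`.

```python
import os

def cpu_mask_hex(cpus):
    """Return the hex CPU bitmask that /proc/irq/<n>/smp_affinity expects."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            # Wider masks need comma-separated 32-bit groups; not handled here.
            raise ValueError("this sketch only handles CPUs 0-31")
        mask |= 1 << cpu
    return format(mask, "x")

def bind_irq(irq, cpus):
    """Steer one interrupt to the given CPUs (requires root)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask_hex(cpus))

def pin_process(pid, cpus):
    """Restrict a process to the given CPUs (pid 0 means the caller)."""
    os.sched_setaffinity(pid, set(cpus))

if __name__ == "__main__":
    # cpu1 and cpu9 are the interrupt CPUs named in the message.
    print(cpu_mask_hex([1]))
    print(cpu_mask_hex([9]))
```

Each database process group would then be passed to `pin_process` with the CPU list of the socket whose NIC interrupt was bound by `bind_irq`.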

>-----Original Message-----
>From: Rick Jones [mailto:rick.jones2@...com]
>Sent: Monday, July 06, 2009 10:00 AM
>To: Herbert Xu
>Cc: Jeff Garzik; andi@...stfloor.org; arjan@...radead.org; matthew@....cx;
>jens.axboe@...cle.com; linux-kernel@...r.kernel.org; Styner, Douglas W;
>Ma, Chinang; Prickett, Terry O; Wilcox, Matthew R; Eric.Moore@....com;
>DL-MPTFusionLinux@....com; netdev@...r.kernel.org
>Subject: Re: >10% performance degradation since 2.6.18
>
>Herbert Xu wrote:
>> Jeff Garzik <jeff@...zik.org> wrote:
>>
>>>What's the best setup for power usage?
>>>What's the best setup for performance?
>>>Are they the same?
>>
>>
>> Yes.
>>
>>
>>>Is it most optimal to have the interrupt for socket $X occur on the same
>>>CPU as where the app is running?
>>
>>
>> Yes.
>
>Well...  Yes, if the goal is lowest service demand/latency, but not always if
>the goal is to have highest throughput.  For example, basic netperf TCP_RR
>between a pair of systems with NIC interrupts pinned to CPU0 for my
>convenience :)
>
>Pin netperf/netserver to CPU0 as well:
>sbs133b15:~ # netperf -H sbs133b16 -t TCP_RR -T 0 -c -C
>TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>sbs133b16.west (10.208.1.50) port 0 AF_INET : first burst 0 : cpu bind
>Local /Remote
>Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>16384  87380  1       1      10.00   16396.22  0.39   0.55   3.846   5.364
>16384  87380
>
>Now pin it to the peer thread in that same core:
>
>sbs133b15:~ # netperf -H sbs133b16 -t TCP_RR -T 8 -c -C
>TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>sbs133b16.west (10.208.1.50) port 0 AF_INET : first burst 0 : cpu bind
>Local /Remote
>Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>16384  87380  1       1      10.00   14078.23  0.67   0.87   7.604   9.863
>16384  87380
>
>Now pin it to another core in that same processor:
>
>sbs133b15:~ # netperf -H sbs133b16 -t TCP_RR -T 2 -c -C
>TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>sbs133b16.west (10.208.1.50) port 0 AF_INET : first burst 0 : cpu bind
>Local /Remote
>Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>16384  87380  1       1      10.00   14649.57  1.76   0.64   19.213  7.036
>16384  87380
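
As a rough cross-check of the S.dem columns in the three runs above, service demand can be read as CPU time consumed per transaction. The sketch below assumes 16 logical CPUs (two quad-core Nehalems with threads enabled, as stated later in this message) and assumes the CPU utilization figure is averaged over all of them. Neither the formula nor the function name is taken from netperf itself.

```python
def service_demand_us(cpu_util_pct, num_cpus, trans_per_sec):
    """Estimate CPU microseconds consumed per transaction."""
    # CPU-seconds burned per wall-clock second across the whole machine.
    cpu_seconds_per_second = cpu_util_pct / 100.0 * num_cpus
    return cpu_seconds_per_second / trans_per_sec * 1e6

# (local CPU %, transactions/s, reported local us/Tr) for -T 0, -T 8 and -T 2
runs = [
    (0.39, 16396.22, 3.846),
    (0.67, 14078.23, 7.604),
    (1.76, 14649.57, 19.213),
]

for util, rate, reported in runs:
    estimate = service_demand_us(util, 16, rate)
    print(f"estimated {estimate:.3f} us/Tr, reported {reported} us/Tr")
```

Under these assumptions the estimates land close to the reported values; the small differences are consistent with the utilization being printed to only two decimal places.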
>
>Certainly seems to support "run on the same core as interrupts."  Now, though,
>let's look at bulk throughput:
>
>sbs133b15:~ # netperf -H sbs133b16 -T 0 -c -C
>TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to sbs133b16.west
>(10.208.1.50) port 0 AF_INET : cpu bind
>Recv   Send    Send                          Utilization       Service Demand
>Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
>Size   Size    Size     Time     Throughput  local    remote   local   remote
>bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  16384  16384    10.00      9384.11   3.39     2.19     0.474   0.306
>
>In this case, I'm running on Nehalems (two quad-cores with threads enabled), so
>I have enough "oomph" to hit link-rate on a classic throughput test, so all
>these next two will show is the CPU hit and some of the run-to-run variability:
>
>sbs133b15:~ # for t in 8 2; do netperf -P 0 -H sbs133b16 -T $t -c -C -B "bind to core $t"; done
>  87380  16384  16384    10.00      9383.67   4.23     5.21     0.591   0.728
>bind to core 8
>  87380  16384  16384    10.00      9383.12   3.03     5.35     0.423   0.747
>bind to core 2
>
>So apart from the thing on the top of my head what is my point?  Let's look at
>a less conventional but still important case - bulk small packet throughput.
>First, find the limit for a single connection when bound to the interrupt core:
>
>sbs133b15:~ # for b in 0 4 16 64 128 256; do netperf -P 0 -t TCP_RR -T 0 -H sbs133b16 -c -C -B "$b added simultaneous trans" -- -D -b $b; done
>16384  87380  1       1      10.00   16336.52  0.69   0.91   6.715   8.944
>0 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   61324.84  2.23   2.27   5.825   5.910
>4 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   152221.78  2.81   3.49   2.956   3.664
>16 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   291247.72  4.86   5.07   2.670   2.788
>64 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   292257.59  3.99   5.91   2.183   3.236
>128 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   291734.00  5.55   5.32   3.043   2.920
>256 added simultaneous trans
>16384  87380
>
>Now, when bound to the peer thread:
>sbs133b15:~ # for b in 0 4 16 64 128 256; do netperf -P 0 -t TCP_RR -T 8 -H sbs133b16 -c -C -B "$b added simultaneous trans" -- -D -b $b; done
>16384  87380  1       1      10.00   14367.40  0.78   1.75   8.652   19.477
>0 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   54820.22  2.73   4.78   7.956   13.948
>4 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   159305.92  4.61   6.84   4.627   6.874
>16 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   260227.55  6.26   8.36   3.851   5.140
>64 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   256336.50  6.23   8.00   3.891   4.993
>128 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   250543.92  6.24   6.29   3.985   4.014
>256 added simultaneous trans
>16384  87380
>
>Things still don't look good for running on another CPU, but wait :)  Bind to
>another core in the same processor:
>
>sbs133b15:~ # for b in 0 4 16 64 128 256; do netperf -P 0 -t TCP_RR -T 2 -H sbs133b16 -c -C -B "$b added simultaneous trans" -- -D -b $b; done
>16384  87380  1       1      10.00   14697.98  0.89   1.53   9.689   16.700
>0 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   58201.08  2.11   4.21   5.804   11.585
>4 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   158999.50  3.87   6.20   3.899   6.240
>16 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   379243.72  6.24   9.04   2.634   3.815
>64 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   384823.34  6.15   9.50   2.556   3.949
>128 added simultaneous trans
>16384  87380
>16384  87380  1       1      10.00   375001.50  6.07   9.63   2.588   4.109
>256 added simultaneous trans
>16384  87380
>
>When the CPU does not have enough "oomph" for link-rate 10G, then what we see
>above with the aggregate TCP_RR holds true for a plain TCP_STREAM test as well -
>getting the second core involved, while indeed increasing CPU util, also
>provides the additional cycles required to get higher throughput.  So what is
>optimal depends on what one wishes to optimize.
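
To make the trade-off easier to see, the short sketch below compares the burst-mode TCP_RR transaction rates quoted above against the interrupt-core run. The numbers are copied from the netperf output in this message; the dictionary layout and the `pct_change` helper are only illustrative.

```python
# Transactions/s from the three burst-mode TCP_RR runs, keyed by the -b value.
rates = {
    "interrupt core (-T 0)": {0: 16336.52, 4: 61324.84, 16: 152221.78,
                              64: 291247.72, 128: 292257.59, 256: 291734.00},
    "peer thread (-T 8)": {0: 14367.40, 4: 54820.22, 16: 159305.92,
                           64: 260227.55, 128: 256336.50, 256: 250543.92},
    "other core (-T 2)": {0: 14697.98, 4: 58201.08, 16: 158999.50,
                          64: 379243.72, 128: 384823.34, 256: 375001.50},
}

def pct_change(new, base):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0

baseline = rates["interrupt core (-T 0)"]
for name, series in rates.items():
    if series is baseline:
        continue
    for burst in sorted(series):
        delta = pct_change(series[burst], baseline[burst])
        print(f"{name:22s} -b {burst:<3d} {delta:+6.1f}% vs interrupt core")
```

With no added transactions in flight, the other core is about 10% slower than the interrupt core, while at 64 or more it is roughly 30% faster, which is the point being made: latency favors staying on the interrupt core, aggregate rate favors bringing in a second core.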
>
>>
>>>If yes, how to best handle when the scheduler moves app to another CPU?
>>>Should we reprogram the NIC hardware flow steering mechanism at that
>>>point?
>>
>>
>> Not really.  For now the best thing to do is to pin everything
>> down and not move at all, because we can't afford to move.
>>
>> The only way for moving to work is if we had the ability to get
>> the sockets to follow the processes.  That means, we must have
>> one RX queue per socket.
>
>Well, or assign sockets to per-core RX queues and be able to move them around.
>If it weren't for all the smarts in the NICs getting in the way :), we'd
>probably do the "lookup where the socket was last accessed and run there" thing
>somewhere in the inbound path a la TOPS.
>
>rick jones
>
>>
>> Cheers,
