Message-ID: <4d1cf2be-23b6-ed43-972e-bdb9f13c772b@intel.com>
Date: Wed, 20 Sep 2017 09:51:12 -0700
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Tom Herbert <tom@...bertland.com>,
Eric Dumazet <eric.dumazet@...il.com>
Cc: Alexander Duyck <alexander.h.duyck@...el.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: [RFC PATCH] net: Introduce a socket option to enable picking tx
queue based on rx queue.
On 9/20/2017 7:18 AM, Tom Herbert wrote:
> On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
>>> On 9/19/2017 5:48 PM, Tom Herbert wrote:
>>>> On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
>>>> <sridhar.samudrala@...el.com> wrote:
>>>>> On 9/12/2017 3:53 PM, Tom Herbert wrote:
>>>>>> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
>>>>>> <sridhar.samudrala@...el.com> wrote:
>>>>>>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>>>>>>>> On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>>>>>>>>> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>>>>>>>>>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>>>>>>>>>
>>>>>>>>>>> Two ints in sock_common for this purpose is quite expensive and the
>>>>>>>>>>> use case for this is limited -- even if an RX->TX queue mapping were
>>>>>>>>>>> introduced to eliminate the queue-pair assumption, this still won't
>>>>>>>>>>> help if the receive and transmit interfaces are different for the
>>>>>>>>>>> connection. I think we really need to see some very compelling
>>>>>>>>>>> results to be able to justify this.
>>>>>>>>> Will try to collect and post some perf data with symmetric queue
>>>>>>>>> configuration.
>>>>> Here is some performance data I collected with a memcached workload over
>>>>> an ixgbe 10Gb NIC, using the mcblaster benchmark.
>>>>> ixgbe is configured with 16 queues, and rx-usecs is set to 1000 for a
>>>>> very low interrupt rate:
>>>>> ethtool -L p1p1 combined 16
>>>>> ethtool -C p1p1 rx-usecs 1000
>>>>> and busy poll is set to 1000 usecs:
>>>>> sysctl net.core.busy_poll = 1000
>>>>>
>>>>> 16 threads, 800K requests/sec
>>>>> =============================
>>>>>                    rtt (min/avg/max) usecs   intr/sec   context switches/sec
>>>>> ---------------------------------------------------------------------------
>>>>> Default            2/182/10641               23391      61163
>>>>> Symmetric Queues   2/50/6311                 20457      32843
>>>>>
>>>>> 32 threads, 800K requests/sec
>>>>> =============================
>>>>>                    rtt (min/avg/max) usecs   intr/sec   context switches/sec
>>>>> ---------------------------------------------------------------------------
>>>>> Default            2/162/6390                32168      69450
>>>>> Symmetric Queues   2/50/3853                 35044      35847
>>>>>
>>>> No idea what the "Default" configuration is. Please report how xps_cpus is
>>>> being set, how many RSS queues there are, what the mapping is between RSS
>>>> queues and CPUs and shared caches, and whether and how threads are pinned.
>>> Default is Linux 4.13 with the settings I listed above.
>>> ethtool -L p1p1 combined 16
>>> ethtool -C p1p1 rx-usecs 1000
>>> sysctl net.core.busy_poll = 1000
>>>
>>> # ethtool -x p1p1
>>> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>>> 0: 0 1 2 3 4 5 6 7
>>> 8: 8 9 10 11 12 13 14 15
>>> 16: 0 1 2 3 4 5 6 7
>>> 24: 8 9 10 11 12 13 14 15
>>> 32: 0 1 2 3 4 5 6 7
>>> 40: 8 9 10 11 12 13 14 15
>>> 48: 0 1 2 3 4 5 6 7
>>> 56: 8 9 10 11 12 13 14 15
>>> 64: 0 1 2 3 4 5 6 7
>>> 72: 8 9 10 11 12 13 14 15
>>> 80: 0 1 2 3 4 5 6 7
>>> 88: 8 9 10 11 12 13 14 15
>>> 96: 0 1 2 3 4 5 6 7
>>> 104: 8 9 10 11 12 13 14 15
>>> 112: 0 1 2 3 4 5 6 7
>>> 120: 8 9 10 11 12 13 14 15
>>>
>>> smp_affinity for the 16 queuepairs
>>> 141 p1p1-TxRx-0 0000,00000001
>>> 142 p1p1-TxRx-1 0000,00000002
>>> 143 p1p1-TxRx-2 0000,00000004
>>> 144 p1p1-TxRx-3 0000,00000008
>>> 145 p1p1-TxRx-4 0000,00000010
>>> 146 p1p1-TxRx-5 0000,00000020
>>> 147 p1p1-TxRx-6 0000,00000040
>>> 148 p1p1-TxRx-7 0000,00000080
>>> 149 p1p1-TxRx-8 0000,00000100
>>> 150 p1p1-TxRx-9 0000,00000200
>>> 151 p1p1-TxRx-10 0000,00000400
>>> 152 p1p1-TxRx-11 0000,00000800
>>> 153 p1p1-TxRx-12 0000,00001000
>>> 154 p1p1-TxRx-13 0000,00002000
>>> 155 p1p1-TxRx-14 0000,00004000
>>> 156 p1p1-TxRx-15 0000,00008000
>>> xps_cpus for the 16 Tx queues
>>> 0000,00000001
>>> 0000,00000002
>>> 0000,00000004
>>> 0000,00000008
>>> 0000,00000010
>>> 0000,00000020
>>> 0000,00000040
>>> 0000,00000080
>>> 0000,00000100
>>> 0000,00000200
>>> 0000,00000400
>>> 0000,00000800
>>> 0000,00001000
>>> 0000,00002000
>>> 0000,00004000
>>> 0000,00008000
>>> memcached threads are not pinned.
>>>
>> ...
>>
>> I urge you to take the time to properly tune this host.
>>
>> The Linux kernel does not do automagic configuration. This is user policy.
>>
>> Documentation/networking/scaling.txt has everything you need.
>>
> Yes, tuning a system for optimal performance is difficult. Even if you
> find a performance benefit for a configuration on one system, that
> might not translate to another. In other words, if you've produced
> some code that seems to perform better than the previous implementation
> on a test machine, it's not enough to be satisfied with that. We want to
> understand _why_ there is a difference. If you can show there are
> intrinsic benefits to the queue-pair model that we can't achieve with
> the existing implementation _and_ can show there are no ill effects in
> other circumstances, then you should have a good case to make changes.
>
> In the case of memcached, threads inevitably migrate off the CPU they
> were created on; the data follows the thread, but the RX queue does not
> change, which means that the receive path crosses CPUs or caches.
> But then, in the queue-pair case, that also means transmit completions
> are crossing CPUs. We don't normally expect that to be a good thing.
> However, transmit completion processing does not happen in the
> critical path, so if that work is being deferred to a less busy CPU
> there may be benefits. That's only a theory; analysis and experimentation
> should be able to get to the root cause.
>
With regards to tuning, I forgot to mention that memcached was updated to
select the thread based on the incoming queue via SO_INCOMING_NAPI_ID and
is started with 16 threads to match the number of RX queues.
If I pin the memcached threads to each of the 16 cores, I do get
performance similar to that of symmetric queues. But this symmetric-queues
configuration is meant to support scenarios where it is not possible to pin
the application's threads.
Thanks
Sridhar