Message-Id: <20240704103533.68118-1-jacky_gam_2001@163.com>
Date: Thu, 4 Jul 2024 18:35:32 +0800
From: Ping Gan <jacky_gam_2001@....com>
To: sagi@...mberg.me,
hch@....de,
kch@...dia.com,
linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Cc: ping.gan@...l.com
Subject: Re: [PATCH 0/2] nvmet: support polling task for RDMA and TCP
> On 7/4/24 11:10, Ping Gan wrote:
>>> On 02/07/2024 13:02, Ping Gan wrote:
>>>>> On 01/07/2024 10:42, Ping Gan wrote:
>>>>>>> Hey Ping Gan,
>>>>>>>
>>>>>>>
>>>>>>> On 26/06/2024 11:28, Ping Gan wrote:
>>>>>>>> When running nvmf on an SMP platform, the current NVMe target's
>>>>>>>> RDMA and TCP transports use kworkers to handle IO. But if there is
>>>>>>>> other heavy workload on the system (e.g. on kubernetes), the
>>>>>>>> competition between those kworkers and the other workload is
>>>>>>>> fierce. And since kworkers are scheduled by the OS arbitrarily, it
>>>>>>>> is difficult to control OS resources and to tune performance. If
>>>>>>>> the target supported dedicated polling tasks to handle IO, it would
>>>>>>>> be easier to control OS resources and achieve good performance. So
>>>>>>>> it makes sense to add a polling task to the nvmet-rdma and
>>>>>>>> nvmet-tcp modules.
>>>>>>> This is NOT the way to go here.
>>>>>>>
>>>>>>> Both rdma and tcp are driven from workqueue context, using bound
>>>>>>> workqueues.
>>>>>>>
>>>>>>> So there are two ways to go here:
>>>>>>> 1. Add a generic port cpuset and use that to direct traffic to the
>>>>>>>    appropriate set of cores (i.e. select an appropriate comp_vector
>>>>>>>    for rdma and add an appropriate steering rule for tcp).
>>>>>>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow
>>>>>>>    users to control these UNBOUND workqueues' cpumasks via sysfs.
>>>>>>>
>>>>>>> (2) will not control the interrupts, which may still be steered to
>>>>>>> other workloads' cpus, but the handlers may run on a set of
>>>>>>> dedicated cpus.
>>>>>>>
>>>>>>> (1) is a better solution, but harder to implement.
>>>>>>>
>>>>>>> You should also look into nvmet-fc (and nvmet-loop for that
>>>>>>> matter).
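
As I understand option (2), it boils down to something like the rough
sketch below: allocate the transport's I/O workqueue (nvmet_tcp_wq in the
tcp case) as UNBOUND with WQ_SYSFS, so that userspace can tune its cpumask
under /sys/devices/virtual/workqueue/. This is only an illustration of the
workqueue API, not the actual patch; the helper name and the exact flag
combination are assumptions on my side.

#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_tcp_wq;

static int nvmet_tcp_alloc_wq(void)
{
	/*
	 * WQ_UNBOUND lets the workers run on any allowed cpu instead of
	 * the submitting cpu; WQ_SYSFS exposes a writable cpumask file at
	 * /sys/devices/virtual/workqueue/nvmet_tcp_wq/cpumask.
	 */
	nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
				       WQ_UNBOUND | WQ_SYSFS |
				       WQ_HIGHPRI | WQ_MEM_RECLAIM, 0);
	if (!nvmet_tcp_wq)
		return -ENOMEM;
	return 0;
}

After module load the cpumask could then be restricted to a dedicated set
of cores, away from the cores that receive the packet steering.
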
>>>>>> Hi Sagi Grimberg,
>>>>>> Thanks for your reply. Actually we had tried the first approach you
>>>>>> suggested, but we found the performance was poor when using SPDK as
>>>>>> the initiator.
>>>>> I suggest that you focus on that instead of what you proposed.
>>>>> What is the source of your poor performance?
>>>> Before these patches, we had used Linux's RPS to forward the packets
>>>> to a fixed cpu set for nvmet-tcp. But even then we could not eliminate
>>>> the competition between softirq and the workqueue, since the nvme
>>>> target's kworker binds to the socket's cpu, which is taken from the
>>>> skb. Besides that, we found the workqueue's wait latency was very high
>>>> even with polling enabled on nvmet-tcp via the idle_poll_period_usecs
>>>> module parameter. So when the initiator is in polling mode, the
>>>> target's workqueue is the bottleneck. Below is the work items'
>>>> wait-latency trace from a test on our cluster (each node has 4 NUMA
>>>> nodes, 96 cores, 192G memory and one dual-port Mellanox CX4LX
>>>> (25Gbps x 2) ethernet adapter; randrw with 1M IO size), with RPS
>>>> steering to 6 cpu cores. The system's CPU and memory were about 80%
>>>> utilized.
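
To illustrate why the workqueue wait latency still matters even with
polling enabled: my understanding of the idle_poll_period_usecs behaviour
is roughly the sketch below. It is simplified and for illustration only
(the io_queue type, process_queue_once() and io_wq are stand-ins, not the
real nvmet-tcp code); the point is that the work item re-arms itself
through the workqueue, so every polling iteration still pays the queueing
latency seen in the traces.

#include <linux/jiffies.h>
#include <linux/workqueue.h>

struct io_queue {
	struct work_struct	io_work;
	unsigned long		poll_deadline;
};

static struct workqueue_struct *io_wq;
static unsigned int idle_poll_period_usecs = 50;

/* Stand-in for one bounded pass of recv/send processing. */
static bool process_queue_once(struct io_queue *queue);

static void io_work_fn(struct work_struct *w)
{
	struct io_queue *queue = container_of(w, struct io_queue, io_work);
	bool pending = process_queue_once(queue);

	/* Push the idle deadline out whenever there was activity. */
	if (pending)
		queue->poll_deadline = jiffies +
			usecs_to_jiffies(idle_poll_period_usecs);

	/* Re-arm until the queue has been idle for the whole period. */
	if (pending || time_before(jiffies, queue->poll_deadline))
		queue_work(io_wq, &queue->io_work);
}
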
>>> I'd try a simple unbound-workqueue case: steer packets to, say, cores
>>> [0-5] and assign the cpumask of the unbound workqueue to cores [6-11].
>> Okay, thanks for the guidance.
>>
>>>> ogden-brown:~ # /usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
>>>> 01:06:59
>>>>      usecs            : count    distribution
>>>>         0 -> 1        : 0        |                              |
>>>>         2 -> 3        : 0        |                              |
>>>>         4 -> 7        : 0        |                              |
>>>>         8 -> 15       : 3        |                              |
>>>>        16 -> 31       : 10       |                              |
>>>>        32 -> 63       : 3        |                              |
>>>>        64 -> 127      : 2        |                              |
>>>>       128 -> 255      : 0        |                              |
>>>>       256 -> 511      : 5        |                              |
>>>>       512 -> 1023     : 12       |                              |
>>>>      1024 -> 2047     : 26       |*                             |
>>>>      2048 -> 4095     : 34       |*                             |
>>>>      4096 -> 8191     : 350      |************                  |
>>>>      8192 -> 16383    : 625      |******************************|
>>>>     16384 -> 32767    : 244      |*********                     |
>>>>     32768 -> 65535    : 39       |*                             |
>>>>
>>>> 01:07:00
>>>>      usecs            : count    distribution
>>>>         0 -> 1        : 1        |                              |
>>>>         2 -> 3        : 0        |                              |
>>>>         4 -> 7        : 4        |                              |
>>>>         8 -> 15       : 3        |                              |
>>>>        16 -> 31       : 8        |                              |
>>>>        32 -> 63       : 10       |                              |
>>>>        64 -> 127      : 3        |                              |
>>>>       128 -> 255      : 6        |                              |
>>>>       256 -> 511      : 8        |                              |
>>>>       512 -> 1023     : 20       |*                             |
>>>>      1024 -> 2047     : 19       |*                             |
>>>>      2048 -> 4095     : 57       |**                            |
>>>>      4096 -> 8191     : 325      |****************              |
>>>>      8192 -> 16383    : 647      |******************************|
>>>>     16384 -> 32767    : 228      |***********                   |
>>>>     32768 -> 65535    : 43       |**                            |
>>>>     65536 -> 131071   : 1        |                              |
>>>>
>>>> And the bandwidth per node is only 3100MB/s. When we used the patch
>>>> and enabled 6 polling tasks, the bandwidth reached 4000MB/s, which is
>>>> a good improvement.
>>> I think you will see similar performance with an unbound workqueue and
>>> RPS.
>> Yes, I modified the nvmet-tcp/nvmet-rdma code to support an unbound
>> workqueue, ran the test under the same conditions as above, and
>> compared the unbound-workqueue results with the polling-mode task. The
>> unbound workqueue performed well: for TCP we got 3850MB/s per node,
>> almost equal to the polling task. We also tested nvmet-rdma and got
>> 5100MB/s per node with the unbound workqueue versus 5600MB/s with the
>> polling task, so the difference seems small. Anyway, your advice was
>> good.
>
> I'm a bit surprised that you see a ~10% delta here. I would look into
> what the root cause of this difference is. If the load is indeed high,
> the overhead of the workqueue management should be negligible. I'm
> assuming you used IB_POLL_UNBOUND_WORKQUEUE?
Yes, we used IB_POLL_UNBOUND_WORKQUEUE to create the ib CQs. And I
observed about 3% CPU usage for the unbound workqueue versus 6% for the
polling task.
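
For reference, the CQ allocation in the unbound test was along these lines
(a simplified sketch with illustrative helper and parameter names, error
handling trimmed; ib_alloc_cq() and IB_POLL_UNBOUND_WORKQUEUE are the real
verbs API):

#include <rdma/ib_verbs.h>

/*
 * Allocate a completion queue whose completions are processed from the
 * shared unbound "ib-comp-unb-wq" workqueue instead of a cpu-bound one.
 */
static struct ib_cq *alloc_unbound_cq(struct ib_device *dev, void *priv,
				      int nr_cqe, int comp_vector)
{
	return ib_alloc_cq(dev, priv, nr_cqe, comp_vector,
			   IB_POLL_UNBOUND_WORKQUEUE);
}
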
>> Do you think we should submit the unbound workqueue patches for
>> nvmet-tcp and nvmet-rdma to upstream nvmet?
>
> For nvmet-tcp, I think there is merit in splitting socket processing
> from the napi context. For nvmet-rdma, I think the only difference is
> whether you have multiple CQs assigned to the same comp_vector.
>
> How many queues do you have in your test?
We used 24 IO queues to the nvmet-rdma target. I think the difference may
also be related to the workqueue's wait latency: we still see wait
latencies of several ms for the unbound workqueue with RDMA, as the trace
log below shows (a comp_vector sketch follows the trace).
ogden-brown:~ # /usr/share/bcc/tools/wqlat -T -w ib-comp-unb-wq 1 3
Tracing work queue request latency time... Hit Ctrl-C to end.
10:09:10
     usecs            : count    distribution
        0 -> 1        : 6        |                              |
        2 -> 3        : 105      |**                            |
        4 -> 7        : 1732     |******************************|
        8 -> 15       : 1597     |******************************|
       16 -> 31       : 526      |************                  |
       32 -> 63       : 543      |************                  |
       64 -> 127      : 950      |*********************         |
      128 -> 255      : 1335     |***************************** |
      256 -> 511      : 1534     |******************************|
      512 -> 1023     : 1039     |***********************       |
     1024 -> 2047     : 592      |*************                 |
     2048 -> 4095     : 112      |**                            |
     4096 -> 8191     : 6        |                              |
10:09:11
     usecs            : count    distribution
        0 -> 1        : 3        |                              |
        2 -> 3        : 62       |*                             |
        4 -> 7        : 1459     |***************************** |
        8 -> 15       : 1869     |******************************|
       16 -> 31       : 612      |*************                 |
       32 -> 63       : 478      |**********                    |
       64 -> 127      : 844      |******************            |
      128 -> 255      : 1123     |************************      |
      256 -> 511      : 1278     |***************************   |
      512 -> 1023     : 1113     |***********************       |
     1024 -> 2047     : 632      |*************                 |
     2048 -> 4095     : 158      |***                           |
     4096 -> 8191     : 18       |                              |
     8192 -> 16383    : 1        |                              |
10:09:12
     usecs            : count    distribution
        0 -> 1        : 1        |                              |
        2 -> 3        : 68       |*                             |
        4 -> 7        : 1399     |***************************   |
        8 -> 15       : 1822     |******************************|
       16 -> 31       : 559      |************                  |
       32 -> 63       : 513      |***********                   |
       64 -> 127      : 906      |*******************           |
      128 -> 255      : 1217     |***********************       |
      256 -> 511      : 1391     |***************************   |
      512 -> 1023     : 1135     |************************      |
     1024 -> 2047     : 569      |************                  |
     2048 -> 4095     : 110      |**                            |
     4096 -> 8191     : 26       |                              |
     8192 -> 16383    : 11       |                              |
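
Regarding the comp_vector point above: one simple way to spread the 24
queues' CQs across the device's completion vectors, instead of letting
them pile up on a single vector, would be something like the sketch below
(illustration only, not the current nvmet-rdma code; the helper name is
made up):

#include <rdma/ib_verbs.h>

/* Keep the admin queue (qid 0) on vector 0, spread the IO queues
 * round-robin over the available completion vectors. */
static int pick_comp_vector(struct ib_device *dev, int qid)
{
	return qid ? qid % dev->num_comp_vectors : 0;
}
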
Thanks,
Ping