linux-kernel - Re: [PATCH 0/2] nvmet: support polling task for RDMA and TCP

Message-ID: <f9828eb4-39be-498b-8b90-2cb7ce42d3c7@grimberg.me>
Date: Mon, 1 Jul 2024 11:22:31 +0300
From: Sagi Grimberg <sagi@...mberg.me>
To: hch@....de, kch@...dia.com, linux-nvme@...ts.infradead.org,
 linux-kernel@...r.kernel.org
Cc: ping.gan@...l.com
Subject: Re: [PATCH 0/2] nvmet: support polling task for RDMA and TCP



On 01/07/2024 10:42, Ping Gan wrote:
>> Hey Ping Gan,
>>
>>
>> On 26/06/2024 11:28, Ping Gan wrote:
>>> When running nvmf on SMP platform, current nvme target's RDMA and
>>> TCP use kworker to handle IO. But if there is other high workload
>>> in the system(eg: on kubernetes), the competition between the
>>> kworker and other workload is very radical. And since the kworker
>>> is scheduled by OS randomly, it's difficult to control OS resource
>>> and also tune the performance. If the target supports using a
>>> dedicated polling task to handle IO, it's useful to control OS
>>> resource and gain good performance. So it makes sense to add polling
>>> task in nvmet-rdma and nvmet-tcp modules.
>> This is NOT the way to go here.
>>
>> Both rdma and tcp are driven from workqueue context, which are bound
>> workqueues.
>>
>> So there are two ways to go here:
>> 1. Add generic port cpuset and use that to direct traffic to the
>> appropriate set of cores (i.e. select an appropriate comp_vector for
>> rdma and add an appropriate steering rule for tcp).
>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users
>> to control these UNBOUND workqueues' cpumask via sysfs.
>>
>> (2) will not control interrupts to steer to other workloads cpus, but
>> the handlers may run on a set of dedicated cpus.
>>
>> (1) is a better solution, but harder to implement.
>>
>> You also should look into nvmet-fc as well (and nvmet-loop for that
>> matter).
> hi Sagi Grimberg,
> Thanks for your reply, actually we had tried the first advice you
> suggested, but we found the performance was poor when using spdk
> as initiator.

I suggest that you focus on that instead of what you proposed.
What is the source of your poor performance?

>   You know this patch is not only resolving OS resource
> competition issue, but also the perf issue. We have analyzed if we
> still use workqueue(kworker) as target when initiator is polling
> driver(eg: spdk), then workqueue/kworker target is the bottleneck
> since every nvmf request may have a wait latency from queuing on
> workqueue to begin processing,

That is incorrect, the work context polls the cq until it either drains it
completely, or exhausts a quota of IB_POLL_BUDGET_WORKQUEUE (or
NVMET_TCP_IO_WORK_BUDGET). Not every command gets its own workqueue
queuing delay.
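
To illustrate the point, here is a minimal userspace sketch of a budgeted
work item. It is not the kernel code: toy_cq, toy_work_once, toy_drain and
the POLL_BUDGET value are hypothetical names and numbers made up for the
example, and the "requeue" is modelled as a plain loop. It only shows that
one queuing of the work can process many completions, so the queuing delay
is paid per work run rather than per command.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical budget standing in for IB_POLL_BUDGET_WORKQUEUE or
 * NVMET_TCP_IO_WORK_BUDGET; the value is illustrative only. */
#define POLL_BUDGET 64

/* Toy completion queue: only counts pending completions (assumption). */
struct toy_cq {
	size_t pending;   /* completions waiting to be processed */
	size_t work_runs; /* how many times the work item has run */
};

/* One run of the work item: process completions until the queue is
 * drained or the budget is exhausted. Returns the number processed. */
static size_t toy_work_once(struct toy_cq *cq)
{
	size_t done = 0;

	assert(cq != NULL);
	cq->work_runs++;
	while (cq->pending > 0 && done < POLL_BUDGET) {
		cq->pending--;
		done++;
	}
	return done;
}

/* Model of "requeue while there is more to do": run the work item again
 * until nothing is pending. Returns the total number of work runs. */
static size_t toy_drain(struct toy_cq *cq)
{
	while (cq->pending > 0)
		toy_work_once(cq);
	return cq->work_runs;
}
```

With 200 pending completions and a budget of 64, toy_drain() needs 4 work
runs, i.e. 4 queuing delays for 200 commands in this toy model.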

And what does the spdk initiator have to do with it? I didn't understand...

>   and the latency can be traced by wqlat
> of bcc (https://github.com/iovisor/bcc/blob/master/tools/wqlat.py).
> We think the latency is a disaster for the polling driver data plane,
> right?

If you need a target that polls all the time, you should probably resort
to spdk.
If there is room for optimization in nvmet we'll gladly take it, but
this is not the way to go IMO.

> So we think adding a polling task mode on nvmet side to handle
> IO does really make sense; what's your opinion about this?

I personally think that adding a polling kthread is questionable.
However there is a precedent, io_uring sqthreads. So please look
into what is done there. I don't mind having something like
IB_POLL_IOTASK (or io_task threads in nvmet-tcp) if it's done correctly
(leverages common code).

>   And you
> mentioned we should also look into nvmet-fc, I agree with you.
> However, we currently have no nvmet-fc testbed; if we get one, we
> will do that.

There is fcloop, you should use that to test, same for loop. We don't want
the transports to diverge in functionality.