Message-Id: <5976C351-B2EB-4C25-A2AA-2A64EDD8CCDE@gmail.com>
Date: Tue, 18 Apr 2023 11:29:54 +0800
From: Li Feng <lifeng1519@...il.com>
To: Sagi Grimberg <sagi@...mberg.me>
Cc: Ming Lei <ming.lei@...hat.com>, Keith Busch <kbusch@...nel.org>,
Jens Axboe <axboe@...com>, Christoph Hellwig <hch@....de>,
"open list:NVM EXPRESS DRIVER" <linux-nvme@...ts.infradead.org>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu
affinity
> On Apr 17, 2023, at 9:33 PM, Sagi Grimberg <sagi@...mberg.me> wrote:
>
>
>>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>>> The default worker affinity policy uses all online cpus, e.g. from 0
>>>>> to N-1. However, if some cpus are busy with other jobs, nvme-tcp will
>>>>> show bad performance.
>>>>
>>>> Can you explain in detail how nvme-tcp performs worse in this situation?
>>>>
>>>> If some of the CPUs are known to be busy, you can submit the nvme-tcp
>>>> io jobs on other, non-busy CPUs via taskset, or the scheduler is
>>>> supposed to choose proper CPUs for you. And usually an nvme-tcp device
>>>> should be saturated with limited io depth or jobs/cpus.
>>>>
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> Taskset can’t work on nvme-tcp io-queues, because the worker cpu is decided at the nvme-tcp ‘connect’ stage,
>>> not at the io submission stage. Assume there is only one io-queue: its cpu binding is CPU0, no matter which
>>> cpus the io jobs run on.
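
To make this concrete, the pattern in drivers/nvme/host/tcp.c is roughly the
following (a simplified sketch from memory, not the exact upstream code; n
stands for the zero-based io-queue index):

	/* at 'connect' time: pick one cpu from the online mask based on
	 * the queue index; this choice is never revisited afterwards */
	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);

	/* at submission time: the socket io work is always queued on that
	 * one cpu, no matter which cpu the submitting task runs on */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

So taskset only moves the submitter; it never moves the io work itself.
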
>> OK, it looks like the problem is with queue->io_cpu, see nvme_tcp_queue_request().
>> But I am wondering why nvme-tcp doesn't queue the io work on the current
>> cpu. And why was queue->io_cpu introduced? Given that blk-mq defines cpu
>> affinities for each hw queue, the driver is supposed to submit IO requests
>> to the hardware on the local CPU.
>> Sagi and guys, any ideas about introducing queue->io_cpu?
>
> Hey Ming,
>
> I have some vague memories wrt this, but from what I recall:
>
> - The number of queues is dependent on both the controller and
> the user (Not a reason/motivation on its own, just clarifying).
>
> - It simply matches what pci does (to some extent, outside of rx side
> entropy that may exist); it just happens to take more cpu cycles due to
> the network stack overhead.
>
> - I didn't want io threads to change CPUs because of the RFS/aRFS
> optimizations that people use, which allow the NIC to steer interrupts
> (and napi context) to where the io thread is running, and thus minimize
> latency due to improved locality. That on its own was shown to be worth
> over a 30% reduction.
>
RFS does not work well here. On my aarch64 machine, the NIC irq is handled on a NUMA node 2 CPU,
while the nvme-tcp io-queue stays busy on CPU0.
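
For illustration, restricting the io cpu choice to a user-supplied mask could
look roughly like the sketch below. This is hypothetical code, not the patch
itself; wq_affinity_list, nvme_tcp_wq_cpumask and nvme_tcp_pick_io_cpu are
made-up names:

#include <linux/cpumask.h>
#include <linux/module.h>

/* hypothetical module parameter, e.g. wq_affinity_list=8-15 */
static char *wq_affinity_list;
module_param(wq_affinity_list, charp, 0444);
MODULE_PARM_DESC(wq_affinity_list, "cpulist the tcp io workers may run on");

static struct cpumask nvme_tcp_wq_cpumask;

/* pick the n-th cpu (wrapping) from the allowed mask instead of from
 * cpu_online_mask; fall back to the online mask if parsing fails or the
 * requested cpus are all offline */
static int nvme_tcp_pick_io_cpu(int n)
{
	const struct cpumask *mask = cpu_online_mask;
	int cpu, i = 0;

	if (wq_affinity_list &&
	    !cpulist_parse(wq_affinity_list, &nvme_tcp_wq_cpumask) &&
	    cpumask_and(&nvme_tcp_wq_cpumask, &nvme_tcp_wq_cpumask,
			cpu_online_mask))
		mask = &nvme_tcp_wq_cpumask;

	/* same round-robin idea as today, just over the restricted mask */
	n %= cpumask_weight(mask);
	for_each_cpu(cpu, mask)
		if (i++ == n)
			break;
	return cpu;
}

With e.g. wq_affinity_list=16-31 the io work would then stay off the cpus
that are already busy with other jobs.
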
> - At some point the nvme-tcp rx context used to run in softirq, and having
> to synchronize different cores (on different numa nodes maybe, depending
> on what RSS decided) when processing the socket resulted in high
> latency as well. This is not the case today (due to some nics back then
> that surfaced various issues with this) but it may come back in
> the future (if shown to provide value).
>
> - Also, today when there is a sync network send from the .queue_rq path,
> it is only executed when the running cpu == queue->io_cpu, to avoid high
> contention. My concern is that if the io context is not bound to
> a specific cpu, it may create heavier contention on queue serialization.
> Today there are at most 2 contexts that compete: the io context (triggered
> from network rx or scheduled in the submission path) and the .queue_rq
> sync network send path. I'd prefer not to introduce more contention with
> an increasing number of threads accessing an nvme controller.
>
> Having said that, I don't think there is a fundamental issue with
> using queue_work, queue_work_node(cur_cpu), or
> queue_work_node(netdev_home_cpu), if that does not introduce
> additional latency in the common cases. Although having io threads
> bounce around is going to regress users that rely on RFS/aRFS...
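
To spell out those options in code form (sketch only, none of these are
tested patches):

	/* today: always the cpu chosen at connect time */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

	/* alternative: let the workqueue pick a cpu, typically the local one */
	queue_work(nvme_tcp_wq, &queue->io_work);

	/* alternative: stay on a given node (the submitter's, or the
	 * netdev's home node) without pinning to a single cpu */
	queue_work_node(numa_node_id(), nvme_tcp_wq, &queue->io_work);

The open question is whether the last two keep the RFS/aRFS locality win and
the low-contention send path that the fixed io_cpu gives today.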