linux-kernel - Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CAFj5m9+1VBuDV5HFM3-BiS9T5cxvdB5BuaZOTCYs8uTMxuTzqw@mail.gmail.com>
Date:   Thu, 27 Apr 2023 22:36:36 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Hannes Reinecke <hare@...e.de>
Cc:     Li Feng <fengli@...rtx.com>, Sagi Grimberg <sagi@...mberg.me>,
        Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...com>,
        Christoph Hellwig <hch@....de>,
        "open list:NVM EXPRESS DRIVER" <linux-nvme@...ts.infradead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Ming Lei <ming.lei@...hat.com>
Subject: Re: [PATCH] nvme/tcp: Add support to set the tcp worker cpu affinity

On Wed, Apr 26, 2023 at 7:34 PM Hannes Reinecke <hare@...e.de> wrote:
>
> On 4/25/23 10:32, Li Feng wrote:
> > Hi Sagi,
> >
> > On Wed, Apr 19, 2023 at 5:32 PM Sagi Grimberg <sagi@...mberg.me> wrote:
> >>
> >>
> >>>> Hey Li,
> >>>>
> >>>>> The default worker affinity policy is using all online cpus, e.g. from 0
> >>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
> >>>>> have a bad performance.
> >>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
> >>>>> socket worker threads.  The parameter is a comma separated list of CPU
> >>>>> numbers.  The list is parsed and the resulting cpumask is used to set the
> >>>>> affinity of the socket worker threads.  If the list is empty or the
> >>>>> parsing fails, the default affinity is used.
> >>>>
> >>>> I can see how this may benefit a specific set of workloads, but I have a
> >>>> few issues with this.
> >>>>
> >>>> - This is exposing a user interface for something that is really
> >>>> internal to the driver.
> >>>>
> >>>> - This is something that can be misleading and could be tricky to get
> >>>> right, my concern is that this would only benefit a very niche case.
> >>> Our storage products needs this feature~
> >>> If the user doesn’t know what this is, they can keep it default, so I thinks this is
> >>> not unacceptable.
> >>
> >> It doesn't work like that. A user interface is not something exposed to
> >> a specific consumer.
> >>
> >>>> - If the setting should exist, it should not be global.
> >>> V2 has fixed it.
> >>>>
> >>>> - I prefer not to introduce new modparams.
> >>>>
> >>>> - I'd prefer to find a way to support your use-case without introducing
> >>>> a config knob for it.
> >>>>
> >>> I’m looking forward to it.
> >>
> >> If you change queue_work_on to queue_work, ignoring the io_cpu, does it
> >> address your problem?
> > Sorry for the late response, I just got my machine back.
> > Replace the queue_work_on to queue_work, looks like it has a little
> > good performance.
> > The  busy worker is `kworker/56:1H+nvme_tcp_wq`, and fio binds to
> > 90('cpus_allowed=90'),
> > I don't know why the worker 56 is selected.
> > The performance of 256k read up from 1.15GB/s to 1.35GB/s.
> >
> >>
> >> Not saying that this should be a solution though.
> >>
> >> How many queues does your controller support that you happen to use
> >> queue 0 ?
> > Our controller only support one io queue currently.
>
> Ouch.
> Remember, NVMe gets most of the performance improvements by using
> several queues, and be able to bind the queues to cpu sets.
> Exposing just one queue will be invalidating any assumptions we do,
> and trying to improve interrupt steering won't work anyway.
>
> I sincerely doubt we should try to 'optimize' for this rather peculiar
> setup.

Queue isn't free, and it does consume both host and device resources,
especially blk-mq takes static mapping.

Also it may depend on how application uses the nvme/tcp, such as,
io_uring may saturate one device easily in very limited tasks/queues
(one or two or a little more, depends on the device and driver implementation)

Thanks,
Ming