Message-ID: <CAHckoCz2ZTiSB6WQ8t5m6A9z6zC3BHid12_QFoPe5VGeq5Wwgw@mail.gmail.com>
Date:   Mon, 17 Apr 2023 16:32:56 +0800
From:   Li Feng <fengli@...rtx.com>
To:     Hannes Reinecke <hare@...e.de>
Cc:     David Laight <David.Laight@...lab.com>,
        Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...com>,
        Christoph Hellwig <hch@....de>,
        Sagi Grimberg <sagi@...mberg.me>,
        "open list:NVM EXPRESS DRIVER" <linux-nvme@...ts.infradead.org>,
        open list <linux-kernel@...r.kernel.org>,
        "lifeng1519@...il.com" <lifeng1519@...il.com>
Subject: Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

On Mon, Apr 17, 2023 at 2:27 PM Hannes Reinecke <hare@...e.de> wrote:
>
> On 4/15/23 23:06, David Laight wrote:
> > From: Li Feng
> >> Sent: 14 April 2023 10:35
> >>>
> >>> On 4/13/23 15:29, Li Feng wrote:
> >>>> The default worker affinity policy is to use all online CPUs, i.e. 0
> >>>> to N-1. However, when some CPUs are busy with other work, nvme-tcp
> >>>> performance suffers.
> >>>>
> >>>> This patch adds a module parameter to set the CPU affinity of the
> >>>> nvme-tcp socket worker threads.  The parameter is a comma-separated
> >>>> list of CPU numbers.  The list is parsed and the resulting cpumask is
> >>>> used to set the affinity of the socket worker threads.  If the list is
> >>>> empty or parsing fails, the default affinity is used.
> >>>>
> > ...
> >>> I am not in favour of this.
> >>> NVMe-over-Fabrics has _virtual_ queues, which really have no
> >>> relationship to the underlying hardware.
> >>> So trying to be clever here by tacking queues to CPUs sort of works if
> >>> you have one subsystem to talk to, but if you have several where each
> >>> exposes a _different_ number of queues you end up with a quite
> >>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
> >>> but there is no guarantee that they do).
> >>
> >> Thanks for your comment.
> >> The current io-queue/CPU mapping is not optimal: it simply assigns
> >> CPUs from 0 to the last one and is not configurable.
> >
> > Module parameters suck, and passing the buck to the user
> > when you can't decide how to do something isn't a good idea either.
> >
> > If the system is busy, pinning threads to CPUs is very hard to
> > get right.
> >
> > It can be better to set the threads to run at the lowest RT
> > priority - so they have priority over all 'normal' threads -
> > and to give them a very sticky (but not fixed) CPU affinity so
> > that such threads tend to get spread out by the scheduler.
> > This all works best if the number of RT threads isn't greater
> > than the number of physical CPUs.
> >
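For reference, a rough sketch of that approach, assuming the socket work ran
in dedicated kthreads instead of the current workqueue workers; the thread
and function names below are made up for illustration, not taken from the
patch:

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int nvme_tcp_io_thread(void *data)
{
	/* ... poll and process socket I/O for one queue ... */
	while (!kthread_should_stop())
		cond_resched();
	return 0;
}

static struct task_struct *nvme_tcp_start_io_thread(void *queue, int idx)
{
	struct task_struct *t;

	t = kthread_create(nvme_tcp_io_thread, queue, "nvme-tcp-io/%d", idx);
	if (IS_ERR(t))
		return t;

	/*
	 * Lowest RT priority: preferred over 'normal' tasks, but with no
	 * hard CPU pinning, so the scheduler is still free to spread
	 * these threads out across CPUs.
	 */
	sched_set_fifo_low(t);
	wake_up_process(t);
	return t;
}
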
> And the problem is that you cannot give an 'optimal' performance metric
> here. With NVMe-over-Fabrics the number of queues is negotiated during
> the initial 'connect' call, and the resulting number of queues strongly
> depends on target preferences (eg a NetApp array will expose only 4
> queues, with Dell/EMC you end up with up to 128 queues).
> And these queues need to be mapped onto the underlying hardware, which
> has its own issues wrt NUMA affinity.
>
> To give you an example:
> Given a setup with a 4 node NUMA machine, one NIC connected to
> one NUMA core, each socket having 24 threads, the NIC exposing up to 32
> interrupts, and connections to a NetApp _and_ an EMC, what exactly
> should the 'best' layout look like?
> And, what _is_ the 'best' layout?
> You cannot satisfy the queue requirements from NetApp _and_ EMC, as you
> only have one NIC, and you cannot change the interrupt affinity for each
> I/O.
>
Not all users have enough NICs to dedicate one per NUMA node; a setup
with only a single NIC is quite common.

There is no single 'best' layout for all cases, so this parameter lets
users select the one that suits their workload.
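
Roughly, the parameter handling I have in mind looks like the sketch below;
the parameter name and helper are only illustrative, not necessarily what
the v2 patch uses:

#include <linux/cpumask.h>
#include <linux/module.h>
#include <linux/moduleparam.h>

static char *sock_cpus;		/* e.g. "0,2,4-7" */
module_param(sock_cpus, charp, 0444);
MODULE_PARM_DESC(sock_cpus, "CPU list used for nvme-tcp socket worker threads");

static struct cpumask sock_cpu_mask;

/* Returns true if a usable mask was parsed, false to keep the default affinity. */
static bool nvme_tcp_parse_sock_cpus(void)
{
	if (!sock_cpus || !*sock_cpus)
		return false;			/* empty list: use default */

	if (cpulist_parse(sock_cpus, &sock_cpu_mask))
		return false;			/* parse error: use default */

	/* keep only CPUs that are actually online */
	cpumask_and(&sock_cpu_mask, &sock_cpu_mask, cpu_online_mask);
	return !cpumask_empty(&sock_cpu_mask);
}

A user would then load the module with something like
'modprobe nvme-tcp sock_cpus=0,2,4-7' (again, the parameter name is just an
example) and the socket work would only be scheduled on CPUs in that mask.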

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@...e.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>
