netdev - RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <01cd01d41dce$992f4f30$cb8ded90$@opengridcomputing.com>
Date:   Tue, 17 Jul 2018 08:03:45 -0500
From:   "Steve Wise" <swise@...ngridcomputing.com>
To:     "'Max Gurtovoy'" <maxg@...lanox.com>,
        "'Sagi Grimberg'" <sagi@...mberg.me>,
        "'Leon Romanovsky'" <leon@...nel.org>
Cc:     "'Doug Ledford'" <dledford@...hat.com>,
        "'Jason Gunthorpe'" <jgg@...lanox.com>,
        "'RDMA mailing list'" <linux-rdma@...r.kernel.org>,
        "'Saeed Mahameed'" <saeedm@...lanox.com>,
        "'linux-netdev'" <netdev@...r.kernel.org>
Subject: RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask


> On 7/16/2018 8:08 PM, Steve Wise wrote:
> > Hey Max:
> >
> >
> 
> Hey,
> 
> > On 7/16/2018 11:46 AM, Max Gurtovoy wrote:
> >>
> >>
> >> On 7/16/2018 5:59 PM, Sagi Grimberg wrote:
> >>>
> >>>> Hi,
> >>>> I've tested this patch and seems problematic at this moment.
> >>>
> >>> Problematic how? what are you seeing?
> >>
> >> Connection failures and same error Steve saw:
> >>
> >> [Mon Jul 16 16:19:11 2018] nvme nvme0: Connect command failed, error
> >> wo/DNR bit: -16402
> >> [Mon Jul 16 16:19:11 2018] nvme nvme0: failed to connect queue: 2 ret=-
> 18
> >>
> >>
> >>>
> >>>> maybe this is because of the bug that Steve mentioned in the NVMe
> >>>> mailing list. Sagi mentioned that we should fix it in the NVMe/RDMA
> >>>> initiator and I'll run his suggestion as well.
> >>>
> >>> Is your device irq affinity linear?
> >>
> >> When it's linear and the balancer is stopped the patch works.
> >>
> >>>
> >>>> BTW, when I run the blk_mq_map_queues it works for every irq
> affinity.
> >>>
> >>> But its probably not aligned to the device vector affinity.
> >>
> >> but I guess it's better in some cases.
> >>
> >> I've checked the situation before Leon's patch and set all the vetcors
> >> to CPU 0. In this case (I think that this was the initial report by
> >> Steve), we use the affinity_hint (Israel's and Saeed's patches were we
> >> use dev->priv.irq_info[vector].mask) and it worked fine.
> >>
> >> Steve,
> >> Can you share your configuration (kernel, HCA, affinity map, connect
> >> command, lscpu) ?
> >> I want to repro it in my lab.
> >>
> >
> > - linux-4.18-rc1 + the nvme/nvmet inline_data_size patches + patches to
> > enable ib_get_vector_affinity() in cxgb4 + sagi's patch + leon's mlx5
> > patch so I can change the affinity via procfs.
> 
> ohh, now I understand that you where complaining regarding the affinity
> change reflection to mlx5_ib_get_vector_affinity and not regarding the
> failures on connecting while the affinity overlaps (that is working good
> before Leon's patch).
> So this is a known issue since we used a static hint that never changes
> from dev->priv.irq_info[vector].mask.
> 
> IMO we must fulfil the user wish to connect to N queues and not reduce
> it because of affinity overlaps. So in order to push Leon's patch we
> must also fix the blk_mq_rdma_map_queues to do a best effort mapping
> according the affinity and map the rest in naive way (in that way we
> will *always* map all the queues).

That is what I would expect also.   For example, in my node, where there are
16 cpus, and 2 numa nodes, I observe much better nvmf IOPS performance by
setting up my 16 driver completion event queues such that each is bound to a
node-local cpu.  So I end up with each nodel-local cpu having 2 queues bound
to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(), this
works fine.   I assumed adding ib_get_vector_affinity() would allow this to
all "just work" by default, but I'm running into this connection failure
issue. 

I don't understand exactly what the blk_mq layer is trying to do, but I
assume it has ingress event queues and processing that it trying to align
with the drivers ingress cq event handling, so everybody stays on the same
cpu (or at least node).   But something else is going on.  Is there
documentation on how this works somewhere?

Thanks,

Steve