linux-kernel - RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

Message-ID: <BYAPR21MB127077DE03164CA31AE0B33DBFE19@BYAPR21MB1270.namprd21.prod.outlook.com>
Date:   Mon, 19 Jul 2021 20:33:30 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Saeed Mahameed <saeed@...nel.org>,
        Leon Romanovsky <leon@...nel.org>
CC:     "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "'netdev@...r.kernel.org'" <netdev@...r.kernel.org>,
        "'x86@...nel.org'" <x86@...nel.org>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Saeed Mahameed <saeed@...nel.org>
> Sent: Monday, July 19, 2021 1:18 PM
> > > ...
> > > It turns out that adding "intremap=off" can work around the issue!
> > >
> > > The root cause is still not clear yet. I don't know why Windows is
> > > good here.
> >
> > The card is stuck in the FW, maybe Saeed knows why. I tried your
> > scenario and it worked for me.
> >
> > Thanks
> 
> I don't think the FW is stuck since we see the cmd completion after
> timeout, this means that the 1st interrupt from the device got lost.
> 
> "wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]:
> CREATE_EQ(0x301) recovered after timeout"
> 
> the fact that this happens on 5.14 and 5.4 kernels and the issue is
> worked around via bringing the cpus online, or disabling intremap,
> means that there is something wrong with the interrupt remapping
> mechanism, maybe the interrupt is being delivered on an offline cpu?
> is this a qemu/VM guest or a bare metal host?

Thanks for the replies!

This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
issue is in the IOMMU Interrupt Remapping mechanism rather than in the
NIC driver. I just don't understand why bringing the CPUs online and
then offline again works around the issue. I'm trying to dump the
IOMMU IR table entries to look for any errors.
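
In the meantime, a quick way to test the "interrupt delivered to an
offline CPU" theory might be to compare an IRQ's effective affinity
with the set of online CPUs. The following is only a rough sketch, not
something taken from the driver: the helper names are made up for
illustration, the IRQ number has to be supplied by hand (e.g. looked up
in /proc/interrupts), and it assumes the kernel exposes
/proc/irq/<N>/effective_affinity_list, which depends on the kernel
configuration.

```python
import sys


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list (e.g. no CPUs) yields an empty set
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_targets(online_text, affinity_text):
    """Return the CPUs named in affinity_text that are not in online_text."""
    return sorted(parse_cpu_list(affinity_text) - parse_cpu_list(online_text))


def read_text(path):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Hypothetical usage: python3 check_irq.py <irq-number>
    irq = sys.argv[1]
    online = read_text("/sys/devices/system/cpu/online")
    affinity = read_text(f"/proc/irq/{irq}/effective_affinity_list")
    bad = offline_targets(online, affinity)
    if bad:
        print(f"IRQ {irq} targets offline CPU(s): {bad}")
    else:
        print(f"IRQ {irq} targets only online CPUs")
```

An empty result would not rule out a stale IR table entry, since the
values under /proc/irq reflect what the kernel believes it programmed,
not what the IOMMU hardware actually holds. That is why the IR table
dump is still needed.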

Thanks,
Dexuan