lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 19 Jul 2021 20:33:30 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Saeed Mahameed <saeed@...nel.org>,
        Leon Romanovsky <leon@...nel.org>
CC:     "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "'netdev@...r.kernel.org'" <netdev@...r.kernel.org>,
        "'x86@...nel.org'" <x86@...nel.org>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Saeed Mahameed <saeed@...nel.org>
> Sent: Monday, July 19, 2021 1:18 PM
> > > ...
> > > It turns out that adding "intremap=off" can work around the issue!
> > >
> > > The root cause is still not clear yet. I don't know why Windows is
> > > good here.
> >
> > The card is stuck in the FW, maybe Saeed knows why. I tried your
> > scenario and it worked for me.
> >
> > Thanks
> 
> I don't think the FW is stuck since we see the cmd completion after
> timeout, this means that the 1st interrupt from the device got lost.
> 
> "wait_func_handle_exec_timeout:1062:(pid 1416): cmd[0]:
> CREATE_EQ(0x301) recovered after timeout"
> 
> the fact that this happens on  5.14 and 5.4 kernels and the issue is
> worked around via bringing the cpus online, or disabling intremap,
> means that there is something wrong with the interrupt remapping
> mechanism, maybe the interrupt is being delivered on an offline cpu ?
> is this a qemu/VM guest or a bare metal host ?

Thanks for the replies! 

This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
issue is in the IOMMU Interrupt Remapping mechanism rather in the
NIC driver. I just don't understand why bringing the CPUs online and
offline can work around the issue. I'm trying to dump the IOMMU IR
table entries to look for any error. 

Thanks,
Dexuan

Powered by blists - more mailing lists