Open Source and information security mailing list archives
Date:   Thu, 19 Aug 2021 20:41:29 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     'Thomas Gleixner' <tglx@...utronix.de>,
        'Saeed Mahameed' <saeed@...nel.org>,
        'Leon Romanovsky' <leon@...nel.org>
CC:     "'linux-pci@...r.kernel.org'" <linux-pci@...r.kernel.org>,
        "'netdev@...r.kernel.org'" <netdev@...r.kernel.org>,
        "'x86@...nel.org'" <x86@...nel.org>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

> From: Dexuan Cui
> Sent: Wednesday, August 18, 2021 2:08 PM
> 
> > From: Thomas Gleixner <tglx@...utronix.de>
> > Sent: Wednesday, July 21, 2021 2:17 PM
> > To: Dexuan Cui <decui@...rosoft.com>; Saeed Mahameed
> >
> > On Mon, Jul 19 2021 at 20:33, Dexuan Cui wrote:
> > > This is a bare metal x86-64 host with Intel CPUs. Yes, I believe the
> > > issue is in the IOMMU Interrupt Remapping mechanism rather in the
> > > NIC driver. I just don't understand why bringing the CPUs online and
> > > offline can work around the issue. I'm trying to dump the IOMMU IR
> > > table entries to look for any error.
> >
> > can you please enable GENERIC_IRQ_DEBUGFS and provide the output of
> >
> > cat /sys/kernel/debug/irq/irqs/$THENICIRQS
> >
> > Thanks,
> >
> >         tglx
> 
> Sorry for the late response! I checked the below sys file, and the output is
> exactly the same in the good/bad cases -- in both cases, I use maxcpus=8;
> the only difference in the good case is that I online and then offline CPU 8~31:
> for i in `seq 8 31`;  do echo 1 >  /sys/devices/system/cpu/cpu$i/online; done
> for i in `seq 8 31`;  do echo 0 >  /sys/devices/system/cpu/cpu$i/online; done
> 
> # cat /sys/kernel/debug/irq/irqs/209
> ...

I tried the kernel parameter "intremap=nosid,no_x2apic_optout,nopost", but
it didn't help. Only "intremap=off" works around the no-interrupt issue.
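For reference, these are the boot-time combinations discussed in this thread
(intremap= syntax as documented in Documentation/admin-guide/kernel-parameters.txt;
results as described above):

```
# bad case: mlx5 NIC receives no interrupts
maxcpus=8 intremap=on

# tried, but no effect on the issue
maxcpus=8 intremap=nosid,no_x2apic_optout,nopost

# only setting that avoids the issue
maxcpus=8 intremap=off
```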

When the no-interrupt issue happens, irq 209's effective_affinity_list is 5.
I modified modify_irte() to print irte->low and irte->high, and I also printed
the irte_index for irq 209; they all looked normal to me, and they were
exactly the same in the bad case and the good case. It looks like, with
"intremap=on maxcpus=8", MSI-X targeting CPU5 can't work for the NIC device
(MSI-X on CPU5 works for other devices, e.g. an NVMe controller), and somehow
onlining and then offlining CPUs 8~31 can "fix" the issue, which is really weird.
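In case anyone wants to reproduce the comparison, this is a small sketch of
the state I'm diffing between the good and bad cases (irq number 209 taken
from the earlier mail; assumes GENERIC_IRQ_DEBUGFS is enabled and debugfs is
mounted at /sys/kernel/debug):

```shell
#!/bin/sh
# Dump the per-IRQ state for one interrupt: the affinity masks from
# /proc/irq and the full genirq debugfs record Thomas asked about.
# The IRQ number defaults to 209 (the mlx5 completion IRQ in this report).
IRQ=${1:-209}

for f in /proc/irq/$IRQ/effective_affinity_list \
         /proc/irq/$IRQ/affinity_list \
         /sys/kernel/debug/irq/irqs/$IRQ; do
    if [ -r "$f" ]; then
        echo "== $f =="
        cat "$f"
    else
        # Either the IRQ is not present or debugfs is not mounted.
        echo "== $f (not readable) =="
    fi
done
```

Saving this output in both cases and diffing them is what showed the entries
were identical here.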

Thanks,
Dexuan
