Message-ID: <87tuj9guzq.ffs@tglx>
Date:   Sat, 28 Aug 2021 22:44:09 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Dexuan Cui <decui@...rosoft.com>,
        'Saeed Mahameed' <saeed@...nel.org>,
        'Leon Romanovsky' <leon@...nel.org>
Cc:     "'linux-pci@...r.kernel.org'" <linux-pci@...r.kernel.org>,
        "'netdev@...r.kernel.org'" <netdev@...r.kernel.org>,
        "'x86@...nel.org'" <x86@...nel.org>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: RE: [5.14-rc1] mlx5_core receives no interrupts with maxcpus=8

Dexuan,

On Sat, Aug 28 2021 at 01:53, Thomas Gleixner wrote:
> On Thu, Aug 19 2021 at 20:41, Dexuan Cui wrote:
>>> Sorry for the late response! I checked the below sys file, and the output is
>>> exactly the same in the good/bad cases -- in both cases, I use maxcpus=8;
>>> the only difference in the good case is that I online and then offline CPU 8~31:
>>> for i in `seq 8 31`;  do echo 1 >  /sys/devices/system/cpu/cpu$i/online; done
>>> for i in `seq 8 31`;  do echo 0 >  /sys/devices/system/cpu/cpu$i/online; done
>>> 
>>> # cat /sys/kernel/debug/irq/irqs/209

Yes, that looks correct.

>>
>> I tried the kernel parameter "intremap=nosid,no_x2apic_optout,nopost" but
>> it didn't help. Only "intremap=off" can work round the no interrupt issue.
>>
>> When the no interrupt issue happens, irq 209's effective_affinity_list is 5.
>> I modified modify_irte() to print the irte->low, irte->high, and I also printed
>> the irte_index for irq 209, and they were all normal to me, and they were
>> exactly the same in the bad case and the good case -- it looks like, with
>> "intremap=on maxcpus=8", MSI-X on CPU5 can't work for the NIC device
>> (MSI-X on CPU5 works for other devices like a NVMe controller) , and somehow
>> "onlining and then offlining CPU 8~31" can "fix" the issue, which is really weird.

Just for the record: maxcpus=N is a dangerous boot option, as it leaves
the CPUs which were not brought up in a state where they can be hit by
MCE broadcasting without being able to act on it. That means you're
operating the system out of spec.

According to your debug output the interrupt in question belongs to the
INTEL-IR-3 interrupt domain, which means it hangs off IOMMU3, aka DMAR
unit 3.

To which DMAR/remap unit are the other unaffected devices connected?
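
One way to answer that is to look at the same debugfs files for the other
interrupts. The following is only a rough sketch: irq_remap_units is a
made-up helper name, and it assumes that the kernel has
CONFIG_GENERIC_IRQ_DEBUGFS enabled, that debugfs is mounted, and that the
remapping domain shows up in each file as INTEL-IR-<n>, as it does in the
output for irq 209. It prints the irq number and the first such domain
found:

```shell
# Hypothetical helper, not an existing tool: print "<irq> INTEL-IR-<n>" for
# every irq whose debugfs file mentions an interrupt remapping domain.
# The directory defaults to the debugfs location used earlier in this
# thread and can be overridden, e.g. to inspect saved copies of the files.
irq_remap_units() {
    dir=${1:-/sys/kernel/debug/irq/irqs}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # The pattern requires a digit right after "INTEL-IR-", so
        # INTEL-IR-MSI-* style names are skipped and only the remapping
        # domain itself is reported.
        unit=$(grep -oE 'INTEL-IR-[0-9]+' "$f" | head -n 1)
        [ -n "$unit" ] && printf '%s %s\n' "${f##*/}" "$unit"
    done
    return 0
}
```

Run as root without arguments, it walks all irqs. The irq numbers of the
NVMe controller can be taken from /proc/interrupts and then compared with
irq 209.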

Thanks,

        tglx
