Message-ID: <9d793d9f-0fca-2b0d-2a2e-abd527ffa8d4@nvidia.com>
Date: Tue, 30 May 2023 18:08:21 +0300
From: Shay Drory <shayd@...dia.com>
To: Eli Cohen <elic@...dia.com>, Chuck Lever III <chuck.lever@...cle.com>
CC: Leon Romanovsky <leon@...nel.org>, Saeed Mahameed <saeedm@...dia.com>,
	linux-rdma <linux-rdma@...r.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: system hang on start-up (mlx5?)

On 30/05/2023 16:54, Eli Cohen wrote:
>> -----Original Message-----
>> From: Chuck Lever III <chuck.lever@...cle.com>
>> Sent: Tuesday, 30 May 2023 16:51
>> To: Eli Cohen <elic@...dia.com>
>> Cc: Shay Drory <shayd@...dia.com>; Leon Romanovsky <leon@...nel.org>;
>> Saeed Mahameed <saeedm@...dia.com>; linux-rdma <linux-rdma@...r.kernel.org>;
>> open list:NETWORKING [GENERAL] <netdev@...r.kernel.org>;
>> Thomas Gleixner <tglx@...utronix.de>
>> Subject: Re: system hang on start-up (mlx5?)
>>
>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@...dia.com> wrote:
>>>
>>>> From: Chuck Lever III <chuck.lever@...cle.com>
>>>> Sent: Tuesday, 30 May 2023 16:28
>>>> To: Eli Cohen <elic@...dia.com>
>>>> Cc: Leon Romanovsky <leon@...nel.org>; Saeed Mahameed <saeedm@...dia.com>;
>>>> linux-rdma <linux-rdma@...r.kernel.org>;
>>>> open list:NETWORKING [GENERAL] <netdev@...r.kernel.org>;
>>>> Thomas Gleixner <tglx@...utronix.de>
>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>
>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@...cle.com> wrote:
>>>>>
>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>>>>>>
>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@...dia.com> wrote:
>>>>>>>
>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>> reproduce the issue while the system is running.
>>>>>>>
>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>> ###
>>>>>>>
>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>>
>>>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>>>>
>>>>>>   irq_alloc_matrix()
>>>>>>     m = kzalloc(sizeof(*m));
>>>>>>     m->maps = alloc_percpu(*m->maps);
>>>>>>
>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>> invalid (not mapped):
>>>>>>
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>>
>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>
>>>>>> That one is bogus:
>>>>>>
>>>>>>   managed_map=ffff9a36efcf0f80
>>>>>>   managed_map=ffff9a36efd30f80
>>>>>>   managed_map=ffff9a3aefc30f80
>>>>>>   managed_map=ffff9a3aefc70f80
>>>>>>   managed_map=ffff9a3aefd30f80
>>>>>>   managed_map=ffff9a3aefd70f80
>>>>>>   managed_map=ffffffffb9ef3f80
>>>>>>
>>>>>> Can you spot the fail?
>>>>>>
>>>>>> The first six are in the direct map and the last one is in the module map,
>>>>>> which makes no sense at all.
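[Editor's note: a minimal sketch of the failure mode Thomas describes,
not code from the thread. It mirrors the per-CPU lookup pattern in
kernel/irq/matrix.c (see irq_matrix_reserve_managed() in the debug patch
at the end of this message); the struct and function names here are
hypothetical. per_cpu_ptr(ptr, cpu) offsets @ptr by __per_cpu_offset[cpu],
so a stray CPU number at or beyond nr_cpu_ids indexes past that array and
yields a wild pointer, which is how a "managed_map" address could land in
the module map instead of the direct map.]

	/* Kernel context assumed: <linux/percpu.h>, <linux/cpumask.h>,
	 * <linux/printk.h>. Illustrative only.
	 */
	struct demo_map {
		unsigned long managed_map;	/* stand-in for struct cpumap */
	};
	static DEFINE_PER_CPU(struct demo_map, demo_maps);

	static void walk_mask(const struct cpumask *msk)
	{
		unsigned int cpu;

		/*
		 * If @msk has a bit set at or beyond nr_cpu_ids (for
		 * example, stale stack bits copied into an affinity
		 * mask), this loop can hand back a bogus CPU number...
		 */
		for_each_cpu(cpu, msk) {
			/*
			 * ...and per_cpu_ptr() then computes an address
			 * outside the per-CPU area, like the
			 * ffffffffb9ef3f80 seen above.
			 */
			struct demo_map *cm = per_cpu_ptr(&demo_maps, cpu);

			pr_info("CPU%03u: %016lx\n", cpu, (unsigned long)cm);
		}
	}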
>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>> set for CPU IDs that are not present on my system.
>>>>>
>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>> that mask is set up like this:
>>>>>
>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>> {
>>>>> 	struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>> -	cpumask_var_t req_mask;
>>>>> +	struct irq_affinity_desc af_desc;
>>>>> 	struct mlx5_irq *irq;
>>>>>
>>>>> -	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>> -		return ERR_PTR(-ENOMEM);
>>>>> -	cpumask_copy(req_mask, cpu_online_mask);
>>>>> +	cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>> +	af_desc.is_managed = false;
>>>>
>>>> By the way, why is "is_managed" set to false?
>>>>
>>>> This particular system is a NUMA system, and I'd like to be
>>>> able to set IRQ affinity for the card. Since is_managed is
>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>
>>> This is a control irq and is used for issuing configuration commands.
>>>
>>> This commit:
>>>
>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>> Author: Dou Liyang <douliyangs@...il.com>
>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>
>>>     genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>
>>> explains why it should not be managed.
>>
>> Understood, but what about the other IRQs? I can't set any
>> of them. All writes to the proc files result in EIO.
>>
> I think @Shay Drory has a fix for that which should go upstream.
> Shay, was it sent?

The fix was sent and merged.
https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/
>>>>> Which normally works as you would expect. But for some historical
>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>> cpumask_copy() misbehaves.
>>>>>
>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>>>> by copying), and crashes in a very similar fashion.
>>>>>
>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>
>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>> when NR_CPUS is a small value (the default is 8192).
>>>>>
>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>         tglx
>>>>>> ---
>>>>>> --- a/kernel/irq/matrix.c
>>>>>> +++ b/kernel/irq/matrix.c
>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>>  					    unsigned int alloc_end)
>>>>>>  {
>>>>>>  	struct irq_matrix *m;
>>>>>> +	unsigned int cpu;
>>>>>>
>>>>>>  	if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>>  		return NULL;
>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>>  		kfree(m);
>>>>>>  		return NULL;
>>>>>>  	}
>>>>>> +	for_each_possible_cpu(cpu)
>>>>>> +		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>>>>>>  	return m;
>>>>>>  }
>>>>>>
>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>>  	struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>>  	unsigned int bit;
>>>>>>
>>>>>> +	pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>>>>>> +
>>>>>>  	bit = matrix_alloc_area(m, cm, 1, true);
>>>>>>  	if (bit >= m->alloc_end)
>>>>>>  		goto cleanup;
>>>>>
>>>>> --
>>>>> Chuck Lever
>>>>
>>>> --
>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>
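[Editor's note: Chuck reports above that clearing @af_desc before the
copy makes the first crash go away. Below is a minimal sketch of that
experiment, modelled on the bbac70c74183 hunk quoted earlier in the
thread. It is the diagnostic workaround he describes, not the merged
fix; the elided function body is kept elided, as in the quoted hunk.]

	struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
	{
		struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
		/* Zero-initialize the on-stack descriptor so no stale
		 * stack bits beyond nr_cpu_ids can survive into the
		 * affinity mask after the copy below.
		 */
		struct irq_affinity_desc af_desc = {};
		struct mlx5_irq *irq;

		cpumask_copy(&af_desc.mask, cpu_online_mask);
		af_desc.is_managed = false;
		...
	}

[The "= {}" initializer also zeroes is_managed, so the explicit
assignment that follows is then purely documentary.]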