[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <C34181E7-A515-4BD1-8C38-CB8BCF2D987D@oracle.com>
Date: Tue, 30 May 2023 21:48:32 +0000
From: Chuck Lever III <chuck.lever@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: Eli Cohen <elic@...dia.com>, Leon Romanovsky <leon@...nel.org>,
Saeed
Mahameed <saeedm@...dia.com>,
linux-rdma <linux-rdma@...r.kernel.org>,
"open
list:NETWORKING [GENERAL]" <netdev@...r.kernel.org>
Subject: Re: system hang on start-up (mlx5?)
> On May 30, 2023, at 3:46 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> Chuck!
>
> On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:
>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>>> But if you look at the address: 0xffffffffb9ef3f80
>>>
>>> That one is bogus:
>>>
>>> managed_map=ffff9a36efcf0f80
>>> managed_map=ffff9a36efd30f80
>>> managed_map=ffff9a3aefc30f80
>>> managed_map=ffff9a3aefc70f80
>>> managed_map=ffff9a3aefd30f80
>>> managed_map=ffff9a3aefd70f80
>>> managed_map=ffffffffb9ef3f80
>>>
>>> Can you spot the fail?
>>>
>>> The first six are in the direct map and the last one is in module map,
>>> which makes no sense at all.
>>
>> Indeed. The reason for that is that the affinity mask has bits
>> set for CPU IDs that are not present on my system.
>
> Which I don't buy, but even if so then this should not make
> for_each_cpu() go south. See below.
>
>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>> that mask is set up like this:
>>
>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>> {
>> struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>> - cpumask_var_t req_mask;
>> + struct irq_affinity_desc af_desc;
>
> That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...
>
>> struct mlx5_irq *irq;
>> - if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>> - return ERR_PTR(-ENOMEM);
>> - cpumask_copy(req_mask, cpu_online_mask);
>> + cpumask_copy(&af_desc.mask, cpu_online_mask);
>> + af_desc.is_managed = false;
>>
>> Which normally works as you would expect. But for some historical
>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>> cpumask_copy() misbehaves.
>>
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away. But mlx5_core crashes during a later
>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>> exactly the same thing (for_each_cpu() on an affinity mask created
>> by copying), and crashes in a very similar fashion.
>>
>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>
>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>> when NR_CPUS is a small value (the default is 8192).
>
> I don't buy any of this. Here is why:
>
> cpumask_copy(d, s)
> bitmap_copy(d, s, nbits = 32)
> len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
>
> So it copies as many longs as required to cover nbits, i.e. it copies
> any clobbered bits beyond nbits too. While that looks odd at the first
> glance, that's just an optimization which is harmless.
>
> for_each_cpu() finds the next set bit in a mask and breaks the loop once
> bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.
>
> I just booted a kernel with NR_CPUS=32:
My system has only 12 CPUs. So every bit in your mask represents
a present CPU, but on my system, only 0x00000fff are ever present.
Therefore, on my system, any bit higher than bit 11 in a CPU mask
will reference a CPU that is not present.
> [ 0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
> [ 0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
> ...
> [ 0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1
>
> and added a function which does:
>
> struct cpumask ma, mb;
> int cpu;
>
> memset(&ma, 0xaa, sizeof(ma);
> cpumask_copy(&mb, &ma);
> pr_info("MASKBITS: %016lx\n", mb.bits[0]);
> pr_info("CPUs:");
> for_each_cpu(cpu, &mb)
> pr_cont(" %d", cpu);
> pr_cont("\n");
>
> [ 2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
> [ 2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
>
> and the same exercise with a 0x55 filled source bitmap.
>
> [ 2.167595] smp: MASKBITS: 0x5555555555555555
> [ 2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
>
> So while cpumask_copy copied beyond NR_CPUs bits, for_each_cpu() does
> the right thing simple because of this:
>
> nr_cpu_ids is 32, right?
>
> for_each_cpu(cpu, mask)
> for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
> for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
>
> So if find_next_bit() returns a bit after bit 31 the condition (bit) <
> (size) will terminate the loop, right?
Again, you are assuming more CPUs than there are bits in the mask.
> Also in the case of that driver the copy is _NOT_ copying set bits
> beyond bit 31 simply because the source mask is cpu_online_mask which
> definitely does not have a bit set which is greater than 31. As the copy
> copies longs the resulting mask in af_desc.mask cannot have any bit set
> past bit 31 either.
>
> If the above is not correct, then there is a bigger problem than that
> MLX5 driver crashing.
>
> So this:
>
>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>> copy, this crash goes away.
>
> does not make any sense to me.
>
> Can you please add after the cpumask_copy() in that mlx5 code:
>
> pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
> pr_info("MASKBITS: %016lx\n", af_desc.mask.bits[0]);
Both are 0000 0000 0000 0fff, as expected on a system
where 12 CPUs are present.
> Please print also in irq_matrix_reserve_managed():
>
> - @mask->bits[0]
> - nr_cpu_ids
> - the CPU numbers inside the for_each_cpu() loop
Here's where it gets interesting:
+ pr_info("%s: MASKBITS: %016lx\n", __func__, msk->bits[0]);
+ pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);
[ 70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000
[ 70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 71.260555][ C9] port_module: 1 callbacks suppressed
[ 71.260561][ C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[ 71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8
[ 71.274096][ T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
[ 71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12
[ 71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3
[ 71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4
[ 71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6
[ 71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7
[ 71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10
[ 71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11
[ 71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12
Notice that there are in fact higher bits set than bit 11.
The lowest 16 bits of MASKBITS are bcd8, or in binary:
... 1011 1100 1101 1000
Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
12, matching the CPU IDs from the loop. At bit 12, we fault,
since there is no CPU 12 on the system.
--
Chuck Lever
Powered by blists - more mailing lists