Message-ID: <9d793d9f-0fca-2b0d-2a2e-abd527ffa8d4@nvidia.com>
Date: Tue, 30 May 2023 18:08:21 +0300
From: Shay Drory <shayd@...dia.com>
To: Eli Cohen <elic@...dia.com>, Chuck Lever III <chuck.lever@...cle.com>
CC: Leon Romanovsky <leon@...nel.org>, Saeed Mahameed <saeedm@...dia.com>,
	linux-rdma <linux-rdma@...r.kernel.org>,
	"open list:NETWORKING [GENERAL]" <netdev@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: system hang on start-up (mlx5?)

On 30/05/2023 16:54, Eli Cohen wrote:
>> -----Original Message-----
>> From: Chuck Lever III <chuck.lever@...cle.com>
>> Sent: Tuesday, 30 May 2023 16:51
>> To: Eli Cohen <elic@...dia.com>
>> Cc: Shay Drory <shayd@...dia.com>; Leon Romanovsky <leon@...nel.org>;
>> Saeed Mahameed <saeedm@...dia.com>; linux-rdma <linux-rdma@...r.kernel.org>;
>> open list:NETWORKING [GENERAL] <netdev@...r.kernel.org>;
>> Thomas Gleixner <tglx@...utronix.de>
>> Subject: Re: system hang on start-up (mlx5?)
>>
>>> On May 30, 2023, at 9:48 AM, Eli Cohen <elic@...dia.com> wrote:
>>>
>>>> From: Chuck Lever III <chuck.lever@...cle.com>
>>>> Sent: Tuesday, 30 May 2023 16:28
>>>> To: Eli Cohen <elic@...dia.com>
>>>> Cc: Leon Romanovsky <leon@...nel.org>; Saeed Mahameed <saeedm@...dia.com>;
>>>> linux-rdma <linux-rdma@...r.kernel.org>;
>>>> open list:NETWORKING [GENERAL] <netdev@...r.kernel.org>;
>>>> Thomas Gleixner <tglx@...utronix.de>
>>>> Subject: Re: system hang on start-up (mlx5?)
>>>>
>>>>> On May 30, 2023, at 9:09 AM, Chuck Lever III <chuck.lever@...cle.com> wrote:
>>>>>
>>>>>> On May 29, 2023, at 5:20 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
>>>>>>
>>>>>> On Sat, May 27 2023 at 20:16, Chuck Lever, III wrote:
>>>>>>>> On May 7, 2023, at 1:31 AM, Eli Cohen <elic@...dia.com> wrote:
>>>>>>>
>>>>>>> I can boot the system with mlx5_core deny-listed. I log in, remove
>>>>>>> mlx5_core from the deny list, and then "modprobe mlx5_core" to
>>>>>>> reproduce the issue while the system is running.
>>>>>>>
>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: firmware version: 16.35.2000
>>>>>>> May 27 15:47:45 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=0 af_desc=ffffb6c88493fc90
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefcf0f60 end=236
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_irq_alloc: pool=ffff9a3718e56180 i=1 af_desc=ffffb6c88493fc60
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
>>>>>>> May 27 15:47:46 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efcf0f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efcf0f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a36efd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a36efd30f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc30f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefc70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefc70f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd30f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd30f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffff9a3aefd70f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->alloc_map=ffff9a3aefd70f60 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: matrix_alloc_area: m->scratch_map=ffff9a33801990b0 cm->managed_map=ffffffffb9ef3f80 m->system_map=ffff9a33801990d0 end=236
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>> ###
>>>>>>>
>>>>>>> The fault address is the cm->managed_map for one of the CPUs.
>>>>>>
>>>>>> That does not make any sense at all. The irq matrix is initialized via:
>>>>>>
>>>>>>   irq_alloc_matrix()
>>>>>>     m = kzalloc(sizeof(*m));
>>>>>>     m->maps = alloc_percpu(*m->maps);
>>>>>>
>>>>>> So how is any per CPU map which got allocated there supposed to be
>>>>>> invalid (not mapped):
>>>>>>
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: BUG: unable to handle page fault for address: ffffffffb9ef3f80
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: supervisor read access in kernel mode
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: #PF: error_code(0x0000) - not-present page
>>>>>>> May 27 15:47:47 manet.1015granger.net kernel: PGD 54ec19067 P4D 54ec19067 PUD 54ec1a063 PMD 482b83063 PTE 800ffffab110c062
>>>>>>
>>>>>> But if you look at the address: 0xffffffffb9ef3f80
>>>>>>
>>>>>> That one is bogus:
>>>>>>
>>>>>>   managed_map=ffff9a36efcf0f80
>>>>>>   managed_map=ffff9a36efd30f80
>>>>>>   managed_map=ffff9a3aefc30f80
>>>>>>   managed_map=ffff9a3aefc70f80
>>>>>>   managed_map=ffff9a3aefd30f80
>>>>>>   managed_map=ffff9a3aefd70f80
>>>>>>   managed_map=ffffffffb9ef3f80
>>>>>>
>>>>>> Can you spot the fail?
>>>>>>
>>>>>> The first six are in the direct map and the last one is in the module map,
>>>>>> which makes no sense at all.
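[Editor's note: a minimal sketch of the failure mode Thomas describes,
not code from the thread. It mirrors the per-CPU lookup pattern in
kernel/irq/matrix.c (see irq_matrix_reserve_managed() in the debug patch
at the end of this message); the struct and function names here are
hypothetical. per_cpu_ptr(ptr, cpu) offsets @ptr by __per_cpu_offset[cpu],
so a stray CPU number at or beyond nr_cpu_ids indexes past that array and
yields a wild pointer, which is how a "managed_map" address could land in
the module map instead of the direct map.]

	/* Kernel context assumed: <linux/percpu.h>, <linux/cpumask.h>,
	 * <linux/printk.h>. Illustrative only.
	 */
	struct demo_map {
		unsigned long managed_map;	/* stand-in for struct cpumap */
	};
	static DEFINE_PER_CPU(struct demo_map, demo_maps);

	static void walk_mask(const struct cpumask *msk)
	{
		unsigned int cpu;

		/*
		 * If @msk has a bit set at or beyond nr_cpu_ids (for
		 * example, stale stack bits copied into an affinity
		 * mask), this loop can hand back a bogus CPU number...
		 */
		for_each_cpu(cpu, msk) {
			/*
			 * ...and per_cpu_ptr() then computes an address
			 * outside the per-CPU area, like the
			 * ffffffffb9ef3f80 seen above.
			 */
			struct demo_map *cm = per_cpu_ptr(&demo_maps, cpu);

			pr_info("CPU%03u: %016lx\n", cpu, (unsigned long)cm);
		}
	}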
>>>>> Indeed. The reason for that is that the affinity mask has bits
>>>>> set for CPU IDs that are not present on my system.
>>>>>
>>>>> After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
>>>>> that mask is set up like this:
>>>>>
>>>>> struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
>>>>> {
>>>>> 	struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
>>>>> -	cpumask_var_t req_mask;
>>>>> +	struct irq_affinity_desc af_desc;
>>>>> 	struct mlx5_irq *irq;
>>>>>
>>>>> -	if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
>>>>> -		return ERR_PTR(-ENOMEM);
>>>>> -	cpumask_copy(req_mask, cpu_online_mask);
>>>>> +	cpumask_copy(&af_desc.mask, cpu_online_mask);
>>>>> +	af_desc.is_managed = false;
>>>>
>>>> By the way, why is "is_managed" set to false?
>>>>
>>>> This particular system is a NUMA system, and I'd like to be
>>>> able to set IRQ affinity for the card. Since is_managed is
>>>> set to false, writing to the /proc/irq files fails with EIO.
>>>>
>>> This is a control irq and is used for issuing configuration commands.
>>>
>>> This commit:
>>>
>>> commit c410abbbacb9b378365ba17a30df08b4b9eec64f
>>> Author: Dou Liyang <douliyangs@...il.com>
>>> Date:   Tue Dec 4 23:51:21 2018 +0800
>>>
>>>     genirq/affinity: Add is_managed to struct irq_affinity_desc
>>>
>>> explains why it should not be managed.
>>
>> Understood, but what about the other IRQs? I can't set any
>> of them. All writes to the proc files result in EIO.
>>
> I think @Shay Drory has a fix for that which should go upstream.
> Shay, was it sent?

The fix was sent and merged.
https://lore.kernel.org/all/20230523054242.21596-15-saeed@kernel.org/
>>>>> Which normally works as you would expect. But for some historical
>>>>> reason, I have CONFIG_NR_CPUS=32 on my system, and the
>>>>> cpumask_copy() misbehaves.
>>>>>
>>>>> If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
>>>>> copy, this crash goes away. But mlx5_core crashes during a later
>>>>> part of its init, in cpu_rmap_update(). cpu_rmap_update() does
>>>>> exactly the same thing (for_each_cpu() on an affinity mask created
>>>>> by copying), and crashes in a very similar fashion.
>>>>>
>>>>> If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
>>>>> vanishes entirely, and "modprobe mlx5_core" works as expected.
>>>>>
>>>>> Thus I think the problem is with cpumask_copy() or for_each_cpu()
>>>>> when NR_CPUS is a small value (the default is 8192).
>>>>>
>>>>>> Can you please apply the debug patch below and provide the output?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>         tglx
>>>>>> ---
>>>>>> --- a/kernel/irq/matrix.c
>>>>>> +++ b/kernel/irq/matrix.c
>>>>>> @@ -51,6 +51,7 @@ struct irq_matrix {
>>>>>>  					    unsigned int alloc_end)
>>>>>>  {
>>>>>>  	struct irq_matrix *m;
>>>>>> +	unsigned int cpu;
>>>>>>
>>>>>>  	if (matrix_bits > IRQ_MATRIX_BITS)
>>>>>>  		return NULL;
>>>>>> @@ -68,6 +69,8 @@ struct irq_matrix {
>>>>>>  		kfree(m);
>>>>>>  		return NULL;
>>>>>>  	}
>>>>>> +	for_each_possible_cpu(cpu)
>>>>>> +		pr_info("ALLOC: CPU%03u: %016lx\n", cpu, (unsigned long)per_cpu_ptr(m->maps, cpu));
>>>>>>  	return m;
>>>>>>  }
>>>>>>
>>>>>> @@ -215,6 +218,8 @@ int irq_matrix_reserve_managed(struct ir
>>>>>>  	struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
>>>>>>  	unsigned int bit;
>>>>>>
>>>>>> +	pr_info("RESERVE MANAGED: CPU%03u: %016lx\n", cpu, (unsigned long)cm);
>>>>>> +
>>>>>>  	bit = matrix_alloc_area(m, cm, 1, true);
>>>>>>  	if (bit >= m->alloc_end)
>>>>>>  		goto cleanup;
>>>>>
>>>>> --
>>>>> Chuck Lever
>>>>
>>>> --
>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>
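[Editor's note: Chuck reports above that clearing @af_desc before the
copy makes the first crash go away. Below is a minimal sketch of that
experiment, modelled on the bbac70c74183 hunk quoted earlier in the
thread. It is the diagnostic workaround he describes, not the merged
fix; the elided function body is kept elided, as in the quoted hunk.]

	struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
	{
		struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
		/* Zero-initialize the on-stack descriptor so no stale
		 * stack bits beyond nr_cpu_ids can survive into the
		 * affinity mask after the copy below.
		 */
		struct irq_affinity_desc af_desc = {};
		struct mlx5_irq *irq;

		cpumask_copy(&af_desc.mask, cpu_online_mask);
		af_desc.is_managed = false;
		...
	}

[The "= {}" initializer also zeroes is_managed, so the explicit
assignment that follows is then purely documentary.]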