Message-ID: <CAO7dBbVNv5NWRN6hXeo5rNEixn-ctmTLLn2KAKhEBYvvR+Du2w@mail.gmail.com>
Date: Mon, 25 Mar 2024 10:10:43 +0800
From: Tao Liu <ltao@...hat.com>
To: mrgolin@...zon.com, gal.pressman@...ux.dev, sleybo@...zon.com, 
	jgg@...pe.ca, leon@...nel.org
Cc: kexec@...ts.infradead.org, linux-kernel@...r.kernel.org, 
	linux-rdma@...r.kernel.org
Subject: Implementing .shutdown method for efa module

Hi,

I recently hit a kernel panic that appears to be related to the efa
module while testing kexec -l && kexec -e to switch to a new kernel on
an AWS i4g.16xlarge instance.

Here is the dmesg log:

[    6.379918] systemd[1]: Mounting FUSE Control File System...
[    6.381984] systemd[1]: Mounting Kernel Configuration File System...
[    6.383918] systemd[1]: Starting Apply Kernel Variables...
[    6.385430] systemd[1]: Started Journal Service.
[    6.394221] ACPI: bus type drm_connector registered
[    6.421408] systemd-journald[1263]: Received client request to flush runtime journal.
[    7.262543] efa 0000:00:1b.0: enabling device (0010 -> 0012)
[    7.432420] efa 0000:00:1b.0: Setup irq:191 name:efa-mgmnt@pci:0000:00:1b.0
[    7.435581] efa 0000:00:1b.0 efa_0: IB device registered
[    7.885564] random: crng init done
[    8.139857] XFS (nvme0n1p2): Mounting V5 Filesystem d7003ecc-db6f-4bfb-bf92-60376b6a6563
[    8.265233] XFS (nvme0n1p2): Ending clean mount
[   10.555612] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Red Hat Enterprise Linux 9.4 Beta (Plow)
Kernel 5.14.0-425.el9.aarch64 on an aarch64

ip-10-0-27-226 login: [   29.940381] kexec_core: Starting new kernel
[   30.079279] psci: CPU1 killed (polled 0 ms)
[   30.119222] psci: CPU2 killed (polled 0 ms)
[   30.199293] psci: CPU3 killed (polled 0 ms)
[   30.309214] psci: CPU4 killed (polled 0 ms)
[   30.379221] psci: CPU5 killed (polled 0 ms)
[   30.419210] psci: CPU6 killed (polled 0 ms)
[   30.489207] IRQ 191: no longer affine to CPU7
[   30.489667] psci: CPU7 killed (polled 0 ms)
..snip...
[   33.849123] psci: CPU63 killed (polled 0 ms)
[   33.849943] Bye!
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x413fd0c1]
[    0.000000] Linux version 5.14.0-417.el9.aarch64 (mockbuild@...64-025.build.eng.bos.redhat.com) (gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Thu Feb 1 21:23:03 EST 2024
..snip...
[    1.012692] Freeing unused kernel memory: 6016K
[    2.370947] Checked W+X mappings: passed, no W+X pages found
[    2.370980] Run /init as init process
[    2.370982]   with arguments:
[    2.370983]     /init
[    2.370984]   with environment:
[    2.370984]     HOME=/
[    2.370985]     TERM=linux
[    2.373257] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.373259] CPU: 1 PID: 1 Comm: init Not tainted 5.14.0-417.el9.aarch64 #1
[    2.382240] Hardware name: Amazon EC2 i4g.16xlarge/, BIOS 1.0 11/1/2018
[    2.383814] Call trace:
[    2.384410]  dump_backtrace+0xa8/0x120
[    2.385318]  show_stack+0x1c/0x30
[    2.386124]  dump_stack_lvl+0x74/0x8c
[    2.387011]  dump_stack+0x14/0x24
[    2.387810]  panic+0x158/0x368
[    2.388553]  do_exit+0x3a8/0x3b0
[    2.389333]  do_group_exit+0x38/0xa4
[    2.390195]  get_signal+0x7a4/0x810
[    2.391044]  do_signal+0x1bc/0x260
[    2.391870]  do_notify_resume+0x108/0x210
[    2.392839]  el0_da+0x154/0x160
[    2.393603]  el0t_64_sync_handler+0xdc/0x150
[    2.394628]  el0t_64_sync+0x17c/0x180
[    2.395513] SMP: stopping secondary CPUs
[    2.396483] Kernel Offset: 0x586f04e00000 from 0xffff800008000000
[    2.397934] PHYS_OFFSET: 0x40000000
[    2.398774] CPU features: 0x0,00000101,70020143,10417a0b
[    2.400042] Memory Limit: none
[    2.400783] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

In the dmesg log, the line "[   30.489207] IRQ 191: no longer affine to
CPU7" looks suspicious: IRQ 191 is the one the efa module set up earlier
in the log. After blacklisting the efa module so that it is not loaded
automatically at boot, the kernel panic no longer occurs.
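
The link between the migrated IRQ and its owning driver can also be
checked mechanically. Below is a minimal sketch in Python, assuming the
log uses the same "Setup irq:N" and "IRQ N: no longer affine to CPUx"
message formats seen above; the function names irq_owners and
migrated_irqs are hypothetical helpers, not part of any existing tool.

```python
import re

# Patterns follow the message formats seen in the log above; other kernels
# or drivers may word these lines differently (assumption).
SETUP_RE = re.compile(r"\]\s+(\S+)\s+(\S+?):\s+Setup irq:(\d+)\s")
AFFINE_RE = re.compile(r"IRQ (\d+): no longer affine to CPU(\d+)")


def irq_owners(lines):
    """Map IRQ number -> (driver, device) from 'Setup irq:N' lines."""
    owners = {}
    for line in lines:
        m = SETUP_RE.search(line)
        if m:
            driver, device, irq = m.groups()
            owners[int(irq)] = (driver, device)
    return owners


def migrated_irqs(lines):
    """Return (irq, cpu, driver, device) for each 'no longer affine' line."""
    owners = irq_owners(lines)
    hits = []
    for line in lines:
        m = AFFINE_RE.search(line)
        if m:
            irq, cpu = int(m.group(1)), int(m.group(2))
            driver, device = owners.get(irq, ("unknown", "unknown"))
            hits.append((irq, cpu, driver, device))
    return hits


# Three lines copied from the log above.
SAMPLE = """\
[    7.432420] efa 0000:00:1b.0: Setup irq:191 name:efa-mgmnt@pci:0000:00:1b.0
[   30.419210] psci: CPU6 killed (polled 0 ms)
[   30.489207] IRQ 191: no longer affine to CPU7
""".splitlines()

if __name__ == "__main__":
    for irq, cpu, driver, device in migrated_irqs(SAMPLE):
        print(f"IRQ {irq} ({driver} {device}) was moved off CPU{cpu}")
```

This only shows that the IRQ migrated during CPU offlining belongs to
efa; it does not by itself prove that the device caused the panic.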

It looks to me as though the efa device is not properly shut down during
kexec, so ongoing DMA, interrupts, etc. overwrite memory that the new
kernel is using.

Although the issue was reproduced on the RHEL kernel, the upstream
kernel [1] does not implement the .shutdown method either. Since I'm
not very familiar with the efa driver, could you please implement the
.shutdown method in drivers/infiniband/hw/efa/efa_main.c? Thanks in
advance!

[1]: https://github.com/torvalds/linux/blob/master/drivers/infiniband/hw/efa/efa_main.c#L674

Thanks,
Tao Liu

