lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALs-Hst6fmesQy6g=ZxfKFs5W85quWQRcwP_=okBWKqMA3CQsw@mail.gmail.com>
Date:   Wed, 8 Nov 2023 08:47:03 -0800
From:   Evan Green <evan@...osinc.com>
To:     Marek Szyprowski <m.szyprowski@...sung.com>
Cc:     linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
        Paul Walmsley <paul.walmsley@...ive.com>,
        Palmer Dabbelt <palmer@...belt.com>,
        Albert Ou <aou@...s.berkeley.edu>,
        Conor Dooley <conor.dooley@...rochip.com>,
        Andrew Jones <ajones@...tanamicro.com>,
        Heiko Stuebner <heiko@...ech.de>,
        Jisheng Zhang <jszhang@...nel.org>,
        Clément Léger <cleger@...osinc.com>
Subject: Re: [PATCH] riscv: fix potential panic during CPU hot-plug

On Wed, Nov 8, 2023 at 3:29 AM Marek Szyprowski
<m.szyprowski@...sung.com> wrote:
>
> Commit 584ea6564bca ("RISC-V: Probe for unaligned access speed") added a
> test for unaligned access speed. It is being performed when given CPU is
> onlined. Then, another test for misaligned access emulation has been
> added in commit 71c54b3d169d ("riscv: report misaligned accesses
> emulation to hwprobe"). This unaligned access speed doesn't really change
> after the boot, so it is sufficient to do this test only once. This has
> been partially added by commit c20d36cc2a20 ("riscv: don't probe
> unaligned access speed if already done"), but this optimisation works
> only if RISCV_HWPROBE_MISALIGNED_EMULATED is returned by the latter
> check. Otherwise the 'misaligned_access_speed' pcpu varliable is
> overwritten with RISCV_HWPROBE_MISALIGNED_UNKNOWN value, what makes the
> first check to be always performed.
>
> Recently I've noticed that the first check introduced a regression in the
> CPU hot-plug mechanism. This can be observed as a following panic on
> QEmu:
>
> CPU1: off
> cpu1: Ratio of byte access time to unaligned word access is 7.00, unaligned accesses are fast
> CPU1: off
> CPU1: failed to come online
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 215 at kernel/smp.c:437 __flush_smp_call_function_queue+0x90/0x292
> Modules linked in:
> CPU: 0 PID: 215 Comm: bash Not tainted 6.5.0-rc1+ #7524
> Hardware name: riscv-virtio,qemu (DT)
> epc : __flush_smp_call_function_queue+0x90/0x292
>  ra : smpcfd_dying_cpu+0xe/0x1c
> ...
> [<ffffffff800c904a>] __flush_smp_call_function_queue+0x90/0x292
> [<ffffffff800c9ae8>] smpcfd_dying_cpu+0xe/0x1c
> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> [<ffffffff806680e8>] device_online+0x56/0x72
> [<ffffffff80668190>] online_store+0x8c/0xae
> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> [<ffffffff80230460>] sys_write+0xe/0x16
> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> irq event stamp: 12335
> hardirqs last  enabled at (12335): [<ffffffff8007e542>] console_unlock+0x156/0x186
> hardirqs last disabled at (12334): [<ffffffff8007e52c>] console_unlock+0x140/0x186
> softirqs last  enabled at (12154): [<ffffffff80a560d0>] __do_softirq+0x3b0/0x470
> softirqs last disabled at (12147): [<ffffffff80019340>] __irq_exit_rcu+0xa6/0xd0
> ---[ end trace 0000000000000000 ]---
> ------------[ cut here ]------------
> kernel BUG at kernel/irq_work.c:245!
> Kernel BUG [#1]
> Modules linked in:
> CPU: 0 PID: 215 Comm: bash Tainted: G        W          6.5.0-rc1+ #7524
> Hardware name: riscv-virtio,qemu (DT)
> epc : irq_work_run_list+0x30/0x32
>  ra : irq_work_run+0x2a/0x4c
> ...
> [<ffffffff8011a346>] irq_work_run_list+0x30/0x32
> [<ffffffff8011a372>] irq_work_run+0x2a/0x4c
> [<ffffffff800c9aec>] smpcfd_dying_cpu+0x12/0x1c
> [<ffffffff80012b2e>] cpuhp_invoke_callback+0x124/0x322
> [<ffffffff800142f8>] _cpu_up+0x218/0x322
> [<ffffffff8001445e>] cpu_up+0x5c/0x98
> [<ffffffff80014bb8>] cpu_device_up+0x14/0x1c
> [<ffffffff8066d68e>] cpu_subsys_online+0x10/0x18
> [<ffffffff806680e8>] device_online+0x56/0x72
> [<ffffffff80668190>] online_store+0x8c/0xae
> [<ffffffff80662dde>] dev_attr_store+0xe/0x1a
> [<ffffffff802b53bc>] sysfs_kf_write+0x2e/0x4c
> [<ffffffff802b4750>] kernfs_fop_write_iter+0xe8/0x14e
> [<ffffffff80230170>] vfs_write+0x2a4/0x3e0
> [<ffffffff802303e8>] ksys_write+0x5e/0xc8
> [<ffffffff80230460>] sys_write+0xe/0x16
> [<ffffffff80a4a8de>] do_trap_ecall_u+0xe0/0xf4
> [<ffffffff80003e8c>] ret_from_exception+0x0/0x64
> Code: 8526 6084 f0ef f71f fce5 60e2 6442 64a2 6105 8082 (9002) 1101
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> SMP: stopping secondary CPUs
> SMP: failed to stop secondary CPUs 0-1
> ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> To avoid calling those checks in the CPU hot-plug paths again and again,
> simply move the check at the beginning of the check_unaligned_access()
> function and rely on the value determined during the system boot.
>
> Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
> Fixes: c20d36cc2a20 ("riscv: don't probe unaligned access speed if already done")
> Fixes: 71c54b3d169d ("riscv: report misaligned accesses emulation to hwprobe")
> Signed-off-by: Marek Szyprowski <m.szyprowski@...sung.com>

Hi Marek,
Thanks for the patch. I happened to spot that bug too
(check_unaligned_access_emulated() clobbering the per-cpu variable
back to unknown). Since I was rearranging that code to try to run the
speed measurements in parallel, I moved that check around to hopefully
solve the same issue you're reporting. Can you see if this patch also
fixes your issue:
https://lore.kernel.org/lkml/20231106225855.3121724-1-evan@rivosinc.com/
. It's also in Palmer's for-next tree as 55e0bf49a0d0 ("RISC-V: Probe
misaligned access speed in parallel").

-Evan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ