Message-ID: <CAE4VaGAaPjczTYZh8sG47KmAL361cO6dtzfi+F7mufbj5Q+5ag@mail.gmail.com>
Date: Wed, 18 Jun 2025 22:37:31 +0200
From: Jirka Hladky <jhladky@...hat.com>
To: Abhigyan ghosh <zscript.team.zs@...il.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: Kernel panic in __migrate_swap_task() – more questions
Hi Abhigyan,
thank you for looking into this!
> 1. Were you using Fedora's debug kernels (CONFIG_DEBUG_*, CONFIG_KASAN, etc.), or are these closer to production-style stripped builds?
$grep CONFIG_DEBUG_ kernel-6.16.0-0.rc2.24.eln149.x86_64.config | grep =y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_DEBUG_INFO_BTF_MODULES=y
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
CONFIG_DEBUG_WX=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_BOOT_PARAMS=y
$grep CONFIG_KASAN kernel-6.16.0-0.rc2.24.eln149.x86_64.config
# CONFIG_KASAN is not set
Kernel build is here:
https://koji.fedoraproject.org/koji/buildinfo?buildID=2732950
To get the kernel config, download
https://kojipkgs.fedoraproject.org//packages/kernel/6.16.0/0.rc2.24.eln149/x86_64/kernel-core-6.16.0-0.rc2.24.eln149.x86_64.rpm,
unpack it, and check /lib/modules/6.16.0-0.rc2.24.eln149.x86_64/config
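One way to do the unpacking without installing the rpm, assuming
rpm2cpio and cpio are available:

$ rpm2cpio kernel-core-6.16.0-0.rc2.24.eln149.x86_64.rpm | cpio -idm
$ less ./lib/modules/6.16.0-0.rc2.24.eln149.x86_64/config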
> 2. For the crashing systems (especially the EPYC ones), did you observe any particular NUMA layout or memory pressure signs prior to the crash?
No, not really. The tests are running fully automatically, and I don't
see anything unusual in the logs before the kernel panic. Example:
[58447.906402] Tue Jun 17 23:49:12 CEST 2025 Completed in 23s
[58447.930818] sockfd: 1038050.164394 bogo-ops-per-second-real-time
[58448.473983] runtest.sh (545855): drop_caches: 3
[58448.489326] Tue Jun 17 23:49:12 CEST 2025 Starting test 'mmapmany'
(#27 out of 41), number of threads 24, iteration 2 out of 5
[58473.589610] Tue Jun 17 23:49:38 CEST 2025 Completed in 26s
[58473.613499] mmapmany: 904046.369461 bogo-ops-per-second-real-time
[58474.158233] runtest.sh (545855): drop_caches: 3
[58474.173944] Tue Jun 17 23:49:38 CEST 2025 Starting test 'mmap' (#28
out of 41), number of threads 24, iteration 2 out of 5
[58493.524125] restraintd[1960]: *** Current Time: Tue Jun 17 23:49:59
2025 Localwatchdog at: Thu Jun 19 20:49:00 2025
[-- MARK -- Tue Jun 17 21:50:00 2025]
[58497.412206] Tue Jun 17 23:50:01 CEST 2025 Completed in 23s
[58497.459789] mmap: 196.528701 bogo-ops-per-second-real-time
[58498.003368] runtest.sh (545855): drop_caches: 3
[58498.018847] Tue Jun 17 23:50:02 CEST 2025 Starting test 'fork' (#29
out of 41), number of threads 24, iteration 2 out of 5
[58521.139714] Tue Jun 17 23:50:25 CEST 2025 Completed in 23s
[58521.164051] fork: 34719.527382 bogo-ops-per-second-real-time
[58521.717218] runtest.sh (545855): drop_caches: 3
[58521.732624] Tue Jun 17 23:50:26 CEST 2025 Starting test 'sem' (#30
out of 41), number of threads 24, iteration 2 out of 5
[58544.844994] BUG: kernel NULL pointer dereference, address: 00000000000004c8
> 3. You mentioned repetition often triggered it — did you happen to try pinning stress-ng using --taskset or restricting cpusets to see if that changes the outcome?
We are not pinning stress-ng. We run approximately 40 different
stress-ng subtests, each lasting 23 seconds and run with a varying
number of threads. The entire set is iterated 5 times to collect
reliable statistics and to estimate the noise in the results.
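In rough pseudo-shell, the loop is equivalent to something like this
(a simplified sketch, not the actual runtest.sh; the stressor names
are taken from the log above, and the thread count is just an example):

for iter in $(seq 1 5); do
    # ~40 stressors in the real run; thread counts vary per test and server
    for stressor in sockfd mmapmany mmap fork sem; do
        stress-ng --"$stressor" 24 --timeout 23s --metrics-brief
        sync; echo 3 > /proc/sys/vm/drop_caches   # runtest.sh drops caches between tests
    done
done

If pinning turns out to matter, adding something like --taskset 0-15
to the stress-ng line above would be an easy way to check.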
For example, take this test, which leads to a kernel panic:
Starting test 'sem' (#30 out of 41), number of threads 64, iteration 4 out of 5
It ran fine 3 times, and only on the 4th iteration did a kernel
panic occur. This was 16 hours after the tests started. Before
stress-ng, the following tests completed fine on this server:
NAS, Linpack, SPECjbb2005
On other servers, crashes appeared sooner, for example, while running
NAS or SPECjbb2005.
The kernel panic occurs quite rarely, around once in 20 hours. I know
it might not be easy to reproduce this.
Keeping my fingers crossed!
Jirka
On Wed, Jun 18, 2025 at 6:42 PM Abhigyan ghosh
<zscript.team.zs@...il.com> wrote:
>
> Hello Jirka,
>
> Thank you so much for the detailed breakdown — this helps a lot.
>
> Just a couple of quick follow-ups to better understand the environment:
>
> 1. Were you using Fedora's debug kernels (CONFIG_DEBUG_*, CONFIG_KASAN, etc.), or are these closer to production-style stripped builds?
>
>
> 2. For the crashing systems (especially the EPYC ones), did you observe any particular NUMA layout or memory pressure signs prior to the crash?
>
>
> 3. You mentioned repetition often triggered it — did you happen to try pinning stress-ng using --taskset or restricting cpusets to see if that changes the outcome?
>
>
>
> I'll try reproducing locally by looping stress-ng --sem under perf to trace any irregularities.
>
> Appreciate your time!
>
> Best regards,
> Abhigyan Ghosh
> zscript.team.zs@...il.com
> zsml.zscript.org
> aghosh
>
--
-Jirka