Message-ID: <CAE4VaGBQnMp953tsv13s=CiaaiW+EZNuvh6dCuRA7MWbyU_Hsw@mail.gmail.com>
Date: Wed, 18 Jun 2025 13:34:11 +0200
From: Jirka Hladky <jhladky@...hat.com>
To: Abhigyan ghosh <zscript.team.zs@...il.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [BUG] Kernel panic in __migrate_swap_task() on 6.16-rc2 (NULL
pointer dereference)
Hi Abhigyan,
The testing is done on bare metal. The kernel panics occur after
several hours of benchmarking.
Out of 20 servers, the problem has occurred on 6 of them:
intel-sapphire-rapids-gold-6448y-2s
intel-emerald-rapids-platinum-8558-2s
amd-epyc5-turin-9655p-1s
amd-epyc4-zen4c-bergamo-9754-1s
amd-epyc3-milan-7713-2s
intel-skylake-2s
The number in the name is the CPU model. 1s: single socket, 2s: dual socket.
We were not able to find a clear pattern. It appears to be a race
condition of some kind.
We run various performance benchmarks, including Linpack, Stream, NAS
(https://www.nas.nasa.gov/software/npb.html), and Stress-ng, with
various thread counts and settings. The full suite runs for about 24
hours; a single benchmark takes about 4 hours. Please also note that
we repeat each benchmark to collect performance statistics. In many
cases, the kernel panic occurred during one of the repeated runs.
The crashes occurred while running these tests:
Stress_ng: Starting test 'fork' (#29 out of 41), number of threads 32,
    iteration 1 out of 5
SPECjbb2005: Starting DEFAULT run with 4 SPECJBB2005 instances, each
    with 24 warehouses, iteration 2 out of 3
Stress_ng: test 'sem' (#30 out of 41), number of threads 24,
    iteration 2 out of 5
Stress_ng: test 'sem' (#30 out of 41), number of threads 64,
    iteration 4 out of 5
SPECjbb2005: SINGLE run with 1 SPECJBB2005 instances, each with 128
    warehouses, iteration 2 out of 3
Linpack: Benchmark-utils/linpackd, iteration 3, testType affinityRun,
    number of threads 128
NAS: NPB_sources/bin/is.D.x
No single benchmark clearly triggers the kernel panic. However,
looping Stress_ng's sem test looks worth trying, since it accounts
for two of the crashes listed above.
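
A rough sketch of such a loop, in case it helps with reproduction. It
is not the harness we use: the function names, iteration count, and
timeout are made up for illustration, and it assumes stress-ng is
installed and in PATH. Only the thread counts (24 and 64) come from
the crashes listed above.

```python
import subprocess


def sem_command(workers, timeout_s):
    """Build a stress-ng command line for the POSIX semaphore stressor."""
    return ["stress-ng", "--sem", str(workers), "--timeout", f"{timeout_s}s"]


def loop_sem(worker_counts, iterations, timeout_s, run=subprocess.run):
    """Run the sem stressor repeatedly and return the commands that ran.

    `run` is injectable so the loop can be dry-run without stress-ng.
    """
    executed = []
    for i in range(iterations):
        for workers in worker_counts:
            cmd = sem_command(workers, timeout_s)
            # Flush so the last line survives in the console log if the
            # machine panics mid-run.
            print(f"iteration {i + 1}/{iterations}: {' '.join(cmd)}", flush=True)
            run(cmd, check=False)
            executed.append(cmd)
    return executed
```

For example, loop_sem((24, 64), iterations=100, timeout_s=60) would
alternate between the two thread counts. Since the panics only showed
up after several hours, a much longer loop may be needed.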
I hope this helps. Please let me know if there's anything I can help
with to pinpoint the problem.
Thanks
Jirka
On Wed, Jun 18, 2025 at 7:19 AM Abhigyan ghosh
<zscript.team.zs@...il.com> wrote:
>
> Hi Jirka,
>
> Thanks for the detailed report.
>
> I'm curious about the specific setup in which this panic was triggered. Could you share more about the exact configuration or parameters you used for running `stress-ng` or Linpack? For instance:
>
> - How many threads/cores were used?
> - Was it running inside a VM, container, or bare-metal?
> - Was this under any thermal throttling or power-saving mode?
>
> I'd like to try reproducing it locally to study the failure further.
>
> Best regards,
> Abhigyan Ghosh
>
> On 18 June 2025 1:35:30 am IST, Jirka Hladky <jhladky@...hat.com> wrote:
> >Hi all,
> >
> >I’ve encountered a reproducible kernel panic on 6.16-rc1 and 6.16-rc2
> >involving a NULL pointer dereference in `__migrate_swap_task()` during
> >CPU migration. This occurred on various AMD and Intel systems while
> >running a CPU-intensive workload (Linpack, Stress_ng - it's not
> >specific to a benchmark).
> >
> >Full trace below:
> >---
> >BUG: kernel NULL pointer dereference, address: 00000000000004c8
> >#PF: supervisor read access in kernel mode
> >#PF: error_code(0x0000) - not-present page
> >PGD 4078b99067 P4D 4078b99067 PUD 0
> >Oops: Oops: 0000 [#1] SMP NOPTI
> >CPU: 74 UID: 0 PID: 466 Comm: migration/74 Kdump: loaded Not tainted
> >6.16.0-0.rc2.24.eln149.x86_64 #1 PREEMPT(lazy)
> >Hardware name: GIGABYTE R182-Z91-00/MZ92-FS0-00, BIOS M07 09/03/2021
> >Stopper: multi_cpu_stop+0x0/0x130 <- migrate_swap+0xa7/0x120
> >RIP: 0010:__migrate_swap_task+0x2f/0x170
> >Code: 41 55 4c 63 ee 41 54 55 53 48 89 fb 48 83 87 a0 04 00 00 01 65
> >48 ff 05 e7 14 dd 02 48 8b af 50 0a 00 00 66 90 e8 61 93 07 00 <48> 8b
> >bd c8 04 00 00 e8 85 11 35 00 48 85 c0 74 12 ba 01 00 00 00
> >RSP: 0018:ffffce79cd90bdd0 EFLAGS: 00010002
> >RAX: 0000000000000001 RBX: ffff8e9c7290d1c0 RCX: 0000000000000000
> >RDX: ffff8e9c71e83680 RSI: 000000000000001b RDI: ffff8e9c7290d1c0
> >RBP: 0000000000000000 R08: 00056e36392913e7 R09: 00000000002ab980
> >R10: ffff8eac2fcb13c0 R11: ffff8e9c77997410 R12: ffff8e7c2fcf12c0
> >R13: 000000000000001b R14: ffff8eac71eda944 R15: ffff8eac71eda944
> >FS: 0000000000000000(0000) GS:ffff8eac9db4a000(0000) knlGS:0000000000000000
> >CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >CR2: 00000000000004c8 CR3: 0000003072388003 CR4: 0000000000f70ef0
> >PKRU: 55555554
> >Call Trace:
> > <TASK>
> > migrate_swap_stop+0xe8/0x190
> > multi_cpu_stop+0xf3/0x130
> > ? __pfx_multi_cpu_stop+0x10/0x10
> > cpu_stopper_thread+0x97/0x140
> > ? __pfx_smpboot_thread_fn+0x10/0x10
> > smpboot_thread_fn+0xf3/0x220
> > kthread+0xfc/0x240
> > ? __pfx_kthread+0x10/0x10
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork+0xf0/0x110
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork_asm+0x1a/0x30
> > </TASK>
> >---
> >
> >**Kernel Version:**
> >6.16.0-0.rc2.24.eln149.x86_64 (Fedora rawhide)
> >https://koji.fedoraproject.org/koji/buildinfo?buildID=2732950
> >
> >**Reproducibility:**
> >Happened multiple times during routine CPU-intensive operations. It
> >happens with various benchmarks (Stress_ng, Linpack) after several
> >hours of performance testing. `migration/*` kernel threads hit a NULL
> >dereference in `__migrate_swap_task`.
> >
> >**System Info:**
> >- Platform: GIGABYTE R182-Z91-00 (dual socket EPYC)
> >- BIOS: M07 09/03/2021
> >- Config: Based on Fedora’s debug kernel (`PREEMPT(lazy)`)
> >
> >**Crash Cause (tentative):**
> >NULL dereference at offset `0x4c8` from a task struct pointer in
> >`__migrate_swap_task`. Possibly an uninitialized or freed
> >`task_struct` field.
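
The offset can be cross-checked against the Code: line and the
register dump quoted above. The sketch below hand-decodes only the two
relevant instructions (a plain 64-bit mov with a 32-bit displacement);
it is not a general disassembler, and it makes no claim about which
structure fields live at these offsets.

```python
import struct

# Low eight 64-bit registers, indexed by their ModRM encoding.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load(insn):
    """Decode 'REX.W 8B /r' with disp32, i.e. mov disp32(base), dest."""
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("only plain 64-bit 'mov r64, r/m64' is handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 2 or rm == 4:
        raise ValueError("only the disp32, no-SIB form is handled")
    (disp,) = struct.unpack("<i", insn[3:7])
    return REG64[reg], REG64[rm], disp


# Faulting instruction: the bytes starting at the <48> marker.
dest, base, disp = decode_mov_load(bytes.fromhex("48 8b bd c8 04 00 00"))
# Earlier load in the same Code: line that fills the base register.
dest0, base0, disp0 = decode_mov_load(bytes.fromhex("48 8b af 50 0a 00 00"))

rbp = 0x0  # RBP from the register dump
print(f"mov {disp0:#x}(%{base0}),%{dest0}")
print(f"mov {disp:#x}(%{base}),%{dest}   # faults at {rbp + disp:#x}")
```

This prints a load of RBP from 0xa50(%rdi) followed by the faulting
load from 0x4c8(%rbp). With RBP = 0, the faulting address is 0x4c8,
which matches CR2. Read this way, the NULL value looks like a pointer
loaded from offset 0xa50 of the function's first argument, and 0x4c8
is an offset into whatever that pointer should have referenced.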
> >
> >Please let me know if you’d like me to test a patch or if you need
> >more details.
> >
> >Thanks,
> >Jirka
> >
> >
>
> aghosh
>
--
-Jirka