Message-ID: <CAE4VaGC2ti=91CE6qsDPkJgUa6vxnMzuNGMXUnj+CxW-9OfKdQ@mail.gmail.com>
Date: Fri, 27 Jun 2025 12:48:10 +0200
From: Jirka Hladky <jhladky@...hat.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Abhigyan ghosh <zscript.team.zs@...il.com>, linux-kernel@...r.kernel.org,
Chen Yu <yu.chen.surf@...mail.com>
Subject: Re: [BUG] Kernel panic in __migrate_swap_task() on 6.16-rc2 (NULL
pointer dereference)
Hi Chenyu,

thank you for the patch! I will test it and get back to you next week.

The issue is quite challenging to reproduce, and it indeed points to a
race condition; different benchmarks happen to be running when the
kernel panic hits.

As for stress_ng, we run the subtests like this:

sync; sync; echo 3 > /proc/sys/vm/drop_caches
./stress-ng --fork 24 --verbose --oomable --metrics-brief -t 23 --yaml
$(uname -r)_fork.yaml | tee $(uname -r)_fork.log

We vary the number of threads (in the example above, 24 threads) up to
the maximum number of available CPUs and repeat the tests several
times to record the runtime statistics and variations.

Try to run the test on several servers in parallel to increase the
chances of hitting the problem in a reasonable time.

Thank you
Jirka
On Fri, Jun 27, 2025 at 9:16 AM Chen, Yu C <yu.c.chen@...el.com> wrote:
>
> Hi Jirka,
>
> On 6/27/2025 5:46 AM, Jirka Hladky wrote:
> > Hi Chen and all,
> >
> > we have now verified that the following commit causes the kernel panic
> > discussed in this thread:
> >
> > ad6b26b6a0a79 sched/numa: add statistics of numa balance task
> >
> > Reverting this commit fixes the issue.
> >
> > I'm happy to help debug this further or test a proposed fix.
> >
>
> Thanks very much for your report. It seems there is a race
> condition: a swap task candidate is chosen, but its mm_struct is
> released because the task exits, so later, when the task swapping is
> performed, p->mm is NULL, which causes the problem:
>
> CPU0                                     CPU1
> :
> ...
> task_numa_migrate
>   task_numa_find_cpu
>     task_numa_compare
>       # a normal task p is chosen
>       env->best_task = p
>
>                                          # p exit:
>                                          exit_signals(p);
>                                            p->flags |= PF_EXITING
>                                          exit_mm
>                                            p->mm = NULL;
>
> migrate_swap_stop
>   __migrate_swap_task(arg->src_task, arg->dst_cpu)
>     count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)  # p->mm is NULL
>
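> For illustration only (this is just a sketch of the idea, not the
> debug patch itself): the accounting added by ad6b26b6a0a79 could be
> made to tolerate an exiting task by checking p->mm before the memcg
> update in __migrate_swap_task(), something like:
>
>         /* p->mm may already be NULL once p has passed exit_mm() */
>         if (p->mm)
>                 count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>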
> Could you please help check whether the following debug patch works?
> If no issue shows up after you have run several tests, could you
> please provide the contents of
> /sys/kernel/debug/tracing/trace
>
> BTW, is it possible to share your test scripts for stress-ng and
> stream? In theory, the stress-ng fork test case should trigger this
> issue more easily.
>
> thanks,
> Chenyu
>
>
> > Thank you!
> > Jirka
> >
> > On Wed, Jun 18, 2025 at 1:34 PM Jirka Hladky <jhladky@...hat.com> wrote:
> >>
> >> Hi Abhigyan,
> >>
> >> The testing is done on bare metal. The kernel panics occur after
> >> several hours of benchmarking.
> >>
> >> Out of 20 servers, the problem has occurred on 6 of them:
> >> intel-sapphire-rapids-gold-6448y-2s
> >> intel-emerald-rapids-platinum-8558-2s
> >> amd-epyc5-turin-9655p-1s
> >> amd-epyc4-zen4c-bergamo-9754-1s
> >> amd-epyc3-milan-7713-2s
> >> intel-skylake-2s
> >>
> >> The number in the name is the CPU model. 1s: single socket, 2s: dual socket.
> >>
> >> We were not able to find a clear pattern. It appears to be a race
> >> condition of some kind.
> >>
> >> We run various performance benchmarks, including Linpack, Stream, NAS
> >> (https://www.nas.nasa.gov/software/npb.html), and Stress-ng. Testing
> >> is conducted with various thread counts and settings. All benchmarks
> >> together run for ~24 hours; one benchmark takes ~4 hours. Please
> >> also note that we repeat the benchmarks to collect performance
> >> statistics. In many cases, the kernel panic occurred while a
> >> benchmark was being repeated.
> >>
> >> Crash occurred while running these tests:
> >> Stress_ng: Starting test 'fork' (#29 out of 41), number of threads 32,
> >> iteration 1 out of 5
> >> SPECjbb2005: Starting DEFAULT run with 4 SPECJBB2005 instances, each
> >> with 24 warehouses, iteration 2 out of 3
> >> Stress_ng: test 'sem' (#30 out of 41), number of threads 24, iteration
> >> 2 out of 5
> >> Stress_ng: test 'sem' (#30 out of 41), number of threads 64, iteration
> >> 4 out of 5
> >> SPECjbb2005: SINGLE run with 1 SPECJBB2005 instances, each with 128
> >> warehouses, iteration 2 out of 3
> >> Linpack: Benchmark-utils/linpackd, iteration 3, testType affinityRun,
> >> number of threads 128
> >> NAS: NPB_sources/bin/is.D.x
> >>
> >> No single benchmark clearly triggers the kernel panic. Looping
> >> Stress_ng's sem test does, however, look worth trying.
> >>
> >> I hope this helps. Please let me know if there's anything I can help
> >> with to pinpoint the problem.
> >>
> >> Thanks
> >> Jirka
> >>
> >>
> >> On Wed, Jun 18, 2025 at 7:19 AM Abhigyan ghosh
> >> <zscript.team.zs@...il.com> wrote:
> >>>
> >>> Hi Jirka,
> >>>
> >>> Thanks for the detailed report.
> >>>
> >>> I'm curious about the specific setup in which this panic was triggered. Could you share more about the exact configuration or parameters you used for running `stress-ng` or Linpack? For instance:
> >>>
> >>> - How many threads/cores were used?
> >>> - Was it running inside a VM, container, or bare-metal?
> >>> - Was this under any thermal throttling or power-saving mode?
> >>>
> >>> I'd like to try reproducing it locally to study the failure further.
> >>>
> >>> Best regards,
> >>> Abhigyan Ghosh
> >>>
> >>> On 18 June 2025 1:35:30 am IST, Jirka Hladky <jhladky@...hat.com> wrote:
> >>>> Hi all,
> >>>>
> >>>> I’ve encountered a reproducible kernel panic on 6.16-rc1 and 6.16-rc2
> >>>> involving a NULL pointer dereference in `__migrate_swap_task()` during
> >>>> CPU migration. This occurred on various AMD and Intel systems while
> >>>> running a CPU-intensive workload (Linpack, Stress_ng - it's not
> >>>> specific to a benchmark).
> >>>>
> >>>> Full trace below:
> >>>> ---
> >>>> BUG: kernel NULL pointer dereference, address: 00000000000004c8
> >>>> #PF: supervisor read access in kernel mode
> >>>> #PF: error_code(0x0000) - not-present page
> >>>> PGD 4078b99067 P4D 4078b99067 PUD 0
> >>>> Oops: Oops: 0000 [#1] SMP NOPTI
> >>>> CPU: 74 UID: 0 PID: 466 Comm: migration/74 Kdump: loaded Not tainted
> >>>> 6.16.0-0.rc2.24.eln149.x86_64 #1 PREEMPT(lazy)
> >>>> Hardware name: GIGABYTE R182-Z91-00/MZ92-FS0-00, BIOS M07 09/03/2021
> >>>> Stopper: multi_cpu_stop+0x0/0x130 <- migrate_swap+0xa7/0x120
> >>>> RIP: 0010:__migrate_swap_task+0x2f/0x170
> >>>> Code: 41 55 4c 63 ee 41 54 55 53 48 89 fb 48 83 87 a0 04 00 00 01 65
> >>>> 48 ff 05 e7 14 dd 02 48 8b af 50 0a 00 00 66 90 e8 61 93 07 00 <48> 8b
> >>>> bd c8 04 00 00 e8 85 11 35 00 48 85 c0 74 12 ba 01 00 00 00
> >>>> RSP: 0018:ffffce79cd90bdd0 EFLAGS: 00010002
> >>>> RAX: 0000000000000001 RBX: ffff8e9c7290d1c0 RCX: 0000000000000000
> >>>> RDX: ffff8e9c71e83680 RSI: 000000000000001b RDI: ffff8e9c7290d1c0
> >>>> RBP: 0000000000000000 R08: 00056e36392913e7 R09: 00000000002ab980
> >>>> R10: ffff8eac2fcb13c0 R11: ffff8e9c77997410 R12: ffff8e7c2fcf12c0
> >>>> R13: 000000000000001b R14: ffff8eac71eda944 R15: ffff8eac71eda944
> >>>> FS: 0000000000000000(0000) GS:ffff8eac9db4a000(0000) knlGS:0000000000000000
> >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> CR2: 00000000000004c8 CR3: 0000003072388003 CR4: 0000000000f70ef0
> >>>> PKRU: 55555554
> >>>> Call Trace:
> >>>> <TASK>
> >>>> migrate_swap_stop+0xe8/0x190
> >>>> multi_cpu_stop+0xf3/0x130
> >>>> ? __pfx_multi_cpu_stop+0x10/0x10
> >>>> cpu_stopper_thread+0x97/0x140
> >>>> ? __pfx_smpboot_thread_fn+0x10/0x10
> >>>> smpboot_thread_fn+0xf3/0x220
> >>>> kthread+0xfc/0x240
> >>>> ? __pfx_kthread+0x10/0x10
> >>>> ? __pfx_kthread+0x10/0x10
> >>>> ret_from_fork+0xf0/0x110
> >>>> ? __pfx_kthread+0x10/0x10
> >>>> ret_from_fork_asm+0x1a/0x30
> >>>> </TASK>
> >>>> ---
> >>>>
> >>>> **Kernel Version:**
> >>>> 6.16.0-0.rc2.24.eln149.x86_64 (Fedora rawhide)
> >>>> https://koji.fedoraproject.org/koji/buildinfo?buildID=2732950
> >>>>
> >>>> **Reproducibility:**
> >>>> Happened multiple times during routine CPU-intensive operations. It
> >>>> happens with various benchmarks (Stress_ng, Linpack) after several
> >>>> hours of performance testing. `migration/*` kernel threads hit a NULL
> >>>> dereference in `__migrate_swap_task`.
> >>>>
> >>>> **System Info:**
> >>>> - Platform: GIGABYTE R182-Z91-00 (dual socket EPYC)
> >>>> - BIOS: M07 09/03/2021
> >>>> - Config: Based on Fedora’s debug kernel (`PREEMPT(lazy)`)
> >>>>
> >>>> **Crash Cause (tentative):**
> >>>> NULL dereference at offset `0x4c8` from a task struct pointer in
> >>>> `__migrate_swap_task`. Possibly an uninitialized or freed
> >>>> `task_struct` field.
> >>>>
> >>>> Please let me know if you’d like me to test a patch or if you need
> >>>> more details.
> >>>>
> >>>> Thanks,
> >>>> Jirka
> >>>>
> >>>>
> >>>
> >>> aghosh
> >>>
> >>
> >>
> >> --
> >> -Jirka
> >
> >
> >
>
--
-Jirka