Message-ID: <aFtfZgcL66nq6TcY@dread.disaster.area>
Date: Wed, 25 Jun 2025 12:31:02 +1000
From: Dave Chinner <david@...morbit.com>
To: linux-kernel@...r.kernel.org
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: [BUG v6.16-rc3] kernel oops in __migrate_swap_task
Hi folks,
I had this happen once randomly on 6.16-rc2, but it was mixed up
amongst a heap of other test failures. I've now got a clean, single
failure on 6.16-rc3 that looks like this:
[11001.388660] BUG: kernel NULL pointer dereference, address: 00000000000004c8
[11001.392374] #PF: supervisor read access in kernel mode
[11001.394574] #PF: error_code(0x0000) - not-present page
[11001.396687] PGD 0 P4D 0
[11001.397821] Oops: Oops: 0000 [#1] SMP NOPTI
[11001.399507] CPU: 10 UID: 0 PID: 66 Comm: migration/10 Not tainted 6.16.0-rc3-dgc+ #342 PREEMPT(full)
[11001.403327] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[11001.407288] Stopper: multi_cpu_stop+0x0/0x120 <- migrate_swap+0x80/0x110
[11001.410066] RIP: 0010:__migrate_swap_task+0x31/0x1a0
[11001.412132] Code: 89 e5 41 57 41 56 53 48 89 fb 48 ff 87 60 03 00 00 41 89 f7 65 48 ff 05 7d 13 22 04 4c 8b b7 10 09 00 00 66 90 e8 ff db 05 00 0
[11001.419845] RSP: 0018:ffffc90006677d90 EFLAGS: 00010002
[11001.422015] RAX: ffff88810231d100 RBX: ffff888843982880 RCX: 000000000000392e
[11001.425316] RDX: 0000000075dcabe7 RSI: 0000000000000020 RDI: ffff888843982880
[11001.428559] RBP: ffffc90006677da8 R08: 0000000000000001 R09: 0000000000000090
[11001.431695] R10: 0000000000000080 R11: 00000000000000d0 R12: ffff88881fcab440
[11001.434781] R13: ffff88981fa2b440 R14: 0000000000000000 R15: 0000000000000020
[11001.437783] FS: 0000000000000000(0000) GS:ffff88889a6f1000(0000) knlGS:0000000000000000
[11001.441208] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11001.443645] CR2: 00000000000004c8 CR3: 000000013c31b000 CR4: 0000000000350ef0
[11001.446645] Call Trace:
[11001.447770] <TASK>
[11001.448618] migrate_swap_stop+0x16a/0x1d0
[11001.450458] multi_cpu_stop+0xcd/0x120
[11001.452034] ? __pfx_multi_cpu_stop+0x10/0x10
[11001.453895] cpu_stopper_thread+0xdc/0x190
[11001.455631] smpboot_thread_fn+0x150/0x230
[11001.457447] kthread+0x20c/0x240
[11001.458824] ? __pfx_smpboot_thread_fn+0x10/0x10
[11001.460760] ? __pfx_kthread+0x10/0x10
[11001.462312] ret_from_fork+0x77/0x140
[11001.463938] ? __pfx_kthread+0x10/0x10
[11001.465570] ret_from_fork_asm+0x1a/0x30
[11001.467272] </TASK>
[11001.468233] Modules linked in:
[11001.469586] CR2: 00000000000004c8
[11001.471094] ---[ end trace 0000000000000000 ]---
[11001.473044] RIP: 0010:__migrate_swap_task+0x31/0x1a0
[11001.475112] Code: 89 e5 41 57 41 56 53 48 89 fb 48 ff 87 60 03 00 00 41 89 f7 65 48 ff 05 7d 13 22 04 4c 8b b7 10 09 00 00 66 90 e8 ff db 05 00 0
[11001.482744] RSP: 0018:ffffc90006677d90 EFLAGS: 00010002
[11001.484925] RAX: ffff88810231d100 RBX: ffff888843982880 RCX: 000000000000392e
[11001.487844] RDX: 0000000075dcabe7 RSI: 0000000000000020 RDI: ffff888843982880
[11001.490797] RBP: ffffc90006677da8 R08: 0000000000000001 R09: 0000000000000090
[11001.493740] R10: 0000000000000080 R11: 00000000000000d0 R12: ffff88881fcab440
[11001.496642] R13: ffff88981fa2b440 R14: 0000000000000000 R15: 0000000000000020
[11001.499579] FS: 0000000000000000(0000) GS:ffff88889a6f1000(0000) knlGS:0000000000000000
[11001.502873] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11001.505253] CR2: 00000000000004c8 CR3: 000000013c31b000 CR4: 0000000000350ef0
[11001.508155] note: migration/10[66] exited with irqs disabled
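The faulting address of 0x4c8 smells like a field offset off a NULL
structure pointer rather than a wild pointer (note that R14 is zero
in the register dump), and we're only at +0x31 so it's very early in
the function. For reference, this is roughly the shape of
__migrate_swap_task() as I read kernel/sched/core.c - paraphrased
and simplified from memory, so don't treat it as verbatim -rc3
source:

	/*
	 * Paraphrased, simplified shape of __migrate_swap_task() from
	 * kernel/sched/core.c - NOT verbatim -rc3 source, details elided.
	 */
	static void __migrate_swap_task(struct task_struct *p, int cpu)
	{
		if (task_on_rq_queued(p)) {
			struct rq *src_rq = task_rq(p);   /* rq the task is queued on */
			struct rq *dst_rq = cpu_rq(cpu);  /* rq we're swapping it to */

			/* dequeue from source, retarget, requeue on destination */
			deactivate_task(src_rq, p, 0);
			set_task_cpu(p, cpu);
			activate_task(dst_rq, p, 0);
			wakeup_preempt(dst_rq, p, 0);
		} else {
			/* task isn't queued; just record where it should wake up */
			p->wake_cpu = cpu;
		}
	}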
At this point the system effectively hangs, and the only things that
fire from here on are soft lockup warnings like this:
[11024.562203] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [xfs_io:3957707]
[11024.562211] Modules linked in:
[11024.562217] CPU: 24 UID: 0 PID: 3957707 Comm: xfs_io Tainted: G D 6.16.0-rc3-dgc+ #342 PREEMPT(full)
[11024.562224] Tainted: [D]=DIE
[11024.562225] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[11024.562228] RIP: 0010:pv_native_safe_halt+0xf/0x20
[11024.562238] Code: e3 8c 00 5d c3 cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d 41 eb 15 00 55 48 89 e5 fb f4 0
[11024.562241] RSP: 0018:ffffc90021ba7d60 EFLAGS: 00000246
[11024.562244] RAX: 0000000000000001 RBX: ffffffff83208044 RCX: 0000000000000000
[11024.562247] RDX: fffffffffffffff8 RSI: 0000000000000001 RDI: ffff88901fc2c214
[11024.562248] RBP: ffffc90021ba7d60 R08: ffff88901fc2c214 R09: 00000000412a7da6
[11024.562250] R10: 00007fe40e7a0000 R11: ffff8881603c0800 R12: 0000000000640000
[11024.562252] R13: ffff88889a631000 R14: ffff88901fc2c214 R15: ffff88901fc2c200
[11024.562259] FS: 0000000000000000(0000) GS:ffff88909a671000(0000) knlGS:0000000000000000
[11024.562261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11024.562263] CR2: 00007fe40e82d3dc CR3: 00000001603c0000 CR4: 0000000000350ef0
[11024.562269] Call Trace:
[11024.562272] <TASK>
[11024.562275] kvm_wait+0x6b/0x80
[11024.562282] __pv_queued_spin_lock_slowpath+0x173/0x430
[11024.562287] queued_read_lock_slowpath+0x6f/0x120
[11024.562291] _raw_read_lock+0x2b/0x40
[11024.562294] mm_update_next_owner+0x53/0x270
[11024.562300] exit_mm+0xa9/0x100
[11024.562303] do_exit+0x1b9/0x980
[11024.562307] do_group_exit+0x8f/0x90
[11024.562311] __x64_sys_exit_group+0x17/0x20
[11024.562315] x64_sys_call+0x2f60/0x2f60
[11024.562321] do_syscall_64+0x6b/0x1f0
[11024.562325] ? exc_page_fault+0x70/0xa0
[11024.562328] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[11024.562331] RIP: 0033:0x7fe40eb04295
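The hang itself makes some sense to me: migrate_swap() drives the
swap via stop_two_cpus(), and multi_cpu_stop() is a lockstep state
machine where every participating CPU spins with interrupts disabled
until all CPUs have acked each state transition. If one stopper
thread oopses and exits mid-protocol - which is exactly what the
"migration/10[66] exited with irqs disabled" note above says
happened - its partner spins there forever with IRQs off, and
everything that subsequently piles up behind that CPU degenerates
into soft lockups like the one above. Roughly (again paraphrased
from kernel/stop_machine.c from memory, not verbatim):

	/*
	 * Paraphrased, simplified shape of multi_cpu_stop() from
	 * kernel/stop_machine.c. The state only advances when every
	 * participating CPU acks it, so a stopper thread dying
	 * mid-protocol leaves the survivors spinning with IRQs off.
	 */
	static int multi_cpu_stop(void *data)
	{
		struct multi_stop_data *msdata = data;
		enum multi_stop_state newstate, curstate = MULTI_STOP_NONE;

		do {
			cpu_relax();
			newstate = READ_ONCE(msdata->state);
			if (newstate != curstate) {
				curstate = newstate;
				switch (curstate) {
				case MULTI_STOP_DISABLE_IRQ:
					local_irq_disable();
					break;
				case MULTI_STOP_RUN:
					msdata->fn(msdata->data); /* the actual work */
					break;
				default:
					break;
				}
				ack_state(msdata);  /* last CPU to ack advances state */
			}
		} while (curstate != MULTI_STOP_EXIT);

		local_irq_enable();
		return 0;
	}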
I don't know exactly what triggers it, except to say I'm running the
parallel variant of fstests with 64 tests executing concurrently.
Those tests run in parallel with operations like random CPU hotplug,
memory migration and cache dropping, whilst there may be thousands
of processes executing filesystem stress across more than a hundred
filesystems mounted on loop devices.
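For concreteness, the hotplug and cache-dropping churn is all driven
through the standard sysfs/procfs knobs. The real harness is a pile
of fstests shell scripts, but a hypothetical C equivalent of the
perturbation loop looks something like this (illustrative only - the
CPU count and timing here are made up):

	/*
	 * Hypothetical stand-in for the background perturbation loop:
	 * randomly offline/online CPUs via sysfs and drop caches via
	 * procfs. Illustrative only - the real harness is shell.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static void poke(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return;	/* knob may not exist, e.g. non-hotpluggable CPU */
		fputs(val, f);
		fclose(f);
	}

	int main(void)
	{
		char path[64];

		srand(getpid());
		for (;;) {
			/* never touch CPU 0; a 32 CPU machine assumed here */
			int cpu = 1 + rand() % 31;

			snprintf(path, sizeof(path),
				 "/sys/devices/system/cpu/cpu%d/online", cpu);
			poke(path, "0");
			sleep(1);
			poke(path, "1");

			/* 3 == drop page cache plus dentries/inodes */
			poke("/proc/sys/vm/drop_caches", "3");
		}
		return 0;
	}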
This is not reproducible on 6.15.0, so it is likely a regression
introduced in the 6.16 merge window....
-Dave.
--
Dave Chinner
david@...morbit.com