linux-kernel - Re: [BUG v6.16-rc3] kernel oops in __migrate_swap_task

Message-ID: <20250626125715.GF1613200@noisy.programming.kicks-ass.net>
Date: Thu, 26 Jun 2025 14:57:15 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Dave Chinner <david@...morbit.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [BUG v6.16-rc3] kernel oops in __migrate_swap_task

On Wed, Jun 25, 2025 at 12:31:02PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> I had this happen once randomly on 6.16-rc2, but it was mixed in
> amongst a heap of other failures. I've now got a clean, single failure
> on 6.16-rc3 that looks like this:
> 
> [11001.388660] BUG: kernel NULL pointer dereference, address: 00000000000004c8
> [11001.392374] #PF: supervisor read access in kernel mode
> [11001.394574] #PF: error_code(0x0000) - not-present page
> [11001.396687] PGD 0 P4D 0
> [11001.397821] Oops: Oops: 0000 [#1] SMP NOPTI
> [11001.399507] CPU: 10 UID: 0 PID: 66 Comm: migration/10 Not tainted 6.16.0-rc3-dgc+ #342 PREEMPT(full)
> [11001.403327] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [11001.407288] Stopper: multi_cpu_stop+0x0/0x120 <- migrate_swap+0x80/0x110
> [11001.410066] RIP: 0010:__migrate_swap_task+0x31/0x1a0
> [11001.412132] Code: 89 e5 41 57 41 56 53 48 89 fb 48 ff 87 60 03 00 00 41 89 f7 65 48 ff 05 7d 13 22 04 4c 8b b7 10 09 00 00 66 90 e8 ff db 05 00 0
> [11001.419845] RSP: 0018:ffffc90006677d90 EFLAGS: 00010002
> [11001.422015] RAX: ffff88810231d100 RBX: ffff888843982880 RCX: 000000000000392e
> [11001.425316] RDX: 0000000075dcabe7 RSI: 0000000000000020 RDI: ffff888843982880
> [11001.428559] RBP: ffffc90006677da8 R08: 0000000000000001 R09: 0000000000000090
> [11001.431695] R10: 0000000000000080 R11: 00000000000000d0 R12: ffff88881fcab440
> [11001.434781] R13: ffff88981fa2b440 R14: 0000000000000000 R15: 0000000000000020
> [11001.437783] FS:  0000000000000000(0000) GS:ffff88889a6f1000(0000) knlGS:0000000000000000
> [11001.441208] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [11001.443645] CR2: 00000000000004c8 CR3: 000000013c31b000 CR4: 0000000000350ef0
> [11001.446645] Call Trace:
> [11001.447770]  <TASK>
> [11001.448618]  migrate_swap_stop+0x16a/0x1d0
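
As a triage aid for readers of the trace above (not part of the original
report), here is a minimal Python sketch that pulls the register values out
of the oops text and lists which general-purpose registers were zero at the
time of the fault. The excerpt is copied from the quoted oops with the
timestamps and quote markers stripped; the helper names are hypothetical.
Reading a zero register as the NULL base pointer is an assumption, since the
truncated Code: line above does not show the faulting instruction.

```python
import re

# Excerpt of the quoted oops, with timestamps and "> " markers removed.
OOPS = """\
BUG: kernel NULL pointer dereference, address: 00000000000004c8
RAX: ffff88810231d100 RBX: ffff888843982880 RCX: 000000000000392e
RDX: 0000000075dcabe7 RSI: 0000000000000020 RDI: ffff888843982880
RBP: ffffc90006677da8 R08: 0000000000000001 R09: 0000000000000090
R10: 0000000000000080 R11: 00000000000000d0 R12: ffff88881fcab440
R13: ffff88981fa2b440 R14: 0000000000000000 R15: 0000000000000020
CR2: 00000000000004c8 CR3: 000000013c31b000 CR4: 0000000000350ef0
"""

# Matches "NAME: <16 hex digits>" pairs such as "RAX: ffff88810231d100".
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def parse_registers(text):
    """Return {register name: integer value} for every pair found in text."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def parse_fault_address(text):
    """Return the address reported on the 'BUG: ...' line as an integer."""
    match = re.search(r"address: ([0-9a-f]+)", text)
    return int(match.group(1), 16)


regs = parse_registers(OOPS)
fault_addr = parse_fault_address(OOPS)

# General-purpose registers that held zero; control registers are skipped.
null_regs = [name for name, value in regs.items()
             if value == 0 and not name.startswith("CR")]

print(f"fault address: {fault_addr:#x} (CR2 agrees: {regs['CR2'] == fault_addr})")
print(f"zero-valued general-purpose registers: {null_regs}")
```

A small fault address like this is typically a structure member read through
a NULL pointer, so a zero-valued register is a candidate for the base
pointer, with the fault address as the member offset. Confirming that needs
the disassembly of __migrate_swap_task from the matching vmlinux.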

> I don't know exactly what triggers it, except to say I'm running the
> parallel variant of fstests with 64 concurrent tests being run.
> These tests are in parallel with operations like random CPU hotplug,

The stopper thread and the hotplug involvement make me think it *might* be
related to this:

  https://lkml.kernel.org/r/20250626125323.GG1613376@noisy.programming.kicks-ass.net

Which isn't a new problem, but seems to have popped up recently.

> memory migration, cache dropping, etc whilst there may be thousands
> of processes executing filesystem stress across more than a hundred
> mounted filesystems on loop devices.
> 
> This is not reproducible on 6.15.0, so it is likely a regression
> introduced in the 6.16 merge window....

Let me go look through the commits to see if anything stands out.

Thanks!