Message-ID: <F940B2E6-2B76-4008-98B9-B29C27512A60@oracle.com>
Date: Wed, 12 Nov 2025 06:30:50 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Boqun Feng <boqun.feng@...il.com>,
	Jonathan Corbet <corbet@....net>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Arnd Bergmann <arnd@...db.de>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
> On Nov 11, 2025, at 8:42 AM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> On 2025-11-06 12:28, Prakash Sangappa wrote:
>> [...]
>>> Hit this watchdog panic.
>>>
>>> Using the following tree. Assume this is the latest:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>>
>>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>> When this happened during the development of the "complex" mm_cid
>> scheme, this was typically caused by a stale "mm_cid" being kept around
>> by a task even though it was not actually scheduled, thus causing
>> over-reservation of concurrency IDs beyond the max_cids threshold. This
>> ends up looping in:
>> static inline unsigned int mm_get_cid(struct mm_struct *mm)
>> {
>> 	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>>
>> 	while (cid == MM_CID_UNSET) {
>> 		cpu_relax();
>> 		cid = __mm_get_cid(mm, num_possible_cpus());
>> 	}
>> 	return cid;
>> }
>> Based on the stacktrace you provided, it seems to happen within
>> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
>> initialization issue in fork, or an issue when cloning a new thread ?
>
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
[..]
> I see two possible issues here:
>
> A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
> mode without setting the mc->transit flag.
>
> B) sched_mm_cid_fork calls mm_get_cid() before invoking
> mm_cid_fixup_cpus_to_tasks() which would reclaim stale per-cpu
> mm_cids and make them available for mm_get_cid().
>
> Thoughts ?
The problem reproduces on a 2-socket AMD (384 CPUs) bare-metal system.
It occurs soon after system boot; it does not reproduce on a 64-CPU VM.
Managed to grep the ‘mksquashfs’ command that was executing when the panic triggers:
# ps -ef | grep mksquash
root 16614 10829 0 05:55 ? 00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
I added the following printk()s to mm_get_cid():
static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+	int max_cids = READ_ONCE(mm->mm_cid.max_cids);
+	long *addr = mm_cidmask(mm);
+
+	if (cid == MM_CID_UNSET) {
+		printk(KERN_INFO "pid %d, exec %s, maxcids %d percpu %d pcputhr %d, users %d nrcpus_allwd %d\n",
+		       mm->owner->pid, mm->owner->comm, max_cids,
+		       mm->mm_cid.percpu, mm->mm_cid.pcpu_thrs,
+		       mm->mm_cid.users, mm->mm_cid.nr_cpus_allowed);
+		printk(KERN_INFO "cid bitmask %lx %lx %lx %lx %lx %lx\n",
+		       addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
+	}
	while (cid == MM_CID_UNSET) {
		cpu_relax();
Got the following trace (trimmed):
[ 65.139543] pid 16614, exec mksquashfs, maxcids 82 percpu 0 pcputhr 0, users 66 nrcpus_allwd 384
[ 65.139544] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f43455357 44455a494c414954
[ 65.139597] pid 16614, exec mksquashfs, maxcids 83 percpu 0 pcputhr 0, users 67 nrcpus_allwd 384
[ 65.139599] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f4345535f 44455a494c414954
..
[ 65.142665] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a5fffffffff
[ 65.142750] pid 16614, exec mksquashfs, maxcids 155 percpu 0 pcputhr 0, users 124 nrcpus_allwd 384
[ 65.142752] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a7fffffffff
..
[ 65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[ 65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
[ 65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
Followed by the panic:
[ 99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
..
[ 99.979340] RIP: 0010:mm_get_cid+0xf5/0x150
[ 99.979346] Code: 4d 8b 44 24 18 48 c7 c7 e0 07 86 b6 49 8b 4c 24 10 49 8b 54 24 08 41 ff 74 24 28 49 8b 34 24 e8 c1 b7 04 00 48 83 c4 18 f3 90 <8b> 05 65 ae ec 01 8b 35 eb e0 68 01 83 c0 3f 48 89 f5 c1 e8 03 25
[ 99.979348] RSP: 0018:ff75650cf9717d20 EFLAGS: 00000046
[ 99.979349] RAX: 0000000000000180 RBX: ff424236e5d55c40 RCX: 0000000000000180
[ 99.979351] RDX: 0000000000000000 RSI: 0000000000000180 RDI: ff424236e5d55cd0
[ 99.979352] RBP: 0000000000000180 R08: 0000000000000180 R09: c0000000fffdffff
[ 99.979352] R10: 0000000000000001 R11: ff75650cf9717a80 R12: ff424236e5d55ca0
[ 99.979353] R13: ff424236e5d55668 R14: ffa7650cba2841c0 R15: ff42423881a5aa80
[ 99.979355] FS: 00007f469ed6b740(0000) GS:ff424351c24d6000(0000) knlGS:0000000000000000
[ 99.979356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.979357] CR2: 00007f443b7fdfb8 CR3: 0000012724555006 CR4: 0000000000771ef0
[ 99.979358] PKRU: 55555554
[ 99.979359] Call Trace:
[ 99.979361] <TASK>
[ 99.979364] sched_mm_cid_fork+0x3fb/0x590
[ 99.979369] copy_process+0xd1a/0x2130
[ 99.979375] kernel_clone+0x9d/0x3b0
[ 99.979379] __do_sys_clone+0x65/0x90
[ 99.979384] do_syscall_64+0x64/0x670
[ 99.979388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 99.979391] RIP: 0033:0x7f469d77d8c5
As you can see, when it cannot find available cids it is in per-task mm_cid mode (percpu is 0 throughout). Note also that max_cids appears to track users + 25% (e.g. 140 * 1.25 = 175), yet by the last trace line all 384 bits of the cid bitmask are set while there are only 140 users, so far more cids are reserved than there are users. Perhaps it is taking longer to drop used cids? I have not delved into the mm_cid management.
Hopefully you can make out something from the above trace.
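For what it's worth, here is a tiny user-space model of the allocate loop that spins the same way once every bit in the mask is set. It is only a sketch based on my reading of the quoted mm_get_cid(); the real __mm_get_cid() is assumed to be "find the first clear bit below the limit and set it", and the leak itself is only simulated:

/* toy_mm_cid.c: user-space model of the mm_get_cid() spin.
 * Assumption: __mm_get_cid() finds the first clear bit below 'max'
 * in the cid mask and sets it. Sketch only, not the kernel code.
 */
#include <stdio.h>

#define NR_CIDS   8u
#define CID_UNSET (~0u)

static unsigned long cidmask;	/* one bit per reserved cid */

/* Reserve the first free cid below 'max'; CID_UNSET if the mask is full. */
static unsigned int get_cid(unsigned int max)
{
	for (unsigned int cid = 0; cid < max; cid++) {
		if (!(cidmask & (1UL << cid))) {
			cidmask |= 1UL << cid;
			return cid;
		}
	}
	return CID_UNSET;
}

int main(void)
{
	/* Simulate stale cids that were reserved but never reclaimed. */
	for (unsigned int i = 0; i < NR_CIDS; i++)
		get_cid(NR_CIDS);

	/* The next user finds the mask full; the kernel equivalent is
	 * "while (cid == MM_CID_UNSET) cpu_relax();" spinning forever.
	 */
	if (get_cid(NR_CIDS) == CID_UNSET)
		printf("mask full (0x%lx): mm_get_cid() would spin here\n", cidmask);
	return 0;
}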
Let me know if you want me to add more tracing.
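For instance, one cheap addition next to the printks above (a sketch; bitmap_weight() is the stock bitmap helper, and the mm_cid field names are the ones from the traces) would show how many cids are actually reserved versus the number of users:

+		printk(KERN_INFO "reserved cids %u vs users %d\n",
+		       bitmap_weight(mm_cidmask(mm), num_possible_cpus()),
+		       mm->mm_cid.users);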
-Prakash
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com