Message-ID: <F940B2E6-2B76-4008-98B9-B29C27512A60@oracle.com>
Date: Wed, 12 Nov 2025 06:30:50 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC: Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Boqun Feng <boqun.feng@...il.com>,
	Jonathan Corbet <corbet@....net>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Arnd Bergmann <arnd@...db.de>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: [patch V3 00/12] rseq: Implement time slice extension mechanism
> On Nov 11, 2025, at 8:42 AM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>
> On 2025-11-10 09:23, Mathieu Desnoyers wrote:
>> On 2025-11-06 12:28, Prakash Sangappa wrote:
>> [...]
>>> Hit this watchdog panic.
>>>
>>> Using the following tree. Assume this is the latest:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/slice
>>>
>>> Appears to be spinning in mm_get_cid(). Must be the mm cid changes.
>>> https://lore.kernel.org/all/20251029123717.886619142@linutronix.de/
>> When this happened during the development of the "complex" mm_cid
>> scheme, this was typically caused by a stale "mm_cid" being kept around
>> by a task even though it was not actually scheduled, thus causing
>> over-reservation of concurrency IDs beyond the max_cids threshold. This
>> ends up looping in:
>> static inline unsigned int mm_get_cid(struct mm_struct *mm)
>> {
>> 	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
>>
>> 	while (cid == MM_CID_UNSET) {
>> 		cpu_relax();
>> 		cid = __mm_get_cid(mm, num_possible_cpus());
>> 	}
>> 	return cid;
>> }
>> Based on the stacktrace you provided, it seems to happen within
>> sched_mm_cid_fork() within copy_process, so perhaps it's simply an
>> initialization issue in fork, or an issue when cloning a new thread ?
>
> I've spent some time digging through Thomas' implementation of
> mm_cid management. I've spotted something which may explain
> the watchdog panic. Here is the scenario:
[..]
> I see two possible issues here:
>
> A) mm_update_cpus_allowed can transition from per-cpu to per-task mm_cid
> mode without setting the mc->transit flag.
>
> B) sched_mm_cid_fork calls mm_get_cid() before invoking
> mm_cid_fixup_cpus_to_tasks() which would reclaim stale per-cpu
> mm_cids and make them available for mm_get_cid().
>
> Thoughts ?
The problem reproduces on a 2-socket AMD (384 CPUs) bare-metal system.
It occurs soon after system boot; it does not reproduce on a 64-CPU VM.
Managed to grep the ‘mksquashfs’ command that was executing when the panic triggers:
# ps -ef | grep mksquash
root 16614 10829 0 05:55 ? 00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
I added the following printk()s to mm_get_cid():
static inline unsigned int mm_get_cid(struct mm_struct *mm)
{
	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+	int max_cids = READ_ONCE(mm->mm_cid.max_cids);
+	long *addr = mm_cidmask(mm);
+
+	if (cid == MM_CID_UNSET) {
+		printk(KERN_INFO "pid %d, exec %s, maxcids %d percpu %d pcputhr %d, users %d nrcpus_allwd %d\n",
+		       mm->owner->pid, mm->owner->comm, max_cids,
+		       mm->mm_cid.percpu, mm->mm_cid.pcpu_thrs,
+		       mm->mm_cid.users, mm->mm_cid.nr_cpus_allowed);
+		printk(KERN_INFO "cid bitmask %lx %lx %lx %lx %lx %lx\n",
+		       addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
+	}
	while (cid == MM_CID_UNSET) {
		cpu_relax();
Got the following trace (trimmed):
[ 65.139543] pid 16614, exec mksquashfs, maxcids 82 percpu 0 pcputhr 0, users 66 nrcpus_allwd 384
[ 65.139544] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f43455357 44455a494c414954
[ 65.139597] pid 16614, exec mksquashfs, maxcids 83 percpu 0 pcputhr 0, users 67 nrcpus_allwd 384
[ 65.139599] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 494e495f4345535f 44455a494c414954
..
[ 65.142665] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a5fffffffff
[ 65.142750] pid 16614, exec mksquashfs, maxcids 155 percpu 0 pcputhr 0, users 124 nrcpus_allwd 384
[ 65.142752] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 44455a7fffffffff
..
[ 65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
[ 65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
[ 65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
Followed by the panic:
[ 99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
..
[ 99.979340] RIP: 0010:mm_get_cid+0xf5/0x150
[ 99.979346] Code: 4d 8b 44 24 18 48 c7 c7 e0 07 86 b6 49 8b 4c 24 10 49 8b 54 24 08 41 ff 74 24 28 49 8b 34 24 e8 c1 b7 04 00 48 83 c4 18 f3 90 <8b> 05 65 ae ec 01 8b 35 eb e0 68 01 83 c0 3f 48 89 f5 c1 e8 03 25
[ 99.979348] RSP: 0018:ff75650cf9717d20 EFLAGS: 00000046
[ 99.979349] RAX: 0000000000000180 RBX: ff424236e5d55c40 RCX: 0000000000000180
[ 99.979351] RDX: 0000000000000000 RSI: 0000000000000180 RDI: ff424236e5d55cd0
[ 99.979352] RBP: 0000000000000180 R08: 0000000000000180 R09: c0000000fffdffff
[ 99.979352] R10: 0000000000000001 R11: ff75650cf9717a80 R12: ff424236e5d55ca0
[ 99.979353] R13: ff424236e5d55668 R14: ffa7650cba2841c0 R15: ff42423881a5aa80
[ 99.979355] FS: 00007f469ed6b740(0000) GS:ff424351c24d6000(0000) knlGS:0000000000000000
[ 99.979356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.979357] CR2: 00007f443b7fdfb8 CR3: 0000012724555006 CR4: 0000000000771ef0
[ 99.979358] PKRU: 55555554
[ 99.979359] Call Trace:
[ 99.979361] <TASK>
[ 99.979364] sched_mm_cid_fork+0x3fb/0x590
[ 99.979369] copy_process+0xd1a/0x2130
[ 99.979375] kernel_clone+0x9d/0x3b0
[ 99.979379] __do_sys_clone+0x65/0x90
[ 99.979384] do_syscall_64+0x64/0x670
[ 99.979388] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 99.979391] RIP: 0033:0x7f469d77d8c5
As you can see, when it cannot find available cids it is in per-task mm_cid mode (percpu is 0 throughout). Note also that max_cids appears to track users + 25% (e.g. 140 * 1.25 = 175), yet by the last trace line all 384 bits of the cid bitmask are set while there are only 140 users, so far more cids are reserved than there are users. Perhaps it is taking longer to drop used cids? I have not delved into the mm_cid management.
Hopefully you can make out something from the above trace.
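For what it's worth, here is a tiny user-space model of the allocate loop that spins the same way once every bit in the mask is set. It is only a sketch based on my reading of the quoted mm_get_cid(); the real __mm_get_cid() is assumed to be "find the first clear bit below the limit and set it", and the leak itself is only simulated:

/* toy_mm_cid.c: user-space model of the mm_get_cid() spin.
 * Assumption: __mm_get_cid() finds the first clear bit below 'max'
 * in the cid mask and sets it. Sketch only, not the kernel code.
 */
#include <stdio.h>

#define NR_CIDS   8u
#define CID_UNSET (~0u)

static unsigned long cidmask;	/* one bit per reserved cid */

/* Reserve the first free cid below 'max'; CID_UNSET if the mask is full. */
static unsigned int get_cid(unsigned int max)
{
	for (unsigned int cid = 0; cid < max; cid++) {
		if (!(cidmask & (1UL << cid))) {
			cidmask |= 1UL << cid;
			return cid;
		}
	}
	return CID_UNSET;
}

int main(void)
{
	/* Simulate stale cids that were reserved but never reclaimed. */
	for (unsigned int i = 0; i < NR_CIDS; i++)
		get_cid(NR_CIDS);

	/* The next user finds the mask full; the kernel equivalent is
	 * "while (cid == MM_CID_UNSET) cpu_relax();" spinning forever.
	 */
	if (get_cid(NR_CIDS) == CID_UNSET)
		printf("mask full (0x%lx): mm_get_cid() would spin here\n", cidmask);
	return 0;
}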
Let me know if you want me to add more tracing.
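For instance, one cheap addition next to the printks above (a sketch; bitmap_weight() is the stock bitmap helper, and the mm_cid field names are the ones from the traces) would show how many cids are actually reserved versus the number of users:

+		printk(KERN_INFO "reserved cids %u vs users %d\n",
+		       bitmap_weight(mm_cidmask(mm), num_possible_cpus()),
+		       mm->mm_cid.users);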
-Prakash
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com