linux-kernel - Re: [patch V3 00/12] rseq: Implement time slice extension mechanism

Message-ID: <f27baf17-87e2-470e-8d09-ad435331543f@efficios.com>
Date: Wed, 12 Nov 2025 15:40:11 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Prakash Sangappa <prakash.sangappa@...cle.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, LKML
 <linux-kernel@...r.kernel.org>, Peter Zijlstra <peterz@...radead.org>,
 "Paul E. McKenney" <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
 Jonathan Corbet <corbet@....net>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 K Prateek Nayak <kprateek.nayak@....com>,
 Steven Rostedt <rostedt@...dmis.org>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 Arnd Bergmann <arnd@...db.de>,
 "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: [patch V3 00/12] rseq: Implement time slice extension mechanism

On 2025-11-12 01:30, Prakash Sangappa wrote:
[...]
> 
> The problem reproduces on a 2-socket AMD (384 CPUs) bare metal system.
> It occurs soon after system boot up. It does not reproduce on a 64-CPU VM.
> 
> Managed to grep the ‘mksquashfs’ command that was executing, which triggers the panic.
> 
> #ps -ef |grep mksquash.
> root       16614   10829  0 05:55 ?        00:00:00 mksquashfs /dev/null /var/tmp/dracut.iLs0z0/.squash-test.img -no-progress -comp xz
> 
> 

[...]

> ..
> [   65.143712] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> [   65.143767] pid 16614, exec mksquashfs, maxcids 175 percpu 0 pcputhr 0, users 140 nrcpus_allwd 384
> [   65.143769] cid bitmask ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> 

It's odd that the cid bitmask is all f (every bit set). Aren't those
bits zeroed on mm init?
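To put numbers on the two dumps quoted above, here is a minimal sketch
that counts the set bits in each printed bitmask. It assumes each
whitespace-separated word is one 64-bit hex value; the helper name
count_set_bits is hypothetical, and the word order in the dump does not
matter for a bit count.

```python
def count_set_bits(dump: str) -> int:
    """Count set bits across whitespace-separated hex words."""
    return sum(bin(int(word, 16)).count("1") for word in dump.split())

# The two "cid bitmask" lines from the quoted trace, words copied as printed.
first = "ffffffffffffffff " * 5 + "7fffffffffffffff"
second = " ".join(["ffffffffffffffff"] * 6)

print(count_set_bits(first))   # 383
print(count_set_bits(second))  # 384
```

The second count equals nrcpus_allwd (384) and is well above both
maxcids (175) and users (140) from the same trace, which is what makes
the all-f mask look suspicious.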

> Followed by the panic.
> [   99.979256] watchdog: CPU114: Watchdog detected hard LOCKUP on cpu 114
> ..
[...]
> 
> As you can see, at least when it cannot find available cids it is in per-task mm cid mode.
> Perhaps it is taking longer to drop used cids? I have not delved into the mm cid management.
> Hopefully you can make something out of the above trace.
> 
> Let me know if you want me to add more tracing.

How soon after boot does that happen?

I'm starting to wonder whether the num_possible_cpus() value used in
mm_cid_size() and mm_init_cid(), for mm allocation and initialization
respectively, might be read before the boot sequence has finished
initializing it.

That's far-fetched, but it would be good to double-check that those
functions are never called before the last call to init_cpu_possible()
and set_cpu_possible().
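As a toy illustration of why the ordering would matter, the sketch
below models a size computed from the possible-CPU count before and
after the mask is fully populated. This is not kernel code: PossibleMask
and bitmap_bytes are hypothetical stand-ins, and sizing the area as one
bitmap of 64-bit longs is an assumption. Whether any such early call
exists is exactly the open question.

```python
BITS_PER_LONG = 64  # assumption: 64-bit kernel

def bitmap_bytes(nbits: int) -> int:
    """Bytes needed for a bitmap of nbits, rounded up to whole longs."""
    longs = (nbits + BITS_PER_LONG - 1) // BITS_PER_LONG
    return longs * (BITS_PER_LONG // 8)

class PossibleMask:
    """Toy stand-in for the possible-CPU mask; not kernel code."""
    def __init__(self) -> None:
        self.cpus = set()

    def set_cpu_possible(self, cpu: int) -> None:
        self.cpus.add(cpu)

    def num_possible_cpus(self) -> int:
        return len(self.cpus)

mask = PossibleMask()
mask.set_cpu_possible(0)  # only the boot CPU marked so far
early_size = bitmap_bytes(mask.num_possible_cpus())

for cpu in range(1, 384):  # the remaining CPUs of the 384-CPU system
    mask.set_cpu_possible(cpu)
late_size = bitmap_bytes(mask.num_possible_cpus())

print(early_size, late_size)  # 8 48
```

In this model, a size taken too early would cover 64 bits where 384 are
needed later, so any walk over the full range would run past the
allocation.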

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com