Message-ID: <CAHk-=whgqmXgL_toAQWF793WuYMCNsBhvTW8B0xAD360eXX8-A@mail.gmail.com>
Date: Wed, 30 Jul 2025 20:31:44 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Ingo Molnar <mingo@...nel.org>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Tejun Heo <tj@...nel.org>,
Valentin Schneider <vschneid@...hat.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>
Subject: Re: [GIT PULL] Scheduler updates for v6.17

On Sun, 27 Jul 2025 at 23:48, Ingo Molnar <mingo@...nel.org> wrote:
>
> PSI:
>
> - Improve scalability by optimizing psi_group_change() cpu_clock() usage
> (Peter Zijlstra)

I suspect this is buggy.

Maybe this is coincidence, but that sounds very unlikely:

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:3:7996]
CPU#0 Utilization every 4s during lockup:
  #1: 100% system,   0% softirq,   0% hardirq,   0% idle
  #2: 100% system,   1% softirq,   1% hardirq,   0% idle
  #3: 100% system,   0% softirq,   0% hardirq,   0% idle
  #4: 101% system,   0% softirq,   0% hardirq,   0% idle
  #5: 100% system,   0% softirq,   0% hardirq,   0% idle
Modules linked in: uinput rfcomm nf_nat_tftp nf_conntrack_tftp
bridge stp llc ccm nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet [...]
CPU: 0 UID: 0 PID: 7996 Comm: kworker/0:3 Not tainted 6.16.0-06574-gd9104cec3e8f #164 VOLUNTARY
Hardware name: Dell Inc. XPS 13 9380/0KTW76, BIOS 1.26.0 09/11/2023
Workqueue: events psi_avgs_work
RIP: 0010:collect_percpu_times+0x2f6/0x320
Code: c0 0f b6 c0 c1 e0 09 41 09 c5 e9 14 ff ff ff 49 8b 0f 48 89 4c 24 48 49 8b 4f 08 48 89 4c 24 50 e9 6e fe ff ff 4c 89 c0 f3 90 <4a> 8b 14 ed c0 3c 20 93
RSP: 0018:ffffd4d3cc113d60 EFLAGS: 00000202
RAX: ffffffff93b26880 RBX: fffff4d3bfba0ed4 RCX: 000000000000622d
RDX: ffff8ced1e597880 RSI: fffffffc6684cefc RDI: 0000000000000000
RBP: ffffd4d3cc113db8 R08: ffffffff93b26880 R09: 0000000000000000
R10: 00001386e5a9adc7 R11: 000000000000eda9 R12: ffffd4d3cc113dd8
R13: 0000000000000006 R14: 0000000000000006 R15: fffff4d3bfba0ec0
FS: 0000000000000000(0000) GS:ffff8ced8a8f1000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000027f400c50010 CR3: 00000001b641e005 CR4: 00000000003726f0
Call Trace:
<TASK>
psi_avgs_work+0x31/0xa0
process_one_work+0x135/0x220
worker_thread+0x2e7/0x420
kthread+0xbd/0x1a0
ret_from_fork+0x133/0x160
ret_from_fork_asm+0x11/0x20
</TASK>

and yeah, the laptop was dead at that point. Thankfully it had been
alive enough that the watchdog messages made it into the logs.

There was more than one of those reports (34 of them, to be exact), but
they all look pretty much the same. The RIP is always the same:
collect_percpu_times+0x2f6/0x320, and that's just the instruction
after the 'pause' instruction that comes from

    psi_read_begin ->
        return read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));

which is from that __read_seqcount_begin() code that waits for the
writer to go away:

    while (unlikely((__seq = seqprop_sequence(s)) & 1))     \
        cpu_relax();                                         \

and clearly it never does.
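
Purely to illustrate that failure mode, here is a minimal userspace toy
model of the read-side loop quoted above, hand-rolled with C11 atomics.
It is not the kernel's seqcount_t implementation, and the toy_* names
are made up for the sketch; it only shows how a reader spins forever
once the sequence is left odd, i.e. once the write side "never goes
away":

/*
 * Toy userspace model of the seqcount read-side spin described above.
 * This is NOT the kernel's seqcount_t code; the toy_* names are
 * invented for illustration.  Build with: cc -std=c11 toy_seq.c
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;		/* stand-in for psi_seq */
static int protected_data;

/* Write side: sequence is odd while an update is in flight. */
static void toy_write_begin(void) { atomic_fetch_add(&seq, 1); }
static void toy_write_end(void)   { atomic_fetch_add(&seq, 1); }

/* Read side: mirrors the __read_seqcount_begin() loop quoted above. */
static unsigned toy_read_begin(void)
{
	unsigned s;

	while ((s = atomic_load(&seq)) & 1)
		;	/* cpu_relax() in the kernel */
	return s;
}

static int toy_read_retry(unsigned start)
{
	return atomic_load(&seq) != start;
}

int main(void)
{
	unsigned start;
	int val;

	/* Balanced writer: the reader makes progress. */
	toy_write_begin();
	protected_data = 42;
	toy_write_end();

	do {
		start = toy_read_begin();
		val = protected_data;
	} while (toy_read_retry(start));
	printf("read %d (seq=%u)\n", val, start);

	/*
	 * Unbalanced writer: the sequence stays odd forever, so the
	 * next toy_read_begin() spins -- the same symptom as the
	 * soft lockup above.
	 */
	toy_write_begin();
	/* no toy_write_end() ... */
	toy_read_begin();	/* never returns */

	return 0;
}

The last toy_read_begin() call never returns, which matches what the
oops above shows in collect_percpu_times(): the reader keeps seeing an
odd sequence and keeps executing 'pause'.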

Why? I have no idea. But hopefully this makes somebody go "D'oh!" and
send me a trivial fix.

Please?

                 Linus