linux-kernel - `rcu: INFO: rcu_sched detected stalls on CPUs/tasks` on AMD EPYC server

Message-ID: <8e3041f5-4e57-acd0-027b-f1ef59c37bb0@molgen.mpg.de>
Date:   Wed, 15 May 2019 13:22:06 +0200
From:   Paul Menzel <pmenzel@...gen.mpg.de>
To:     "Paul E. McKenney" <paulmck@...ux.ibm.com>,
        Josh Triplett <josh@...htriplett.org>
Cc:     "x86@...nel.org" <x86@...nel.org>, linux-kernel@...r.kernel.org
Subject: `rcu: INFO: rcu_sched detected stalls on CPUs/tasks` on AMD EPYC
 server

Dear Linux folks,


When building a Linux kernel (for example 5.1.2) with 126, 127, or 128 threads
on a 128-thread AMD EPYC server, the server *sometimes* becomes unusable, and
logging in over the network is no longer possible. Only logging in on tty1 still
works, and the server needs to be rebooted.

```
[    0.000000] Linux version 4.19.19.mx64.244 (root@...internet.molgen.mpg.de) (gcc version 7.3.0 (GCC)) #1 SMP Tue Feb 5 13:01:13 CET 2019
[…]
[2418051.367223] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[2418051.367231] rcu: 	30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=14323 
[2418051.367235] rcu: 	94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=14323 
[2418051.367236] rcu: 	(detected by 2, t=60002 jiffies, g=298982765, q=7633949)
[2418051.367254] Sending NMI from CPU 2 to CPUs 30:
[2418061.370201] Sending NMI from CPU 2 to CPUs 94:
[2418071.372935] rcu: rcu_sched kthread starved for 20004 jiffies! g298982765 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=106
[2418071.372936] rcu: RCU grace-period kthread stack dump:
[2418071.372938] rcu_sched       R  running task        0    11      2 0x80000000
[2418071.372940] Call Trace:
[2418071.372947]  ? _raw_spin_unlock_irqrestore+0xa/0x10
[2418071.372950]  ? force_qs_rnp+0x11e/0x140
[2418071.372952]  ? rcu_gp_kthread+0x62b/0xdf0
[2418071.372953]  ? __schedule+0x1f8/0x7b0
[2418071.372955]  ? rcu_gp_slow.isra.40.part.41+0x30/0x30
[2418071.372957]  ? kthread+0x113/0x130
[2418071.372958]  ? kthread_park+0x90/0x90
[2418071.372960]  ? ret_from_fork+0x22/0x40
[2418231.372935] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[2418231.372943] rcu: 	30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=52808 
[2418231.372946] rcu: 	94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=52808 
[2418231.372947] rcu: 	(detected by 5, t=240007 jiffies, g=298982765, q=8914782)
[2418231.372959] Sending NMI from CPU 5 to CPUs 30:
[2418241.375808] Sending NMI from CPU 5 to CPUs 94:
[2418251.378374] rcu: rcu_sched kthread starved for 20002 jiffies! g298982765 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=60
[2418251.378376] rcu: RCU grace-period kthread stack dump:
[2418251.378378] rcu_sched       R  running task        0    11      2 0x80000000
[2418251.378381] Call Trace:
[2418251.378388]  ? _raw_spin_unlock_irqrestore+0xa/0x10
[2418251.378392]  ? force_qs_rnp+0x11e/0x140
[2418251.378393]  ? rcu_gp_kthread+0x62b/0xdf0
[2418251.378395]  ? __schedule+0x1f8/0x7b0
[2418251.378397]  ? rcu_gp_slow.isra.40.part.41+0x30/0x30
[2418251.378399]  ? kthread+0x113/0x130
[2418251.378400]  ? kthread_park+0x90/0x90
[2418251.378402]  ? ret_from_fork+0x22/0x40
[2418411.378841] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[2418411.378849] rcu: 	30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=91376 
[2418411.378852] rcu: 	94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=91376 
[2418411.378853] rcu: 	(detected by 3, t=420012 jiffies, g=298982765, q=10176682)
[2418411.378866] Sending NMI from CPU 3 to CPUs 30:
[2418421.381889] Sending NMI from CPU 3 to CPUs 94:
[2418431.384518] rcu: rcu_sched kthread starved for 20004 jiffies! g298982765 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=107
[2418431.384520] rcu: RCU grace-period kthread stack dump:
[2418431.384521] rcu_sched       R  running task        0    11      2 0x80000000
[2418431.384523] Call Trace:
[2418431.384530]  ? _raw_spin_unlock_irqrestore+0xa/0x10
[2418431.384533]  ? force_qs_rnp+0x11e/0x140
[2418431.384535]  ? rcu_gp_kthread+0x62b/0xdf0
[2418431.384537]  ? __schedule+0x1f8/0x7b0
[2418431.384538]  ? rcu_gp_slow.isra.40.part.41+0x30/0x30
[2418431.384540]  ? kthread+0x113/0x130
[2418431.384541]  ? kthread_park+0x90/0x90
[2418431.384543]  ? ret_from_fork+0x22/0x40
[…]
```
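
To make the excerpt easier to read, here is a minimal sketch that pulls the three "detected by" summary lines out of the log and compares the growth of `t=` (in jiffies) with the growth of the timestamps. The helper names `parse_stall_reports` and `estimate_hz` are made up for illustration, and the tick rate is only inferred from these three lines, not read from the kernel configuration.

```python
import re

# Summary lines copied from the excerpt above
# (timestamp in seconds since boot, t in jiffies).
LOG = """\
[2418051.367236] rcu: (detected by 2, t=60002 jiffies, g=298982765, q=7633949)
[2418231.372947] rcu: (detected by 5, t=240007 jiffies, g=298982765, q=8914782)
[2418411.378853] rcu: (detected by 3, t=420012 jiffies, g=298982765, q=10176682)
"""

# Hypothetical pattern for the summary line; it only covers the format shown here.
PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\] rcu:\s+\(detected by (\d+), t=(\d+) jiffies, g=(-?\d+)"
)


def parse_stall_reports(text):
    """Return one dict per 'detected by' line found in text."""
    return [
        {
            "timestamp": float(m[1]),  # seconds since boot
            "cpu": int(m[2]),          # CPU that detected the stall
            "jiffies": int(m[3]),      # stall duration so far, in jiffies
            "gp": int(m[4]),           # grace-period number
        }
        for m in PATTERN.finditer(text)
    ]


def estimate_hz(reports):
    """Estimate jiffies per second from the first and last report."""
    first, last = reports[0], reports[-1]
    return (last["jiffies"] - first["jiffies"]) / (last["timestamp"] - first["timestamp"])


reports = parse_stall_reports(LOG)
hz = estimate_hz(reports)
print(f"estimated tick rate: {hz:.1f} jiffies/s")
for r in reports:
    print(f"CPU {r['cpu']} reports g={r['gp']} stalled for about {r['jiffies'] / hz:.0f} s")
```

If the estimate is right, all three reports belong to the same grace period (g=298982765), which by the third report has been stalled for roughly seven minutes.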

Do you see anything in the attached logs that could cause this?


Kind regards,

Paul

View attachment "dmesg.txt" of type "text/plain" (156803 bytes)

View attachment "ps-aux.txt" of type "text/plain" (94209 bytes)

Download attachment "smime.p7s" of type "application/pkcs7-signature" (5174 bytes)