[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <512d22c0-4d4e-ddc5-a096-00abdac18d63@molgen.mpg.de>
Date:   Thu, 16 Jan 2020 14:33:34 +0100
From:   Paul Menzel <pmenzel@...gen.mpg.de>
To:     "Paul E. McKenney" <paulmck@...ux.ibm.com>
Cc:     Josh Triplett <josh@...htriplett.org>, x86@...nel.org,
        linux-kernel@...r.kernel.org, it+lkml@...gen.mpg.de
Subject: Re: `rcu: INFO: rcu_sched detected stalls on CPUs/tasks` on AMD EPYC
 server
Dear Paul,
On 2019-05-15 17:28, Paul E. McKenney wrote:
> On Wed, May 15, 2019 at 01:22:06PM +0200, Paul Menzel wrote:
>> Building a Linux kernel (like 5.1.2) on a 128 thread AMD EPYC server with 126, 127,
>> or 128 threads *sometimes* the server becomes unusable and logging in over network
>> is not possible anymore. Only logging in over tty1 works, and the server needs to
>> be rebooted.
>>
>> ```
>> [    0.000000] Linux version 4.19.19.mx64.244 (root@...internet.molgen.mpg.de) (gcc version 7.3.0 (GCC)) #1 SMP Tue Feb 5 13:01:13 CET 2019
>> […]
>> [2418051.367223] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> [2418051.367231] rcu: 	30-...0: (1 GPs behind) idle=4c2/1/0x4000000000000000 softirq=187416321/187416325 fqs=14323 
>> [2418051.367235] rcu: 	94-...0: (1 GPs behind) idle=bba/1/0x4000000000000000 softirq=187177539/187177544 fqs=14323 
>> [2418051.367236] rcu: 	(detected by 2, t=60002 jiffies, g=298982765, q=7633949)
>> [2418051.367254] Sending NMI from CPU 2 to CPUs 30:
>> [2418061.370201] Sending NMI from CPU 2 to CPUs 94:
> 
> Huh.  Neither CPU 30 nor CPU 94 responded to the NMI.  This usually means
> that either NMIs aren't working or that the target CPUs are so deeply
> in trouble that they cannot respond to NMIs.  One historic reason that
> the CPUs could be so deeply in trouble would be if the stack pointer
> started referencing unmapped memory, but I have no idea whether that
> applies to your particular CPUs.
> 
> For whatever it is worth, the most extreme case of a CPU being in trouble
> was once long ago when the CPU simply failstopped, so that it was no
> longer executing instructions at all.
> 
> On trick that might (or might not) get you more information is to force
> RCU to dump the stack remotely instead of sending NMIs.  Here is an
> (untested) patch that should do the trick:
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index acd6ccf56faf..a4225c2f24f4 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1220,8 +1220,7 @@ static void rcu_dump_cpu_stacks(void)
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		for_each_leaf_node_possible_cpu(rnp, cpu)
>  			if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
> -				if (!trigger_single_cpu_backtrace(cpu))
> -					dump_cpu_task(cpu);
> +				dump_cpu_task(cpu);
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  	}
>  }
> 
> Please let me know how it goes!
[…]
Sorry for the late reply. I do not know how to reproduce the problem,
and haven’t applied your patch yet.
Does it just change some logging, so is it safe to use in production?
Stalls were seen again on an Supermicro 2023US-TR4 with two 64 thread
AMD EPYC 7601 with Linux 5.4.5. (I am pretty sure also on a Dell
PowerEdge R7425 with two 64 thread AMD EPYC 7601 with Linux 4.19.57.)
[333028.288747] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[333028.288755] rcu: 	30-...0: (0 ticks this GP) idle=316/1/0x4000000000000000 softirq=179289932/179289932 fqs=13755 
[333028.288772] 	(detected by 112, t=60002 jiffies, g=81535941, q=52030406)
[333028.288777] Sending NMI from CPU 112 to CPUs 30:
[333038.235272] rcu: rcu_sched kthread starved for 9945 jiffies! g81535941 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=56
[333038.235273] rcu: RCU grace-period kthread stack dump:
[333038.235276] rcu_sched       R  running task        0    11      2 0x80004000
[333038.235280] Call Trace:
[333038.235294]  ? reschedule_interrupt+0xa/0x20
[333038.235299]  ? rcu_cleanup_dead_rnp+0x70/0x70
[333038.235303]  ? _raw_spin_unlock_irqrestore+0xa/0x10
[333038.235304]  ? force_qs_rnp+0x11d/0x140
[333038.235306]  ? rcu_gp_kthread+0x61a/0xed0
[333038.235307]  ? __schedule+0x228/0x6c0
[333038.235312]  ? __wake_up_common+0x74/0x120
[333038.235314]  ? kfree_call_rcu+0x10/0x10
[333038.235317]  ? kthread+0x117/0x130
[333038.235319]  ? kthread_create_worker_on_cpu+0x70/0x70
[333038.235320]  ? ret_from_fork+0x22/0x40
[333208.292343] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[333208.292350] rcu: 	30-...0: (0 ticks this GP) idle=316/1/0x4000000000000000 softirq=179289932/179289932 fqs=53333 
[333208.292367] 	(detected by 25, t=240007 jiffies, g=81535941, q=216698517)
[333208.292370] Sending NMI from CPU 25 to CPUs 30:
[333218.297650] rcu: rcu_sched kthread starved for 10003 jiffies! g81535941 f0x2 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=45
[333218.297652] rcu: RCU grace-period kthread stack dump:
[333218.297653] rcu_sched       R  running task        0    11      2 0x80004000
[333218.297656] Call Trace:
[333218.297667]  ? reschedule_interrupt+0xa/0x20
[333218.297670]  ? rcu_cleanup_dead_rnp+0x70/0x70
[333218.297673]  ? _raw_spin_unlock_irqrestore+0xa/0x10
[333218.297675]  ? force_qs_rnp+0x11d/0x140
[333218.297676]  ? rcu_gp_kthread+0x61a/0xed0
[333218.297677]  ? __schedule+0x228/0x6c0
[333218.297680]  ? __wake_up_common+0x74/0x120
[333218.297682]  ? kfree_call_rcu+0x10/0x10
[333218.297685]  ? kthread+0x117/0x130
[333218.297686]  ? kthread_create_worker_on_cpu+0x70/0x70
[333218.297688]  ? ret_from_fork+0x22/0x40
If it’s safe for production, I’d apply your patch to our Linux builds
to get more information the next time this happens.
Kind regards,
Paul
View attachment "amd-epyc-7601-linux-5.4.5-messages-nmi-errors.txt" of type "text/plain" (181946 bytes)
Download attachment "smime.p7s" of type "application/pkcs7-signature" (5174 bytes)
Powered by blists - more mailing lists
 
