Message-Id: <20171014125116.GA8791@linux.vnet.ibm.com>
Date:   Sat, 14 Oct 2017 05:51:16 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Wang YanQing <udknight@...il.com>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Bug report for RCU stalled warning [3.10.69]

On Thu, Oct 12, 2017 at 01:38:24PM -0700, Paul E. McKenney wrote:
> [ Adding LKML on CC so that others can find this. ]
> 
> On Wed, Oct 11, 2017 at 12:21:39PM +0800, Wang YanQing wrote:
> > Hi, Paul McKenney.
> > 
> > I have received many reports of machines that stopped responding. After
> > rebooting and inspecting the messages, all of them show RCU stalls, but I
> > can't figure out how to fix it. The painful point is that I can't update
> > the kernel, so I need to fix it in 3.10. I have attached four messages
> > that come from different CPUs and boards (so I guess it is a bug rather
> > than a hardware fault). Any suggestion is welcome.
> 
> The first step is of course to report this to your distro, as they are
> the ones who do the care and feeding of such old kernels.  Please include
> the information below in that report, as it might help your distro find
> and fix the problem.
> 
> It looks like the stalled CPU is idle, and that the activity resulting
> from the stall-warning message gets things going again.  Callbacks are
> being processed, so no OOM.  But you are getting the splat every 60
> seconds.  The system has only two CPUs, and is x86.
> 
> If you cannot upgrade the kernel, my ability to help is limited.  And the
> diagnostics printed with the v3.10 CPU stall warnings are also quite
> limited.  However, there are some things you could try as workarounds:
> 
> 1.	Check to make sure that the rcu_sched kthread is getting
> 	the CPU time that it needs.  Preventing this kthread from
> 	running would create exactly this output, assuming that
> 	the stall warning got it going again temporarily.
> 
> 2.	It looks like the disturbance of the RCU CPU stall warning
> 	is getting things going again.  Try artificially providing
> 	this disturbance, for example, by running a usermode program
> 	or script that runs on each CPU in turn, then sleeps for
> 	(say) five seconds.
> 
> 3.	If you can reconfigure your kernel, try building with
> 	CONFIG_RCU_FAST_NO_HZ=n.

And if you can reconfigure your kernel, then in v3.10, building with
CONFIG_RCU_CPU_STALL_INFO and CONFIG_RCU_CPU_STALL_VERBOSE will provide
more information on the CPUs and tasks stalling the grace period.

							Thanx, Paul
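
As a rough sketch, the .config lines for workaround 3 together with the
two diagnostic options might look as follows.  This assumes a v3.10 tree
in which all three symbols are selectable for your RCU flavor;
CONFIG_RCU_CPU_STALL_VERBOSE in particular may only be offered on
preemptible-RCU builds, so check your own Kconfig before relying on it.

```
# Workaround 3: disable RCU_FAST_NO_HZ
# CONFIG_RCU_FAST_NO_HZ is not set

# Extra stall-warning diagnostics
CONFIG_RCU_CPU_STALL_INFO=y
CONFIG_RCU_CPU_STALL_VERBOSE=y
```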

> 4.	Was the system running reliably on some earlier version?
> 	If so, consider reverting back to that version, and include
> 	the version information in your report to your distro.  If
> 	your distro provides individual patches, you should consider
> 	bisecting so as to locate the offending patch.
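
One possible way to carry out the check in workaround 1 is sketched
below.  It looks up the kthread by name in /proc and samples its
accumulated CPU time twice.  The helper names, the five-second interval,
and the assumption that the kthread's comm is exactly "rcu_sched" are
illustrative choices, not something taken from the report.

```python
import os
import time


def parse_stat(line):
    """Return (comm, state, utime, stime) from one /proc/<pid>/stat line."""
    # comm is parenthesized and may itself contain spaces or parentheses,
    # so split around the last ")" instead of splitting on whitespace.
    head, _, tail = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15,
    # both measured in clock ticks.
    return comm, fields[0], int(fields[11]), int(fields[12])


def find_task(name="rcu_sched"):
    """Return the PID of the first task whose comm equals name, or None."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    return int(entry)
        except OSError:
            continue  # the task exited while we were scanning
    return None


def read_stat(pid):
    with open("/proc/%d/stat" % pid) as f:
        return parse_stat(f.read())


if __name__ == "__main__":
    pid = find_task()
    if pid is None:
        raise SystemExit("no task named rcu_sched found (assumed name)")
    _, _, u0, s0 = read_stat(pid)
    time.sleep(5.0)  # arbitrary sampling interval
    _, state, u1, s1 = read_stat(pid)
    print("pid %d state %s ticks used in interval: %d"
          % (pid, state, (u1 - u0) + (s1 - s0)))
```

A kthread that stays in state R across samples while its tick count does
not advance would be consistent with it being kept off the CPU, though
that alone does not prove it.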
> 
> Good luck with it!
> 
> 							Thanx, Paul
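
The usermode program suggested in workaround 2 could be sketched roughly
as below.  It pins itself to each allowed CPU in turn, restores its
original affinity, and then sleeps.  The function name and the use of
Python's os.sched_setaffinity() are assumptions made for illustration;
a shell loop around taskset would serve equally well.

```python
import os
import time


def visit_each_cpu(cpus):
    """Pin this process to each CPU in turn; return the CPUs visited."""
    original = os.sched_getaffinity(0)
    visited = []
    try:
        for cpu in sorted(cpus):
            # Restricting the mask to a single CPU makes the kernel
            # migrate the calling thread there if it is running elsewhere.
            os.sched_setaffinity(0, {cpu})
            visited.append(cpu)
    finally:
        # Always put the original mask back, even on error.
        os.sched_setaffinity(0, original)
    return visited


def main(sleep_seconds=5.0):
    # Only CPUs this process is allowed to run on can be visited.
    allowed = os.sched_getaffinity(0)
    while True:
        visit_each_cpu(allowed)
        time.sleep(sleep_seconds)  # the "(say) five seconds" from the text


if __name__ == "__main__":
    main()
```

Whether this disturbance is enough to get grace periods moving again on
the affected systems is something only testing there can show.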
