Message-Id: <20160802204504.GL3482@linux.vnet.ibm.com>
Date: Tue, 2 Aug 2016 13:45:04 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Rich Felker <dalias@...c.org>
Cc: linux-kernel@...r.kernel.org, john.stultz@...aro.org,
tglx@...utronix.de
Subject: Re: rcu_sched stalls in idle task introduced in pre-4.8?
On Tue, Aug 02, 2016 at 04:32:17PM -0400, Rich Felker wrote:
> On Tue, Aug 02, 2016 at 12:48:02PM -0700, Paul E. McKenney wrote:
> > > > > The system I'm testing on is an SMP configuration of the J2 SoC on
> > > > > FPGA (see: j-core.org), and the SMP support and timer driver (most
> > > > > relevant, imo) are new, but never exhibited this problem on 4.6 or
> > > > > 4.7. A tree with all drivers merged is available in the development
> > > > > branch of git://git.libc.org/linux-sh; it's regularly rebased but
> > > > > there's a perma-tag devel-20160801 you can check out instead. Fetching
> > > > > into an existing clone of Linus's tree is recommended since the server
> > > > > has limited resources.
> > > > >
> > > > > I doubt the code from my devel branch will be useful without hardware
> > > > > to test on, though. Hopefully the issue is reproducible on other
> > > > > systems. Very low cpu speed (50 MHz) could be a factor. The problem
> > > > > seems to go away with dyntick disabled.
> > > >
> > > > Indeed, it is quite possible that RCU's self-tuning needs a bit of help
> > > > for CPUs below a few hundred MHz. As might other pieces of the kernel.
> > > > Should not be a big deal (famous last words...). However, given that
> > > > you are running HZ=100, it should be reasonably close. So first, let's
> > > > see what is going on here...
> > > >
> > > > The following dmesg line usually indicates that RCU's kernel thread is
> > > > stuck for some reason.
> > > >
> > > > [42039.188293] rcu_sched kthread starved for 2152 jiffies! g70249 c70248 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > > >
> > > > The RCU_GP_WAIT_FQS indicates that it is waiting for a timer to expire.
> > > > The task is in TASK_INTERRUPTIBLE state, but it should have been awakened
> > > > after only a few jiffies sleep time. The stack trace looks like there
> > > > might be trouble canceling the timer.
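
As an aside, my reading of the fields in that message (the 4.8-era
format, so take it with a grain of salt) is roughly:

	g70249              number of the last grace period started
	c70248              number of the last grace period completed
	                    (g == c+1, so a grace period is in flight)
	f0x0                the ->gp_flags value
	RCU_GP_WAIT_FQS(3)  the kthread's ->gp_state: waiting for the
	                    force-quiescent-state timer
	->state=0x1         the kthread is in TASK_INTERRUPTIBLE
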
> > >
> > > Can you offer any advice on tracking down what it's waiting for, or
> > > what's going on with the timer? Is there any arch- or clockevent-specific
> > > stuff involved here that might be my fault?
> >
> > It is in an swait_event_interruptible_timeout(), most likely waiting on
> > a one-jiffy timer, as in __mod_timer(). If I understood your earlier
> > email correctly, setting CONFIG_HZ_PERIODIC avoided this problem.
> > If my understanding is correct, it might be worth reviewing any timer
> > code that might be pushed along by the once-per-jiffy scheduler tick.
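
For reference, the wait in question has roughly the following shape
(a simplified sketch of the grace-period kthread's loop in
kernel/rcu/tree.c, not the exact upstream code; "gp_is_done" is a
placeholder for the real end-of-grace-period checks):

	j = jiffies_till_first_fqs;
	for (;;) {
		/* Sleep until awakened or until the j-jiffy timer fires. */
		ret = swait_event_interruptible_timeout(rsp->gp_wq,
				rcu_gp_fqs_check_wake(rsp, &gf), j);
		if (gp_is_done)			/* placeholder */
			break;
		rcu_gp_fqs(rsp, first_time);	/* scan for holdout CPUs */
		j = jiffies_till_next_fqs;
	}

If the few-jiffy timer backing that swait_event_interruptible_timeout()
fails to fire or fires late, the kthread just keeps sleeping, which is
consistent with the starvation message above.
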
>
> Is it possible that something wacky is going on with broadcast timers?
> A dummy broadcast timer device gets initialized momentarily before the
> percpu timer for the secondary cpu is initialized, and in my
> experience these broadcast timers don't work when the real timer
> device has a cpumask limiting it to a specific cpu (which seems like a
> bug, but one I haven't had to care about since I'm not using it).
I need to defer to the timer guys on this one, adding them on CC.
> > Does the problem reproduce easily?
>
> Yes, it happens right after boot and repeats every 30-90 seconds or
> so.
Well, that at least makes it easier to test any patches!
> > A bisection might be very helpful.
>
> Bisection would require some manual work to set up because the whole
> reason I was rebasing on Linus's tree was to adapt the drivers to
> upstream infrastructure changes (the new cpuhp stuff replacing
> notifier for cpu starting). The unfortunate way it was done, each
> driver adds an enum to linux/cpuhotplug.h so all the patches have
> gratuitous conflicts. In addition, for older revisions in Linus's
> tree, there's at least one show-stopping (hang during boot) bug that
> needs a cherry-pick to fix. There may be other small issues too. I
> don't think they're at all insurmountable, but it requires an annoying
> amount of scripting.
I had to ask! Might eventually be necessary, but let's see what we
can learn from what you currently have.
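
(If we do end up bisecting, the mechanical part would be something
like the following, with the real work hidden in a script that
reapplies your driver stack and the needed cherry-pick at each step;
the script name is of course made up:

	git bisect start
	git bisect bad devel-20160801
	git bisect good v4.7
	git bisect run ./apply-drivers-and-test.sh

but as you say, the scripting is the annoying part.)
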
> > > > Is it possible that the timer
> > > > handler is being prevented from making progress due to heavy interrupt
> > > > load or some such?
> > >
> > > Actually the stall messages are only appearing when the system is idle
> > > and interrupts are infrequent. That's what leads me to believe it
> > > might be a situation where, on every (infrequent) wake, both the rcu
> > > task and something blocking its progress happen at roughly the same
> > > time, and the slowness of the cpu leads the rcu task to decide to wait
> > > for the next timer/schedule event and try again -- over and over. Of
> > > course this hypothesis is just based on a very high-level
> > > understanding of what's going on; I'm not sufficiently familiar with
> > > the implementation specifics to offer more.
> > >
> > > One observation I just made versus old kernels where this was not
> > > happening: I've got a lot more timer interrupts happening than before.
> > > In 6 minutes of uptime:
> > >
> > >            CPU0       CPU1
> > >  17:         38          0  AIC  17  abce0000.ethernet
> > >  18:        811          0  AIC  18  uartlite
> > >  20:        188        121  AIC  20  ipi
> > >  72:      11295       6499  AIC  72  jcore_pit
> > >
> > > That's 30+ timer interrupts per sec on cpu0. Previously it was more
> > > like 1-5 per second when idle (with dyntick).
> >
> > That is interesting, but I don't know what to make of it. RCU does depend
> > heavily on timed sleeps to make forward progress, so if timer interrupts
> > were lost, that would be a smoking gun. But you actually appear to have
> > -more- timer interrupts in the failure case rather than fewer!
>
> Could the extra interrupts be a symptom of the rcu task doing a
> very-short sleep to retry over and over?
Possibly. One way to check for this is to set the kernel boot variables
rcutree.jiffies_till_sched_qs=10 and rcutree.jiffies_till_first_fqs=10.
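If your kernel exposes them, you can also confirm at runtime that the
values took effect; they are ordinary module parameters, so something
like this should work:

	cat /sys/module/rcutree/parameters/jiffies_till_first_fqs
	cat /sys/module/rcutree/parameters/jiffies_till_sched_qs
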
> > > > > My .config and message log are below. Let me know if there are any
> > > > > other details I can provide that would be helpful.
> > > >
> > > > In case the timer issue I am suspecting is a red herring...
> > > >
> > > > RCU does tune itself for different values of HZ, but it might need more
> > > > help for 50MHz CPUs. (Although if this were the problem, I would expect
> > > > you to be seeing excessive CPU consumption by the rcu_sched kthread.)
> > > > But it might be worthwhile booting with rcutree.jiffies_till_sched_qs=10
> > > > and rcutree.jiffies_till_first_fqs=10 (or thereabouts). It might also
> > > > be necessary to use a longer RCU CPU stall warning timeout (assuming
> > > > you don't just suppress it as above), for example, by booting with (say)
> > > > rcupdate.rcu_cpu_stall_timeout=60.
> > >
> > > With that in mind, it seems plausible that the problem existed before
> > > and just went over the threshold for reporting the stall with a
> > > non-bug change going into 4.8. Would building an older known-good
> > > kernel with TREE_RCU_TRACE enabled and checking debugfs
> > > rcu/rcu_sched/rcugp for max=??? be informative?
> >
> > It might be. However, if ftrace is enabled on your platform, getting
> > a trace of the RCU-related events would be even more likely to be
> > helpful. And for that matter, the timer-related events.
>
> It looks like ftrace is supported on sh. Looks like I have some
> reading to do -- this could be very useful.
Sounds good!
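
In case it saves you some digging, the quick way to grab the RCU and
timer events is something like this (assuming debugfs is mounted in
the usual place; paths from memory, so double-check):

	echo 1 > /sys/kernel/debug/tracing/events/rcu/enable
	echo 1 > /sys/kernel/debug/tracing/events/timer/enable
	echo 1 > /sys/kernel/debug/tracing/tracing_on
	# ... wait for a stall message ...
	cat /sys/kernel/debug/tracing/trace > /tmp/rcu-trace.txt

The trace buffer wraps, so on a slow box you might also want to bump
/sys/kernel/debug/tracing/buffer_size_kb.
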
> > One way to determine whether we were just on the edge on
> > the older systems would be to boot them with smaller values of
> > rcupdate.rcu_cpu_stall_timeout. The default values of the other two
> > boot parameters look to already be at the minimum useful value, though.
>
> CONFIG_RCU_CPU_STALL_TIMEOUT=21 seems to be what it's hitting.
Confused. If you set CONFIG_RCU_CPU_STALL_TIMEOUT to smaller values on
older kernels, do you see the RCU CPU stall warnings?
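
Note that rcupdate.rcu_cpu_stall_timeout is also a writable module
parameter, so if it is exposed on your kernel you can adjust it at
runtime rather than rebooting:

	echo 60 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
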
Thanx, Paul