[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131113102153.5f10e6b5@gandalf.local.home>
Date: Wed, 13 Nov 2013 10:21:53 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Matthew Whitehead <tedheadster@...il.com>, john.stultz@...aro.org,
linux-kernel@...r.kernel.org, mwhitehe@...hat.com
Subject: Re: nohz problem with idle time on old hardware
On Wed, 13 Nov 2013 15:02:14 +0100 (CET)
Thomas Gleixner <tglx@...utronix.de> wrote:
> On Wed, 13 Nov 2013, Matthew Whitehead wrote:
>
> > I was testing the 3.12 kernel on some _old_ hardware and I uncovered a bug.
> > It arises when nohz=on and goes away with nohz=off. On a crusty dual Pentium-1
> > system that is completely idle, the sar utility reports 0% idle time on cpu0
> > and 100% idle on cpu1. Cpu0 _should_ also be reporting 100% idle, but instead
> > it reports around 75% system time and 25% user time.
> >
> > The problem was diagnosed by Steve Rostedt with help from John
> > Stultz. The old system declares the dual TSCs unstable, and backs
> > down to a timesource of refined-jiffies. Apparently refined-jiffies
> > and jiffies are not a usable timesourcefor nohz, but we don't check
> > for that case because most modern systems have several reliable
> > hardware timesources.
>
> Wrong.
>
> > John suggested that we turn off nohz unless a usable hardware timesource is
> > present.
>
> nohz already depends on two things:
>
> 1) A reliable clocksource which is valid for highres/nohz
>
> 2) A per cpu clockevent device which supports one shot mode.
>
> and those are evaluated at runtime before we switch into NOHZ mode.
>
> And neither jiffies nor refined-jiffies qualify as valid clocksource.
>
> So there is something else wrong.
>
Perhaps the system isn't actually going into real nohz mode, but the
nohz code is screwing up the accounting.
Here's the facts:
1) We start out using the tsc time source
2) Later in boot, the tsc is declared unstable and we switch to
refined jiffies.
3) CPU0's time accounting is messed up as it shows no idle time
4) If we switch back to tsc, idle time appears
5) If we boot with nohz=off, idle time appears
Thus, it definitely has to do with accounting when we have nohz enabled
and refined-jiffies as our timesource. Why else will disabling nohz fix
the issue?
Here's a trace when things were bad:
trace-cmd record -p function -l '*nohz*' -l account_process_tick -e sched_switch
rcu_sche-9 0d... 6858.618033: sched_switch: rcu_sched:9 [120] S ==> swapper/0:0 [120]
<idle>-0 0.... 6858.618082: function: tick_nohz_idle_enter <-- cpu_startup_entry
<idle>-0 0d... 6858.618101: function: __tick_nohz_idle_enter <-- tick_nohz_idle_enter
<idle>-0 0d.s. 6858.621499: function: tick_nohz_stop_idle <-- tick_check_idle
<idle>-0 0d.h. 6858.621550: function: account_process_tick <-- update_process_times
<idle>-0 0d... 6858.621769: function: tick_nohz_irq_exit <-- irq_exit
<idle>-0 0d... 6858.621787: function: __tick_nohz_idle_enter <-- irq_exit
<idle>-0 0d.s. 6858.625500: function: tick_nohz_stop_idle <-- tick_check_idle
<idle>-0 0d.h. 6858.625574: function: account_process_tick <-- update_process_times
<idle>-0 0d... 6858.625818: function: tick_nohz_irq_exit <-- irq_exit
<idle>-0 0d... 6858.625847: function: __tick_nohz_idle_enter <-- irq_exit
<idle>-0 0d.s. 6858.629295: function: tick_nohz_stop_idle <-- tick_check_idle
<idle>-0 0d... 6858.629351: function: tick_nohz_irq_exit <-- irq_exit
<idle>-0 0d... 6858.629373: function: __tick_nohz_idle_enter <-- irq_exit
<idle>-0 0d.s. 6858.629503: function: tick_nohz_stop_idle <-- tick_check_idle
<idle>-0 0d.h. 6858.629569: function: account_process_tick <-- update_process_times
<idle>-0 0d... 6858.629851: function: tick_nohz_irq_exit <-- irq_exit
<idle>-0 0d... 6858.629881: function: __tick_nohz_idle_enter <-- irq_exit
<idle>-0 0d.s. 6858.630412: function: tick_nohz_stop_idle <-- tick_check_idle
<idle>-0 0.N.. 6858.630550: function: tick_nohz_idle_exit <-- cpu_startup_entry
<idle>-0 0d... 6858.630605: sched_switch: swapper/0:0 [120] R ==> rcu_sched:9 [120]
I'm not saying that we are actually getting into nohz, but something
with the nohz code is messing with cpu accounting.
I haven't added any trace_printks(), I'm only debugging with what comes
default with the kernel trace facility.
You can get the dmesg here: http://rostedt.homelinux.com/private/dmesg-jiffies-nohz
And I've attached the config.
-- Steve
Download attachment "config-3.12.0-rc6+" of type "application/octet-stream" (134840 bytes)
Powered by blists - more mailing lists