Message-ID: <20151130171918.GR26643@linux.vnet.ibm.com>
Date:	Mon, 30 Nov 2015 09:19:18 -0800
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	fupan li <lifupan@...il.com>
Cc:	linux-rt-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: RCU stall and the system boot hang

On Mon, Nov 30, 2015 at 02:54:13PM +0800, fupan li wrote:
> 2015-11-29 14:05 GMT+08:00 Paul E. McKenney <paulmck@...ux.vnet.ibm.com>:
> 
> > On Sun, Nov 29, 2015 at 12:46:10PM +0800, fupan li wrote:
> > > 2015-11-28 22:53 GMT+08:00 Paul E. McKenney <paulmck@...ux.vnet.ibm.com>:
> > >
> > > > On Sat, Nov 28, 2015 at 01:05:52PM +0800, fupan li wrote:
> > > > > 2015-11-28 0:28 GMT+08:00 Paul E. McKenney <paulmck@...ux.vnet.ibm.com>:
> > > > >
> > > > > > On Fri, Nov 27, 2015 at 08:23:24PM +0800, fupan li wrote:
> > > > > > > Hi, Paul
> > > > > > >
> > > > > > > On my Wildcat_Pass (Haswell) board, the system boot hangs as shown below.
> > > > > > > The kernel is a preempt-rt kernel.
> > > > > > > It is not reproduced on every boot, roughly once every 5-10 boots.
> > > > > >
> > > > > > CCing LKML and linux-rt-users in case anyone else is seeing this.
> > > > > >
> > > > > > OK, let's take a look at the stall warning...
> > > > > >
> > > > > > [ . . . ]
> > > > > >
> > > > > > > Btrfs loaded
> > > > > > > console [netcon0] enabled
> > > > > > > netconsole: network logging started
> > > > > > > rtc_cmos 00:00: setting system clock to 2015-11-11 18:18:03 UTC (1447265883)
> > > > > > > usb 1-9: new full-speed USB device number 3 using xhci_hcd
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
> > > > > > > usb 1-9: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
> > > > > > > input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0000:00/0000:00:14.0/usb1/1-9/1-9:1.0/0003:046B:FF10.0002/input/input3
> > > > > > > hid-generic 0003:046B:FF10.0002: input: USB HID v1.10 Keyboard [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0000:00:14.0-9/input0
> > > > > > > input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0000:00/0000:00:14.0/usb1/1-9/1-9:1.1/0003:046B:FF10.0003/input/input4
> > > > > > > hid-generic 0003:046B:FF10.0003: input: USB HID v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0000:00:14.0-9/input1
> > > > > > > ixgbe 0000:02:00.0: registered PHC device on eth2
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
> > > > > > > ixgbe 0000:02:00.1: registered PHC device on eth3
> > > > > > > IPv6: ADDRCONF(NETDEV_UP): eth3: link is not ready
> > > > > > > igb 0000:04:00.0 eth0: igb: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
> > > > > > > IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > Sending DHCP requests ., OK
> > > > > > > random: nonblocking pool is initialized
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=f48/0/0 softirq=0/0 fqs=0
> > > > > > > 17: (59 GPs behind) idle=55a/0/0 softirq=0/0 fqs=0
> > > > > >
> > > > > > So CPUs 12 and 17 are stalling the grace period.
> > > > > >
> > > > > > > (detected by 5, t=21002 jiffies, g=-237, c=-238, q=136)
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859037eb8 ffffffff810b7ff5 ffff880859037ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 000000000000000c ffffffff82140220
> > > > > > >  ffff880859034000 ffff880859037ea8 ffffffff8140f1f7 ffff880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > >
> > > > > > But CPU 12 seems to be idle (as is in fact indicated by the
> > > > > > "idle=55a/0/0" above), so the grace-period kthread should have
> > > > > > reported a quiescent state on its behalf.
> > > > > >
> > > > > > > Task dump for CPU 17:
> > > > > > > swapper/17      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859063eb8 ffffffff810b7ff5 ffff880859063ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 0000000000000011 ffffffff82140220
> > > > > > >  ffff880859060000 ffff880859063ea8 ffffffff8140f1f7 ffff880859063ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > >
> > > > > > And the same for CPU 17.
> > > > > >
> > > > > > > rcu_preempt kthread starved for 21002 jiffies!
> > > > > >
> > > > > > But the grace-period kthread hasn't had a chance to run for more
> > > > > > than 21 seconds, which explains why the grace period is not ending.
> > > > > >
> > > > > > Are you starting up a heavy real-time workload at boot time?
> > > > >
> > > > > No, just a normal boot, and these stalls happened before the
> > > > > systemd services started running.
> > > >
> > > > Interesting.  My testing shows v4.1 being OK, with the first issues
> > > > showing up somewhere between v4.1 and v4.2.  Or at least v4.1 is
> > > > reliable enough that it passes 42 hours of focused rcutorture testing,
> > > > where v4.2 tends to fail in under two hours.  And it seems to happen
> > > > only on multisocket systems -- I seem to be able to hammer as hard as
> > > > I want on my four-core (eight-hardware-thread) laptop without an issue.
> > > >
> > > > So it might be something between v4.1 and
> > > > v4.1.12-rt8-8.0.0.0_preempt-rt.  Could you please test commits in
> > > > that range?
> > > >
> > > Hmmm, that's a narrow range; I'll try to bisect between them and find
> > > the key commit.
> >
> > Please let me know how it goes!
> >
> I'm still working it.

I know that feeling...
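
If it helps, the mechanical part of the bisection might look something like
the sketch below (assuming the v4.1.12-rt8-8.0.0.0_preempt-rt kernel you are
booting is available as a tag or branch in your git tree; since the hang
reproduces only about once per 5-10 boots, each step needs several clean
boots before you can trust a "good" verdict):

	git bisect start
	git bisect bad v4.1.12-rt8-8.0.0.0_preempt-rt   # kernel that sometimes hangs at boot
	git bisect good v4.1                            # kernel that boots reliably
	# Build and boot the kernel git checks out, repeat the boot roughly
	# ten times, then record the result and let git pick the next commit:
	git bisect good      # or "git bisect bad" if the hang showed up
	# When finished:
	git bisect reset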

> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Calling CRDA to update world regulatory domain
> > > > > > > cfg80211: Exceeded CRDA call max attempts. Not calling CRDA
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=04c/0/0 softirq=0/0 fqs=1
> > > > > >
> > > > > > At this point, CPU 17's quiescent state has been recorded.
> > > > > >
> > > > > > > (detected by 24, t=84007 jiffies, g=-237, c=-238, q=147)
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859037eb8 ffffffff810b7ff5 ffff880859037ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 000000000000000c ffffffff82140220
> > > > > > >  ffff880859034000 ffff880859037ea8 ffffffff8140f1f7 ffff880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > >
> > > > > > CPU 12 still looks idle.  Though it does appear to have
> > > > > > transitioned between idle and non-idle several times.
> > > > > >
> > > > > > > rcu_preempt kthread starved for 63005 jiffies!
> > > > > >
> > > > > > And the last time the grace-period kthread ran was about the time
> > > > > > of the first stall-warning message.  Strange...
> > > > > >
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=0a8/0/0 softirq=0/0 fqs=0
> > > > > > > 17: (60 GPs behind) idle=6b4/0/0 softirq=0/0 fqs=0
> > > > > > > (detected by 11, t=21002 jiffies, g=-236, c=-237, q=151)
> > > > > >
> > > > > > And the old grace period (g=-237, c=-238) finished and a new one
> > > > > > has been stalled for 21 seconds.  Again, CPUs 12 and 17 are idle
> > > > > > and stalling the grace period.
> > > > > >
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859037eb8 ffffffff810b7ff5 ffff880859037ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 000000000000000c ffffffff82140220
> > > > > > >  ffff880859034000 ffff880859037ea8 ffffffff8140f1f7 ffff880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > > > Task dump for CPU 17:
> > > > > > > swapper/17      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859063eb8 ffffffff810b7ff5 ffff880859063ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 0000000000000011 ffffffff82140220
> > > > > > >  ffff880859060000 ffff880859063ea8 ffffffff8140f1f7 ffff880859063ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > >
> > > > > > The stack traces are consistent with idle CPUs.
> > > > > >
> > > > > > > rcu_preempt kthread starved for 21002 jiffies!
> > > > > >
> > > > > > And the grace-period kthread apparently hasn't gotten a chance to
> > > > > > run at all during the grace period.  Strange...  It should get
> > > > > > awakened by a timeout every three jiffies or so.
> > > > > >
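
(At HZ=1000, the 21002-jiffy starvation above is the 21 seconds already
mentioned, so on the order of seven thousand of those three-jiffy wakeup
opportunities passed without the kthread running.)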
> > > > > Is it possible that the grace-period kthread was bound to a
> > > > > no_hz CPU and could not be awakened by timer interrupts?
> > > >
> > > > Timer interrupts should have no problem awakening tasks bound to no_hz
> > > > CPUs.  But how many non-no_hz CPUs do you have?  If there is only the
> > > > boot CPU, that boot CPU might be horribly overloaded, which could
> > > > result in stalls.
> > > >
> > > I don't know how to check the number of non-no_hz CPUs; is there any
> > > way to specify the number of non-no_hz CPUs at boot time?
> >
> > If you don't specify anything at boot and if you didn't build with
> > CONFIG_NO_HZ_FULL_ALL=y, then all CPUs are non-no_hz CPUs.  But either
> > way, they are reported during early boot.  And in your earlier dmesg,
> > none were reported.  But please send your .config file.
> >
> Hi, Paul
> 
> I attached my config file, please have a look.

I see no sign of NO_HZ_FULL.
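
For completeness, here is a quick way to double-check on a boot that
succeeds (a minimal sketch, not specific to your setup):

	# dynticks-related configuration in the running kernel:
	grep -E 'CONFIG_NO_HZ|CONFIG_RCU_NOCB' /boot/config-$(uname -r)
	# any nohz_full= or rcu_nocbs= boot parameters:
	cat /proc/cmdline
	# if any CPUs really were full-dynticks, early boot would report them:
	dmesg | grep -i 'full dynticks'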

> > > I don't understand why only some of the CPUs stall, especially idle
> > > CPUs.  The system shouldn't hang, since there are so many other CPUs
> > > that should be able to do the work, but it hangs there and doesn't
> > > continue to boot the system.
> >
> > Well, bugs should never happen, should they?  But they do, and they
> > need to be tracked down and fixed.
> >
> > Is the system responsive to sysrq?  If so, sysrq-T could help get
> > useful information.
> >
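
(For reference, a generic way to request that task dump, assuming magic
sysrq is enabled in the config and permitted by kernel.sysrq:)

	# from a shell, if one is still usable:
	echo t > /proc/sysrq-trigger
	# from a local keyboard: Alt-SysRq-t
	# from a serial console: send a BREAK, then 't'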
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=1aa/0/0 softirq=0/0 fqs=1
> > > > > > > (detected by 0, t=84007 jiffies, g=-236, c=-237, q=159)
> > > > > >
> > > > > > The same grace period is still stalled after a total of 84 seconds,
> > > > > > and again CPU 17's quiescent state has been reported.
> > > > > >
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859037eb8 ffffffff810b7ff5 ffff880859037ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 000000000000000c ffffffff82140220
> > > > > > >  ffff880859034000 ffff880859037ea8 ffffffff8140f1f7 ffff880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > >
> > > > > > And again, the stack is consistent with the CPU being idle.
> > > > > >
> > > > > > > rcu_preempt kthread starved for 63005 jiffies!
> > > > > >
> > > > > > And again the grace-period kthread seems to have gotten a chance to
> > > > > > run during the first RCU CPU stall warning, but not since then.
> > > > > >
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=20c/0/0 softirq=0/0 fqs=0
> > > > > > > 17: (61 GPs behind) idle=810/0/0 softirq=0/0 fqs=0
> > > > > > > (detected by 40, t=21002 jiffies, g=-235, c=-236, q=162)
> > > > > >
> > > > > > And again, it looks like the stall warning gave the grace-period
> > > > > > kthread another chance to run, thus completing the earlier grace
> > > > > > period.  We now have another grace period stalled for 21 seconds,
> > > > > > again by CPUs 12 and 17.
> > > > > >
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859037eb8 ffffffff810b7ff5 ffff880859037ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 000000000000000c ffffffff82140220
> > > > > > >  ffff880859034000 ffff880859037ea8 ffffffff8140f1f7 ffff880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > > > Task dump for CPU 17:
> > > > > > > swapper/17      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859063eb8 ffffffff810b7ff5 ffff880859063ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 0000000000000011 ffffffff82140220
> > > > > > >  ffff880859060000 ffff880859063ea8 ffffffff8140f1f7 ffff880859063ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > > > rcu_preempt kthread starved for 21002 jiffies!
> > > > > >
> > > > > > And again idle stacks, and again the RCU grace-period kthread has
> > > > > > not been allowed to run at all since the grace period started.
> > > > > >
> > > > > > > INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > > > 12: (0 ticks this GP) idle=30e/0/0 softirq=0/0 fqs=1
> > > > > > > (detected by 5, t=84007 jiffies, g=-235, c=-236, q=170)
> > > > > > > Task dump for CPU 12:
> > > > > > > swapper/12      R  running task        0     0      1 0x00200000
> > > > > > >  ffffffff8140f1f7 ffff880859037eb8 ffffffff810b7ff5 ffff880859037ea8
> > > > > > >  ffffffff810759ef ffffffff82140220 000000000000000c ffffffff82140220
> > > > > > >  ffff880859034000 ffff880859037ea8 ffffffff8140f1f7 ffff880859037ec8
> > > > > > > Call Trace:
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff810b7ff5>] ? rcu_eqs_enter_common+0x85/0x240
> > > > > > >  [<ffffffff810759ef>] ? __atomic_notifier_call_chain+0x4f/0x70
> > > > > > >  [<ffffffff8140f1f7>] ? debug_smp_processor_id+0x17/0x20
> > > > > > >  [<ffffffff8100dd52>] ? mwait_idle+0x42/0x1d0
> > > > > > >  [<ffffffff8100ea8f>] ? arch_cpu_idle+0xf/0x20
> > > > > > >  [<ffffffff8109688e>] ? cpu_startup_entry+0xae/0x300
> > > > > > >  [<ffffffff8103aabc>] ? start_secondary+0x12c/0x140
> > > > > > > rcu_preempt kthread starved for 63005 jiffies!
> > > > > >
> > > > > > And again, CPU 17's quiescent state has been recorded, and the
> > > > > > grace-period kthread hasn't been allowed to run since the last
> > > > > > RCU CPU stall warning.
> > > > > >
> > > > > > When a stall warning is printed, there is also a resched_cpu() on
> > > > > > the holdout CPU.  Perhaps that is also giving the grace-period
> > > > > > kthread a chance to run?  Alternatively, perhaps the overhead of
> > > > > > all the printing is letting the grace-period kthread run?
> > > > > >
> > > > > > I do sometimes see similar stalls, but under heavy rcutorture load
> > > > > > and when running qemu/kvm.  I am considering this to be a bug
> > > > > > (somewhere!) and am tracking it down, but I suspect that your
> > > > > > boot-time stalls are from something different.  Yours is the only
> > > > > > report of this that I have seen thus far.
> > > > > >
> > > > > > So what are you starting up at boot time?
> > > > >
> > > > > This is just a normal boot, and I didn't run any special heavy-load
> > > > > tasks.  BTW, I tried setting the RCU kthread's priority to 99 on the
> > > > > boot command line, but the stall still happened sometimes.
> > > >
> > > > What priority are the softirq kthreads running at?  RCU depends on them
> > > > to make forward progress as well.
> > > >
> > > Since the system hangs there, I don't know how to check their
> > > priority.
> >
> > Well, you said that the system only hangs sometimes.  So when it does
> > boot successfully, please check their priority.
> 
> It is 19.

OK, so whatever is blocking this should be running at prio 20 or higher.
Or there is some other bug involved.
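
For example, on a boot that comes up OK, something like this would show the
relevant priorities and, if you want to experiment, let you adjust one of
them (a rough sketch; the priority values are only examples):

	# scheduling policy and RT priority of the softirq and RCU kthreads:
	ps -eo pid,class,rtprio,ni,comm | grep -E 'ksoftirqd|rcu'
	# temporarily raise one ksoftirqd thread to SCHED_FIFO priority 20
	# for testing (substitute the PID reported above):
	chrt -f -p 20 <ksoftirqd-pid>
	# the RCU grace-period kthreads can also be boosted from the boot
	# command line, e.g. rcutree.kthread_prio=50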

							Thanx, Paul

