netdev - Fw: [Bug 71351] New: "INFO: rcu_sched detected stalls on CPUs/tasks" on high server load


Date:	Sun, 2 Mar 2014 16:42:46 -0800
From:	Stephen Hemminger <stephen@...workplumber.org>
To:	netdev@...r.kernel.org
Subject: Fw: [Bug 71351] New: "INFO: rcu_sched detected stalls on
 CPUs/tasks" on high server load



Begin forwarded message:

Date: Sat, 1 Mar 2014 10:11:16 -0800
From: "bugzilla-daemon@...zilla.kernel.org" <bugzilla-daemon@...zilla.kernel.org>
To: "stephen@...workplumber.org" <stephen@...workplumber.org>
Subject: [Bug 71351] New: "INFO: rcu_sched detected stalls on CPUs/tasks" on high server load


https://bugzilla.kernel.org/show_bug.cgi?id=71351

            Bug ID: 71351
           Summary: "INFO: rcu_sched detected stalls on CPUs/tasks" on
                    high server load
           Product: Networking
           Version: 2.5
    Kernel Version: 3.10.22, 3.11, 3.13.5
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: shemminger@...ux-foundation.org
          Reporter: exa.exa@...il.com
        Regression: No

After upgrading the kernel on several of my machines from 3.6.9 to 3.13.3, I've
seen the following problem occur at random, after the machine has been running
for some time:

[ 5727.864173] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 5, t=60002 jiffies, g=602758, c=602757, q=24880)
[ 5727.864179] sending NMI to all CPUs:
[ 5727.864183] NMI backtrace for cpu 5
[ 5727.864186] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 3.13.3 #4
[ 5727.864187] Hardware name: Supermicro X8SIE/X8SIE, BIOS 1.0c 05/27/2010
[ 5727.864189] task: ffff880236095210 ti: ffff8802360ba000 task.ti: ffff8802360ba000
[ 5727.864191] RIP: 0010:[<ffffffff812bd031>]  [<ffffffff812bd031>] __const_udelay+0x21/0x30

.....
(Full dmesg in attachment).
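
For anyone going through the attached logs, the stall report can be picked
apart mechanically. The following is only a minimal sketch, assuming the
message sits on a single line as it does in raw dmesg output (not wrapped as
in a mail client). The function name parse_stall and the dictionary keys are
made up for illustration and are not part of any kernel tool:

```python
import re

# Matches the one-line form of the RCU stall report quoted in this bug.
# Only the fields used below are captured; g=, c= and q= are ignored.
STALL_RE = re.compile(
    r"INFO: (?P<flavor>\w+) detected stalls on CPUs/tasks: "
    r"\{(?P<cpus>[^}]*)\} \(detected by (?P<detector>\d+), "
    r"t=(?P<jiffies>\d+) jiffies"
)


def parse_stall(line):
    """Return the stalled CPUs, detecting CPU and stall length, or None."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return {
        "flavor": m.group("flavor"),
        "stalled_cpus": [int(c) for c in m.group("cpus").split()],
        "detected_by": int(m.group("detector")),
        "jiffies": int(m.group("jiffies")),
    }


line = (
    "[ 5727.864173] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} "
    "(detected by 5, t=60002 jiffies, g=602758, c=602757, q=24880)"
)
info = parse_stall(line)
print(info["stalled_cpus"], info["detected_by"], info["jiffies"])
```

Run over a whole dmesg file line by line, this would show whether the same
CPU is the one stalling each time.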

I don't know where to start looking. The machines run:

- HFSC traffic shaping of about 500 Mbit/s of traffic (low CPU load, not many
classes)
- e1000 and/or igb network drivers
- some (not very heavy) disk and CPU load from PostgreSQL
- irqbalance for IRQ balancing
- the BIRD routing daemon with OSPF

When this problem happens, one of the following usually (but not every time,
and with no obvious pattern) goes wrong:

- Network interrupts start to consume more CPU (rising from 2-3% on each core
to around 50% on each core)
- HFSC stops working and does nothing at all
- HFSC fails and no packets get through

I have not yet been able to reproduce this in a lab setup (it happens on
production servers), so I can't provide much useful debug output. If there is
anything more useful I should attach here, tell me.

I'm currently trying to bisect to find which change introduced this problem
(it doesn't happen on 3.1.1 to 3.6.9, and it certainly happens on 3.10.22 to
3.13.5), but it's quite a slow process because each test means waiting several
hours for the bug to occur.
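
Because each bisect step means booting a kernel and waiting, the good/bad
decision can be reduced to a small check run periodically on the test machine.
This is a rough sketch only: the function name bisect_verdict and the 8-hour
soak time are assumptions chosen for illustration (the report says only
"several hours"), and the result still has to be fed to git bisect by hand:

```python
# Substring present in the stall report quoted in this bug.
STALL_MARKER = "detected stalls on CPUs/tasks"


def bisect_verdict(dmesg_text, uptime_s, soak_s=8 * 3600):
    """Classify the currently booted kernel for a manual bisect step.

    dmesg_text: full kernel log as a string
    uptime_s:   seconds since boot
    soak_s:     hypothetical wait after which a quiet kernel counts as good
    """
    if STALL_MARKER in dmesg_text:
        return "bad"  # stall seen: mark this commit bad
    if uptime_s >= soak_s:
        return "good"  # ran long enough under load without a stall
    return "undecided"  # keep waiting


print(bisect_verdict("INFO: rcu_sched detected stalls on CPUs/tasks: { 7}", 5727))
print(bisect_verdict("", 3600))
```

A "good" verdict is only as reliable as the soak time, since the bug appears
at random; a commit wrongly marked good would send the bisect the wrong way.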

Dmesg logs showing the error are attached.

So far I have tried to rule out the following:

- the e1000 or igb driver (it happens with both)
- HFSC (it seems to happen even without HFSC)
- GRO, TSO, etc. offloads on the network interfaces (no effect)
- C-state idle drivers (I've been told that some NICs don't play well when
C-states go above 1, but limiting them didn't help much)

Thanks for any help on solving this.
-mk

PS. Because the machines are doing networking and this seems to be triggered by
heavy network usage, I filed this under the "networking" component, but I'm not
sure whether it's really a networking problem. Please reassign if it looks
otherwise.

-- 
You are receiving this mail because:
You are the assignee for the bug.