Message-ID: <56E9868E.4000006@badula.org>
Date:	Wed, 16 Mar 2016 12:15:10 -0400
From:	Ion Badulescu <ionut@...ula.org>
To:	linux-kernel@...r.kernel.org
Subject: Re: rcu stalls and soft lockups with recent kernels

Just following up to my own email:

It turns out that we can eliminate the RCU stalls by changing from 
CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu 
handle its own RCU callbacks completely fixes the problems for us.
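For reference, the Kconfig delta is just this (illustrative fragment, not our full config):

```
# before (the distro default):
CONFIG_RCU_NOCB_CPU_ALL=y
# after (what made the stalls go away for us):
CONFIG_RCU_NOCB_CPU_NONE=y
```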

Now, CONFIG_NO_HZ_FULL and CONFIG_RCU_NOCB_CPU_ALL are the default config 
for Fedora and RHEL7. Ho-humm... anybody interested in tracking this 
down further?

The bug is likely a missing wakeup for the rcuo* threads. That would 
explain why the stalls eventually get resolved, perhaps when another RCU 
callback gets scheduled (properly this time!) and the threads wake up 
and process their entire queue. But this is just speculation at this point.

And one final point: the RCU stall stack traces are actually rather 
useless when the RCU callbacks are delegated to kernel threads. What one 
particular cpu may be doing when it's fallen behind on RCU grace periods 
isn't very interesting, given that its rcuo* threads could be running 
anywhere, on any cpu.

Thanks,
-Ion


On 02/04/2016 02:12 PM, Ion Badulescu wrote:
> Hello,
>
> We run a compute cluster of about 800 or so machines here, which makes
> heavy use of NFS and fscache (on a dedicated local drive with an ext4
> filesystem) and also exercises the other local drives pretty hard. All
> the compute jobs run as unprivileged users with SCHED_OTHER scheduling,
> nice level 1.
>
> Over the past month or two, we've been seeing some strange and seemingly
> random rcu stalls and soft cpu lockups when running jobs on the cluster.
> The stalls are bad enough to cause hard drives to get kicked out of MD
> arrays, and even to get offlined altogether by the SCSI layer. The
> kernel running on these machines is based on 3.18.12 with some local
> patches, which we've been running quite happily since early May 2015.
> It's unclear what started triggering the stalls after all these months,
> as we've been unable to correlate them to any particular jobs. The one
> thing that's clear is that they tend to happen in bunches on multiple
> machines at the same time, so whatever it is, it's some centralized
> condition that triggers them. It could be a certain type of job, or it
> could be the state of the centralized NFS servers they access.
>
> In an effort to solve the issue and isolate its cause, we upgraded the
> kernel on about 80 of those machines to the latest 4.4.1 kernel, this
> time keeping out most of the local patches we had. We also enabled some
> more kernel debugging options, including lockdep. Only a few local
> patches were kept in:
> - printing some extra information in sysrq
> - disabling audit-to-printk
> - perf build fixes
> - changing the slab calculate_order algorithm to favor small allocation
> orders whenever possible.
>
> The full kernel config is available at
> http://www.badula.org/config-4.4.1 and the combined patch with our local
> changes is at http://www.badula.org/linux-ion-git-4.4.1.diff
>
> Unfortunately, the stalls did not go away. Many of the machines running
> 4.4.1 hit stalls with stack traces in the sunrpc transmit path, similar
> to this one:
>
>
> Feb  2 04:26:02 INFO: rcu_sched self-detected stall on CPUINFO:
> rcu_sched detected stalls on CPUs/tasks:
> Feb  2 04:26:02 6-...: (20999 ticks this GP) idle=cf1/140000000000001/0
> softirq=9090641/9090641 fqs=7000
> Feb  2 04:26:02 (detected by 5, t=21003 jiffies, g=1733453, c=1733452, q=0)
> Feb  2 04:26:02   running task        0 328895      2 0x00080088   6
> FAIR  -20        6
> Feb  2 04:26:02
> Feb  2 04:26:02  ffffffffffffff10 ffffffff8152abaf 0000000000000000
> 0000000000000004
> Feb  2 04:26:02  000000043121b8b0 ffff881001149808 0000000000000296
> 0208002000000000
> Feb  2 04:26:02  0000000000000296 ffff88003121b908 ffffffff810968fb
> ffff8810010ed9c8
> Feb  2 04:26:02 Call Trace:
> Feb  2 04:26:02  [<ffffffff8152abaf>] ? _raw_spin_lock_irqsave+0xf/0x40
> Feb  2 04:26:02  [<ffffffff810968fb>] ? finish_wait+0x6b/0x90
> Feb  2 04:26:02  [<ffffffff814694aa>] ? sk_stream_wait_memory+0x24a/0x2d0
> Feb  2 04:26:02  [<ffffffff810966a0>] ? woken_wake_function+0x20/0x20
> Feb  2 04:26:02  [<ffffffff8152a8da>] ? _raw_spin_unlock_bh+0x1a/0x20
> Feb  2 04:26:02  [<ffffffff8145d22d>] ? release_sock+0xfd/0x150
> Feb  2 04:26:02  [<ffffffff814bfc16>] ? tcp_sendpage+0xd6/0x5e0
> Feb  2 04:26:02  [<ffffffff814e7bc0>] ? inet_sendpage+0x50/0xe0
> Feb  2 04:26:02  [<ffffffffa03b4ea5>] ? xs_nospace+0x75/0xf0 [sunrpc]
> Feb  2 04:26:02  [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb  2 04:26:02  [<ffffffff81043946>] ? read_hpet+0x16/0x20
> Feb  2 04:26:02  [<ffffffff810b7ec2>] ? ktime_get+0x52/0xc0
> Feb  2 04:26:02  [<ffffffffa03b2133>] ? xprt_transmit+0x63/0x3a0 [sunrpc]
> Feb  2 04:26:02  [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb  2 04:26:02  [<ffffffff8152a8da>] ? _raw_spin_unlock_bh+0x1a/0x20
> Feb  2 04:26:02  [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb  2 04:26:02  [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb  2 04:26:02  [<ffffffffa03ae538>] ? call_transmit+0x1d8/0x2c0 [sunrpc]
> Feb  2 04:26:02  [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
> [sunrpc]
> Feb  2 04:26:02  [<ffffffffa03bb4e9>] ? __rpc_execute+0x89/0x3c0 [sunrpc]
> Feb  2 04:26:02  [<ffffffff81081f15>] ? finish_task_switch+0x125/0x260
> Feb  2 04:26:02  [<ffffffffa03bb8e5>] ? rpc_async_schedule+0x15/0x20
> [sunrpc]
> Feb  2 04:26:02  [<ffffffff8106fad8>] ? process_one_work+0x148/0x450
> Feb  2 04:26:02  [<ffffffff8106ff12>] ? worker_thread+0x132/0x600
> Feb  2 04:26:02  [<ffffffff81083862>] ? default_wake_function+0x12/0x20
> Feb  2 04:26:02  [<ffffffff81096566>] ? __wake_up_common+0x56/0x90
> Feb  2 04:26:02  [<ffffffff8106fde0>] ? process_one_work+0x450/0x450
> Feb  2 04:26:02  [<ffffffff8106fde0>] ? process_one_work+0x450/0x450
> Feb  2 04:26:02  [<ffffffff8107511c>] ? kthread+0xcc/0xf0
> Feb  2 04:26:02  [<ffffffff81075050>] ?
> kthread_freezable_should_stop+0x70/0x70
> Feb  2 04:26:02  [<ffffffff8152b0df>] ? ret_from_fork+0x3f/0x70
> Feb  2 04:26:02  [<ffffffff81075050>] ?
> kthread_freezable_should_stop+0x70/0x70
>
>
> So we added a cond_resched() call inside the loop in __rpc_execute,
> triggered once the loop has made more than 20 iterations. While clearly
> just a bandaid, and maybe in the wrong place, we wanted to see if it
> would help.
> And indeed it did, because the very common sunrpc-related stalls went
> away. Instead, this ext4-related stall started showing up:
>
>
> [91919.498218] INFO: rcu_sched self-detected stall on CPU
> [91919.498221] INFO: rcu_sched detected stalls on CPUs/tasks:
> [91919.498235]  15-...: (4 ticks this GP) idle=dbf/140000000000001/0
> softirq=16345141/16345141 fqs=1
> [91919.498238]  (detected by 10, t=178100 jiffies, g=4061489, c=4061488,
> q=0)
> [91919.498239] Task dump for CPU 15:
> [91919.498249] condor_starter  R  running task        0 506564 502795
> 0x00080080  15 FAIR    0     0-23    10504     2591
> [91919.498252]  0000000000000004 ffff880049bb7958 0000000000000246
> 0000000000000002
> [91919.498254]  0000000000000000 ffff880049bb7938 ffffffff812c83b6
> 0000000000000000
> [91919.498255]  ffffffff81156070 ffff880000000000 ffffffff00000000
> 0000000000000246
> [91919.498255] Call Trace:
> [91919.498261]  [<ffffffff812c83b6>] ? __radix_tree_lookup+0x36/0x100
> [91919.498265]  [<ffffffff81156070>] ? find_get_pages+0x280/0x280
> [91919.498267]  [<ffffffff81156165>] ? find_get_entry+0xf5/0x200
> [91919.498268]  [<ffffffff8115617d>] ? find_get_entry+0x10d/0x200
> [91919.498270]  [<ffffffff81156070>] ? find_get_pages+0x280/0x280
> [91919.498272]  [<ffffffff81158288>] ? pagecache_get_page+0x38/0x1d0
> [91919.498275]  [<ffffffff8120bff2>] ? alloc_buffer_head+0x42/0x50
> [91919.498277]  [<ffffffff8120c86d>] ? __find_get_block_slow+0x4d/0x160
> [91919.498280]  [<ffffffff8159f35b>] ? _raw_spin_unlock+0x2b/0x40
> [91919.498281]  [<ffffffff8120d74c>] ? create_empty_buffers+0x9c/0xc0
> [91919.498284]  [<ffffffff810aabb0>] ? mark_held_locks+0x70/0x90
> [91919.498285]  [<ffffffff815a00d7>] ? retint_kernel+0x10/0x10
> [91919.498286]  [<ffffffff810aae7f>] ? trace_hardirqs_on_caller+0xff/0x1c0
> [91919.498290]  [<ffffffff81001017>] ? trace_hardirqs_on_thunk+0x17/0x19
> [91919.498292]  [<ffffffff810c40cc>] ? rcu_irq_exit+0x6c/0xb0
> [91919.498293]  [<ffffffff815a00d7>] ? retint_kernel+0x10/0x10
> [91919.498296]  [<ffffffff812ceb95>] ?
> copy_user_enhanced_fast_string+0x5/0x10
> [91919.498299]  [<ffffffff812d54ec>] ?
> iov_iter_copy_from_user_atomic+0x9c/0x240
> [91919.498300]  [<ffffffff8115535e>] ? generic_perform_write+0xfe/0x1d0
> [91919.498302]  [<ffffffff811ed3f2>] ? generic_update_time+0x82/0xd0
> [91919.498305]  [<ffffffff81157466>] ?
> __generic_file_write_iter+0x196/0x1f0
> [91919.498315]  [<ffffffffa007dd31>] ? ext4_file_write_iter+0x121/0x3a0
> [ext4]
> [91919.498318]  [<ffffffff811d321c>] ? __sb_start_write+0xbc/0xd0
> [91919.498320]  [<ffffffff811d0abf>] ? __vfs_write+0xcf/0xf0
> [91919.498322]  [<ffffffff811d2123>] ? vfs_write+0xc3/0x1a0
> [91919.498323]  [<ffffffff811d22e9>] ? SyS_write+0x59/0xc0
> [91919.498325]  [<ffffffff8159f5d7>] ? entry_SYSCALL_64_fastpath+0x12/0x6f
> [91919.498327] rcu_sched kthread starved for 178097 jiffies! g4061489
> c4061488 f0x0 s3 ->state=0x1
> [91919.750048]
> [91919.751744]  15-...: (4 ticks this GP) idle=dbf/140000000000001/0
> softirq=16345141/16345141 fqs=2
> [91919.761832]   (t=178100 jiffies g=4061489 c=4061488 q=0)
> [91919.767820] Task dump for CPU 15:
> [91919.767821] condor_starter  R  running task        0 506564 502795
> 0x00080088  15 FAIR    0     0-23    10504     2591
> [91919.767830]  0000000000000010 ffff88207fc63b58 ffffffff8108aec5
> ffffffff8108acf8
> [91919.776246]  0000000000000001 0000000033322d30 ffffffff817e3471
> 00ff882000000000
> [91919.784645]  0000000000000000 0000000000000000 000000000000000f
> ffffffff81a4bb80
> [91919.793038] Call Trace:
> [91919.795830]  <IRQ>  [<ffffffff8108aec5>] sched_show_task+0x235/0x450
> [91919.803003]  [<ffffffff8108acf8>] ? sched_show_task+0x68/0x450
> [91919.809569]  [<ffffffff810b73af>] ? vprintk_default+0x1f/0x30
> [91919.816036]  [<ffffffff81154de9>] ? printk+0x46/0x48
> [91919.821625]  [<ffffffff8108b11f>] dump_cpu_task+0x3f/0x50
> [91919.827700]  [<ffffffff810c3594>] rcu_dump_cpu_stacks+0x84/0xc0
> [91919.834360]  [<ffffffff810c3ae9>] check_cpu_stall+0x1d9/0x290
> [91919.840823]  [<ffffffff810ad66c>] ? __lock_acquire+0x48c/0x17e0
> [91919.847504]  [<ffffffff810c56f2>] rcu_check_callbacks+0x102/0x3a0
> [91919.854371]  [<ffffffff810da629>] ? tick_sched_timer+0x29/0x80
> [91919.860936]  [<ffffffff8105f0a8>] ? __raise_softirq_irqoff+0x28/0x80
> [91919.868082]  [<ffffffff810a91fd>] ? trace_hardirqs_off+0xd/0x10
> [91919.879805]  [<ffffffff810c9fa9>] update_process_times+0x39/0x70
> [91919.886561]  [<ffffffff810da519>] tick_sched_handle+0x39/0x70
> [91919.893039]  [<ffffffff810da644>] tick_sched_timer+0x44/0x80
> [91919.899435]  [<ffffffff810caa56>] __hrtimer_run_queues+0xe6/0x2c0
> [91919.906286]  [<ffffffff810cad9c>] ? hrtimer_interrupt+0x7c/0x1a0
> [91919.913042]  [<ffffffff810da600>] ? tick_nohz_handler+0xb0/0xb0
> [91919.919701]  [<ffffffff810a91fd>] ? trace_hardirqs_off+0xd/0x10
> [91919.926365]  [<ffffffff810d09a8>] ?
> ktime_get_update_offsets_now+0xb8/0x170
> [91919.934191]  [<ffffffff810cad9c>] ? hrtimer_interrupt+0x7c/0x1a0
> [91919.940948]  [<ffffffff810cadbd>] hrtimer_interrupt+0x9d/0x1a0
> [91919.947518]  [<ffffffff81090274>] ?
> vtime_common_account_irq_enter+0x34/0x50
> [91919.955446]  [<ffffffff8103b529>] local_apic_timer_interrupt+0x39/0x60
> [91919.962786]  [<ffffffff815a1fd5>] smp_apic_timer_interrupt+0x45/0x59
> [91919.969930]  [<ffffffff815a035e>] apic_timer_interrupt+0x8e/0xa0
> [91919.976687]  <EOI>  [<ffffffff812ceb95>] ?
> copy_user_enhanced_fast_string+0x5/0x10
> [91919.985237]  [<ffffffff812d54ec>] ?
> iov_iter_copy_from_user_atomic+0x9c/0x240
> [91919.993274]  [<ffffffff8115535e>] generic_perform_write+0xfe/0x1d0
> [91920.000229]  [<ffffffff811ed3f2>] ? generic_update_time+0x82/0xd0
> [91920.007083]  [<ffffffff81157466>] __generic_file_write_iter+0x196/0x1f0
> [91920.014529]  [<ffffffffa007dd31>] ext4_file_write_iter+0x121/0x3a0
> [ext4]
> [91920.022161]  [<ffffffff811d321c>] ? __sb_start_write+0xbc/0xd0
> [91920.028742]  [<ffffffff811d0abf>] __vfs_write+0xcf/0xf0
> [91920.034629]  [<ffffffff811d2123>] vfs_write+0xc3/0x1a0
> [91920.040413]  [<ffffffff811d22e9>] SyS_write+0x59/0xc0
> [91920.046098]  [<ffffffff8159f5d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
>
>
> We also saw some cases of soft lockups showing no stack trace at all and
> RIP pointing to userspace memory:
>
>
> [92163.555725] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s!
> [strat_sim:570566]
> [92163.564455] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd
> grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat
> fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt
> iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg lpc_ich
> mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core ixgbe ptp
> pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm
> acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror
> dm_region_hash dm_log dm_mod
> [92163.614893] irq event stamp: 12063822
> [92163.619020] hardirqs last  enabled at (12063821):
> [<ffffffff815a00bf>] retint_user+0x18/0x20
> [92163.628536] hardirqs last disabled at (12063822):
> [<ffffffff815a0339>] apic_timer_interrupt+0x69/0xa0
> [92163.638920] softirqs last  enabled at (12063814):
> [<ffffffff8105fa08>] __do_softirq+0x1f8/0x390
> [92163.648723] softirqs last disabled at (12063801):
> [<ffffffff8105fceb>] irq_exit+0xfb/0x110
> [92163.658042] CPU: 7 PID: 570566 Comm: strat_sim Tainted: G
> I     4.4.1-el6.ia32e.lime.0 #1
> [92163.668134] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
> SE5C610.86B.01.01.0011.081020151200 08/10/2015
> [92163.679871] task: ffff880838a30000 ti: ffff8808c270c000 task.ti:
> ffff8808c270c000
> [92163.688302] RIP: 0033:[<0000000003cd9e89>]  [<0000000003cd9e89>]
> 0x3cd9e89
> [92163.696036] RSP: 002b:00007ffdd497e168  EFLAGS: 00000202
> [92163.702009] RAX: 000000002860dad8 RBX: 00000000285fb790 RCX:
> 00000000285fb730
> [92163.710024] RDX: 0000000015546340 RSI: 0000000000000001 RDI:
> 000000002861afb0
> [92163.718040] RBP: 00000000285cd220 R08: 0000000015476050 R09:
> 000000000000000a
> [92163.726055] R10: 0000000000000001 R11: 0000000000000009 R12:
> ffffffffffffff80
> [92163.734071] R13: ffffffffffffff80 R14: 000000000000007f R15:
> 000000002861afe8
> [92163.742088] FS:  00002b21c95e3de0(0000) GS:ffff88103fce0000(0000)
> knlGS:0000000000000000
> [92163.751199] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [92163.757661] CR2: 00002b21e3bcbd40 CR3: 00000004b3087000 CR4:
> 00000000001406e0
>
> [these are traces from different machines, btw, so don't try to read
> anything into their timestamps]
>
> At this point we're kinda stumped. Everything seems to point to some
> problem at a pretty low level, not related to sunrpc or ext4 or any
> other high level subsystem... but we can't quite figure out what it is.
>
> Any insights or potential fixes would be highly appreciated.
>
> Thanks,
> -Ion
>
>
