linux-kernel - Re: RCU stalls on heavy socket traffic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150309155947.GW5236@linux.vnet.ibm.com>
Date:	Mon, 9 Mar 2015 08:59:47 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:	linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: RCU stalls on heavy socket traffic

On Mon, Mar 09, 2015 at 04:26:51PM +0100, Sebastian Andrzej Siewior wrote:
> So I run "hackbench -g 600 -l 350 -s 250" which takes approx 77 seconds
> to complete. Then this popped up:
> 
> |INFO: rcu_preempt detected stalls on CPUs/tasks:
> | Tasks blocked on level-0 rcu_node (CPUs 0-3): P24858 P28514 P25185 P25184 P28713 P19549 P3139 P25275 P28474 P29062 P6703 P10 106 P14309 P27910 P4514 P14834 P28385 P21073 P27701 P642 P10340 P16939 P19147 P16949 P16945 P16952 P16941 P16937 P16938 P16946 P16954 P28664  P18701 P17782 P4875 P8873
> | Tasks blocked on level-0 rcu_node (CPUs 0-3): P24858 P28514 P25185 P25184 P28713 P19549 P3139 P25275 P28474 P29062 P6703 P10 106 P14309 P27910 P4514 P14834 P28385 P21073 P27701 P642 P10340 P16939 P19147 P16949 P16945 P16952 P16941 P16937 P16938 P16946 P16954 P28664  P18701 P17782 P4875 P8873

Definitely lots of tasks recently preempted within RCU read-side
critical sections!

For example, the P24858 says that the task with PID 24858 was preempted
(or blocked) while within an RCU read-side critical section that blocks
the current grace period, and that this task has not yet exited that
critical section.  This might be because that task has been preempted
throughout by a higher-priority task, that there is an infinite loop in
that critical section, or that the task is blocked on a spinlock that
is being held indefinitely by some other task (possibly because that
other task is being indefinitely preempted).

> | (detected by 0, t=5252 jiffies, g=2385, c=2384, q=41180)
> |hackbench       R  running task        0 24858   1995 0x00000000
> | ffff880058d9bb18 0000000000000082 0000000000012e40 ffff880058d9bfd8
> | 0000000000012e40 ffff8804226627c0 ffff880058d9bb38 ffff880058d9bfd8
> | 0000000000000001 ffff880058d5ccf8 0000000000000292 0000000000000002
> |Call Trace:
> | [<ffffffff81568e9f>] preempt_schedule+0x3f/0x60
> | [<ffffffff812d84d7>] ___preempt_schedule+0x35/0x67
> | [<ffffffff8156b725>] ? _raw_spin_unlock_irqrestore+0x25/0x30
> | [<ffffffff81088503>] try_to_wake_up+0x63/0x2f0
> | [<ffffffff8108881d>] default_wake_function+0xd/0x10
> | [<ffffffff8109b011>] autoremove_wake_function+0x11/0x40
> | [<ffffffff8109aa05>] __wake_up_common+0x55/0x90
> | [<ffffffff8109afd3>] __wake_up_sync_key+0x43/0x60
> | [<ffffffff8145c95e>] sock_def_readable+0x3e/0x70
> | [<ffffffff8150f9d1>] unix_stream_sendmsg+0x211/0x470
> | [<ffffffff81458b48>] sock_aio_write+0xf8/0x120
> | [<ffffffff8109ec19>] ? rt_up_read+0x19/0x20
> | [<ffffffff8116e855>] do_sync_write+0x55/0x90
> | [<ffffffff8116f435>] vfs_write+0x175/0x1f0
> | [<ffffffff8116fde4>] SyS_write+0x44/0xb0
> | [<ffffffff8156c1ed>] system_call_fastpath+0x16/0x1b
> 
> The other processes look more or less the same. I have the full splat
> here [0]. My understanding is that sock_def_readable() does a
> rcu_read_lock() which forbids a grace period. Since there are many
> processes (preempted) in this section, the grace period never starts
> since it never blocks new readers from getting into a read critical
> section.
> Is my understanding correct so far? Is it likely that -RT does not do
> something correctly to forbid such a situation or is it more or less
> "expected" ?
> 
> [0] https://breakpoint.cc/rt-rcu-stall.txt

If you don't have CONFIG_RCU_BOOST=y with priority set above that of
any CPU-bound components of your workload, then yes, this is expected
behavior.  The priority boosting will be inherited, so that this covers
the case of indefinite preemption.  Of course, this won't help with
infinite loops.

That said, someone was having trouble with the rcuoc kthreads that are
generated by CONFIG_RCU_BOOST=y when running NO_HZ_FULL.  So there might
need to be different kernel configs for different workloads.  Or we might
need some way to enable and disable boosting at boot time, I suppose.

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/