linux-kernel - Re: [rcu_sched stall] regression/miss-config ?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 16 May 2016 17:58:20 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Santosh Shilimkar <santosh.shilimkar@...cle.com>
Cc:	linux-kernel@...r.kernel.org, Sasha Levin <sasha.levin@...cle.com>
Subject: Re: [rcu_sched stall] regression/miss-config ?

On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
> On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> >On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> >>On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
> >>>On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@...cle.com wrote:
> >>>>On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> >>>>>Hi Paul,
> >>>>>
> >>>>>I was asking Sasha about [1] since other folks in Oracle
> >>>>>also stumbled upon similar RCU stalls with v4.1 kernel in
> >>>>>different workloads. I was reported similar issue with
> >>>>>RDS as well and looking at [1], [2], [3] and [4], thought
> >>>>>of reaching out to see if you can help us to understand
> >>>>>this issue better.
> >>>>>
> >>>>>Have also included RCU specific config used in these
> >>>>>test(s). Its very hard to reproduce the issue but one of
> >>>>>the data point is, it reproduces on systems with larger
> >>>>>CPUs(64+). Same workload with less than 64 CPUs, don't
> >>>>>show the issue. Someone also told me, making use of
> >>>>>SLAB instead SLUB allocator makes difference but I
> >>>>>haven't verified that part for RDS.
> >>>>>
> >>>>>Let me know your thoughts. Thanks in advance !!
> >>>>>
> >>>>One of my colleague told me the pastebin server I used
> >>>>is Oracle internal only so adding the relevant logs along
> >>>>with email.
> >>>>
> >>
> >>[...]
> >>
> >>>>>[1] https://lkml.org/lkml/2014/12/14/304
> >>>>
> >>>>
> >>>>[2]  Log 1 snippet:
> >>>>-----------------------------------------------------------------
> >>>>INFO: rcu_sched self-detected stall on CPU
> >>>>INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
> >>>>g=66023 c=66022 q=0)
> >>>>Task dump for CPU 54:
> >>>>ksoftirqd/54    R  running task        0   389      2 0x00000008
> >>>> 0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
> >>>> ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
> >>>> ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
> >>>>Call Trace:
> >>>> <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
> >>>> [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
> >>>> [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
> >>>> [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
> >>>> [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
> >>>> [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
> >>>> [<ffffffff810e9772>] update_process_times+0x42/0x70
> >>>> [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
> >>>> [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
> >>>> [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
> >>>> [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
> >>>> [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
> >>>> [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
> >>>> [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
> >>>> [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
> >>>> <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
> >>>> [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
> >>>> [<ffffffff8118e775>] __free_pages+0x25/0x40
> >>>> [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
> >>>> [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
> >>>> [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
> >>>> [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
> >>>> [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
> >>>> [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
> >>>> [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
> >>>
> >>>The most likely possibility is that there is a 60-second-long loop in
> >>>one of the above functions.  This is within bottom-half execution, so
> >>>unfortunately the usual trick of placing cond_resched_rcu_qs() within this
> >>>loop, but outside of any RCU read-side critical section does not work.
> >>>
> >>First of all thanks for explanation.
> >>
> >>There is no loop which can last for 60 seconds in above code since
> >>its just completion queue handler used to free up buffers much like
> >>NIC
> >>drivers bottom half(NAPI). Its done in tasklet context for latency
> >>reasons which RDS care most. Just to get your attention, the RCU
> >>stall is also seen with XEN code too. Log for it end of the email.
> >>
> >>Another important observation is, for RDS if we avoid higher
> >>order page(s) allocation, issue is not reproducible so far.
> >>In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations,
> >>the system continues to run without any issue, so the loop scenario
> >>is ruled out more or less.
> >>
> >>To be specific, with PAGE_SIZE allocations, alloc_pages()
> >>is just allocating a page and __free_page() is used
> >>instead of __free_pages() from below snippet.
> >>
> >>------------------
> >>if (bytes >= PAGE_SIZE)
> >>	page = alloc_pages(gfp, get_order(bytes));
> >>
> >>.....
> >>
> >>(rm->data.op_sg[i].length <= PAGE_SIZE) ?
> >>__free_page(sg_page(&rm->data.op_sg[i])) :
> >>__free_pages(sg_page(&rm->data.op_sg[i]),
> >>get_order(rm->data.op_sg[i].length));
> >>----------------------------
> >
> >This sounds like something to take up with the mm folks.
> >
> Sure. Will do once the link between two issues is established.

Fair enough!

> >>>Therefore, if there really is a loop here, one fix would be to
> >>>periodically unwind back out to run_ksoftirqd(), but setting up so that
> >>>the work would be continued later.  Another fix might be to move this
> >>>from tasklet context to workqueue context, where cond_resched_rcu_qs()
> >>>can be used -- however, this looks a bit like networking code, which
> >>>does not always take kindly to being run in process context (though
> >>>careful use of local_bh_disable() and local_bh_enable() can sometimes
> >>>overcome this issue).  A third fix, which works only if this code does
> >>>not use RCU and does not invoke any code that does use RCU, is to tell
> >>>RCU that it should ignore this code (which will require a little work
> >>>on RCU, as it currently does not tolerate this sort of thing aside from
> >>>the idle threads).  In this last approach, event-tracing calls must use
> >>>the _nonidle suffix.
> >>>
> >>>I am not familiar with the RDS code, so I cannot be more specific.
> >>
> >>No worries. Since we saw the issue with XEN too, I was suspecting
> >>that somehow we didn't have RCU_TREE config setup correctly or
> >>some important RCU patch(s) missing in v4.1 which made it in
> >>later kernels.
> >>
> >>The only common denominator I saw between these two different
> >>usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
> >>which I believe triggers the rcu_sched() chain.
> >
> >Exactly!
> >
> >>I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
> >>get more information out of the system. Do you suggest any other
> >>RCU option(s)/patch(s) which we can help us to capture more
> >>information to understand the issue better. I wasn't sure about
> >>options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
> >>is also mentioned end of the email.
> >
> >Hmmm...  I just now noticed the "All QSes seen" message below.
> >That can happen if the load is quite high, and could as you say
> >one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
> >This is not free, as it will mean more context switches involving
> >the RCU grace-period kthreads.
> >
> >Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
> >smaller than 60, say, 21.  This will cause the stall warnings to leave
> >less time before splatting.
> >
> OK will try with CONFIG_RCU_KTHREAD_PRIO=1 &
> CONFIG_RCU_CPU_STALL_TIMEOUT =21 . Indeed the usecases generate very
> high load on the systems.
> 
> >Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> >you need more housekeeping CPUs than you currently have configured.
> >
> Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
> book keeping. Seems like without that clock-event code will just use
> CPU0 for things like broadcasting which might become bottleneck.
> This could explain connect the hrtimer_interrupt() path getting slowed
> down because of book keeping bottleneck.
> 
> $cat .config | grep NO_HZ
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> # CONFIG_NO_HZ_FULL_ALL is not set
> # CONFIG_NO_HZ_FULL_SYSIDLE is not set
> CONFIG_NO_HZ=y
> # CONFIG_RCU_FAST_NO_HZ is not set

Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
housekeeping tasks, including the RCU grace-period kthreads.  So you are
booting without any nohz_full boot parameter?  You can end up with the
same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
that you can with CONFIG_NO_HZ_FULL_ALL=y.

							Thanx, Paul