linux-kernel - Re: [rcu_sched stall] regression/miss-config ?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <67eb4bf6-c3d2-b9af-30ff-713a6d75e773@oracle.com>
Date:	Mon, 16 May 2016 12:49:41 -0700
From:	Santosh Shilimkar <santosh.shilimkar@...cle.com>
To:	paulmck@...ux.vnet.ibm.com
Cc:	linux-kernel@...r.kernel.org, Sasha Levin <sasha.levin@...cle.com>
Subject: Re: [rcu_sched stall] regression/miss-config ?

On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
>> On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
>>> On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@...cle.com wrote:
>>>> On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
>>>>> Hi Paul,
>>>>>
>>>>> I was asking Sasha about [1] since other folks in Oracle
>>>>> also stumbled upon similar RCU stalls with v4.1 kernel in
>>>>> different workloads. I was reported similar issue with
>>>>> RDS as well and looking at [1], [2], [3] and [4], thought
>>>>> of reaching out to see if you can help us to understand
>>>>> this issue better.
>>>>>
>>>>> Have also included RCU specific config used in these
>>>>> test(s). Its very hard to reproduce the issue but one of
>>>>> the data point is, it reproduces on systems with larger
>>>>> CPUs(64+). Same workload with less than 64 CPUs, don't
>>>>> show the issue. Someone also told me, making use of
>>>>> SLAB instead SLUB allocator makes difference but I
>>>>> haven't verified that part for RDS.
>>>>>
>>>>> Let me know your thoughts. Thanks in advance !!
>>>>>
>>>> One of my colleague told me the pastebin server I used
>>>> is Oracle internal only so adding the relevant logs along
>>>> with email.
>>>>
>>
>> [...]
>>
>>>>> [1] https://lkml.org/lkml/2014/12/14/304
>>>>
>>>>
>>>> [2]  Log 1 snippet:
>>>> -----------------------------------------------------------------
>>>> INFO: rcu_sched self-detected stall on CPU
>>>> INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
>>>> g=66023 c=66022 q=0)
>>>> Task dump for CPU 54:
>>>> ksoftirqd/54    R  running task        0   389      2 0x00000008
>>>>  0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
>>>>  ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
>>>>  ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
>>>> Call Trace:
>>>>  <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
>>>>  [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
>>>>  [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
>>>>  [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
>>>>  [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
>>>>  [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
>>>>  [<ffffffff810e9772>] update_process_times+0x42/0x70
>>>>  [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
>>>>  [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
>>>>  [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
>>>>  [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
>>>>  [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
>>>>  [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
>>>>  [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
>>>>  [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
>>>>  <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
>>>>  [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
>>>>  [<ffffffff8118e775>] __free_pages+0x25/0x40
>>>>  [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
>>>>  [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
>>>>  [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
>>>>  [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
>>>>  [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
>>>>  [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
>>>>  [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
>>>
>>> The most likely possibility is that there is a 60-second-long loop in
>>> one of the above functions.  This is within bottom-half execution, so
>>> unfortunately the usual trick of placing cond_resched_rcu_qs() within this
>>> loop, but outside of any RCU read-side critical section does not work.
>>>
>> First of all thanks for explanation.
>>
>> There is no loop which can last for 60 seconds in above code since
>> its just completion queue handler used to free up buffers much like
>> NIC
>> drivers bottom half(NAPI). Its done in tasklet context for latency
>> reasons which RDS care most. Just to get your attention, the RCU
>> stall is also seen with XEN code too. Log for it end of the email.
>>
>> Another important observation is, for RDS if we avoid higher
>> order page(s) allocation, issue is not reproducible so far.
>> In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations,
>> the system continues to run without any issue, so the loop scenario
>> is ruled out more or less.
>>
>> To be specific, with PAGE_SIZE allocations, alloc_pages()
>> is just allocating a page and __free_page() is used
>> instead of __free_pages() from below snippet.
>>
>> ------------------
>> if (bytes >= PAGE_SIZE)
>> 	page = alloc_pages(gfp, get_order(bytes));
>>
>> .....
>>
>> (rm->data.op_sg[i].length <= PAGE_SIZE) ?
>> __free_page(sg_page(&rm->data.op_sg[i])) :
>> __free_pages(sg_page(&rm->data.op_sg[i]),
>> get_order(rm->data.op_sg[i].length));
>> ----------------------------
>
> This sounds like something to take up with the mm folks.
>
Sure. Will do once the link between two issues is established.

>>> Therefore, if there really is a loop here, one fix would be to
>>> periodically unwind back out to run_ksoftirqd(), but setting up so that
>>> the work would be continued later.  Another fix might be to move this
>> >from tasklet context to workqueue context, where cond_resched_rcu_qs()
>>> can be used -- however, this looks a bit like networking code, which
>>> does not always take kindly to being run in process context (though
>>> careful use of local_bh_disable() and local_bh_enable() can sometimes
>>> overcome this issue).  A third fix, which works only if this code does
>>> not use RCU and does not invoke any code that does use RCU, is to tell
>>> RCU that it should ignore this code (which will require a little work
>>> on RCU, as it currently does not tolerate this sort of thing aside from
>>> the idle threads).  In this last approach, event-tracing calls must use
>>> the _nonidle suffix.
>>>
>>> I am not familiar with the RDS code, so I cannot be more specific.
>>
>> No worries. Since we saw the issue with XEN too, I was suspecting
>> that somehow we didn't have RCU_TREE config setup correctly or
>> some important RCU patch(s) missing in v4.1 which made it in
>> later kernels.
>>
>> The only common denominator I saw between these two different
>> usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
>> which I believe triggers the rcu_sched() chain.
>
> Exactly!
>
>> I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
>> get more information out of the system. Do you suggest any other
>> RCU option(s)/patch(s) which we can help us to capture more
>> information to understand the issue better. I wasn't sure about
>> options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
>> is also mentioned end of the email.
>
> Hmmm...  I just now noticed the "All QSes seen" message below.
> That can happen if the load is quite high, and could as you say
> one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
> This is not free, as it will mean more context switches involving
> the RCU grace-period kthreads.
>
> Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
> smaller than 60, say, 21.  This will cause the stall warnings to leave
> less time before splatting.
>
OK will try with CONFIG_RCU_KTHREAD_PRIO=1 &
CONFIG_RCU_CPU_STALL_TIMEOUT =21 . Indeed the usecases generate very 
high load on the systems.

> Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> you need more housekeeping CPUs than you currently have configured.
>
Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
book keeping. Seems like without that clock-event code will just use
CPU0 for things like broadcasting which might become bottleneck.
This could explain connect the hrtimer_interrupt() path getting slowed
down because of book keeping bottleneck.

$cat .config | grep NO_HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
CONFIG_NO_HZ=y
# CONFIG_RCU_FAST_NO_HZ is not set


Regards,
Santosh