Message-ID: <ede28a3e-7708-f5e7-8480-f339c7c37fc5@codeaurora.org>
Date: Sun, 17 Sep 2017 11:37:06 +0530
From: Neeraj Upadhyay <neeraju@...eaurora.org>
To: paulmck@...ux.vnet.ibm.com
Cc: josh@...htriplett.org, rostedt@...dmis.org,
mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
linux-kernel@...r.kernel.org, sramana@...eaurora.org,
prsood@...eaurora.org, pkondeti@...eaurora.org,
markivx@...eaurora.org, peterz@...radead.org
Subject: Re: Query regarding synchronize_sched_expedited and resched_cpu
On 09/17/2017 06:30 AM, Paul E. McKenney wrote:
> On Fri, Sep 15, 2017 at 04:44:38PM +0530, Neeraj Upadhyay wrote:
>> Hi,
>>
>> We have a query regarding the behavior of the RCU expedited grace period
>> in the scenario where resched_cpu() in sync_sched_exp_handler() fails to
>> acquire the rq lock and returns without setting need_resched. In this
>> case, how do we ensure that the CPU notifies RCU about the
>> end of the sched grace period (schedule() -> __schedule() ->
>> rcu_note_context_switch(cpu) -> rcu_sched_qs()) when the tick
>> is stopped on that CPU? Is it implied by the rq lock acquisition
>> failure that the owner of the rq lock will enforce a context switch?
>> For which scenarios in the RCU paths (the function is used only in RCU
>> code) do we need the trylock check in resched_cpu()?
>>
>> void resched_cpu(int cpu)
>> {
>> 	struct rq *rq = cpu_rq(cpu);
>> 	unsigned long flags;
>>
>> 	if (!raw_spin_trylock_irqsave(&rq->lock, flags))
>> 		return;
>> 	resched_curr(rq);
>> 	raw_spin_unlock_irqrestore(&rq->lock, flags);
>> }
>>
>>
>> This issue was observed in the scenario below, where one of the CPUs
>> (CPU1) started synchronize_sched_expedited() and sent an IPI to CPU5,
>> which was in the idle path but handled the sync_sched_exp_handler() IPI
>> before rcu_idle_enter().
>> As resched_cpu() failed to acquire the rq lock, need_resched was not set,
>> and the CPU went idle, resulting in an expedited stall being reported
>> by CPU1.
>>
>> Below is the scenario:
>>
>> • CPU1 is waiting for the expedited grace period to complete:
>>     sync_rcu_exp_select_cpus
>>         rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
>>         IPI sent to CPU5
>>
>>     synchronize_sched_expedited_wait
>>         ret = swait_event_timeout(
>>                 rsp->expedited_wq,
>>                 sync_rcu_preempt_exp_done(rnp_root),
>>                 jiffies_stall);
>>
>>     expmask = 0x20, and CPU5 is in the idle path (in cpuidle_enter())
>>
>>
>>
>> • CPU5 handles the IPI and fails to acquire the rq lock:
>>
>>     Handles IPI
>>         sync_sched_exp_handler
>>             resched_cpu
>>                 returns after the trylock on rq->lock fails
>>                 need_resched is not set
>>
>> • CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes
>>   idle (schedule() is not called).
>>
>> • CPU1 reports an RCU stall.
> Good catch and good detective work!!!
>
> I will be working on a fix this week, hopefully involving resched_cpu()
> getting a return value so that I can track who needs a later retry.
>
> Thanx, Paul
>
Hi Paul, how about replacing raw_spin_trylock_irqsave() with
raw_spin_lock_irqsave() in resched_cpu()? Are there any paths
in the RCU code that depend on the trylock check/spinlock recursion?
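
To make the two options in this thread concrete, here is a minimal userspace
sketch, not kernel code. It assumes a toy model in which struct model_rq and
every model_* function are hypothetical stand-ins invented for illustration:
an atomic_flag plays the role of rq->lock and a plain bool plays the role of
need_resched. model_resched_cpu_try() follows the idea of giving the trylock
version a return value so the caller can retry later, and
model_resched_cpu_lock() follows the idea of taking the lock unconditionally.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct rq. */
struct model_rq {
	atomic_flag lock;	/* models rq->lock */
	bool need_resched;	/* models need_resched on the rq's current task */
};

/* Returns true if the lock was free and is now held by the caller. */
bool model_trylock(struct model_rq *rq)
{
	/* test_and_set returns the previous state: false means it was free. */
	return !atomic_flag_test_and_set_explicit(&rq->lock,
						  memory_order_acquire);
}

/* Spins until the lock is acquired; never returns if it is never released. */
void model_lock(struct model_rq *rq)
{
	while (!model_trylock(rq))
		;
}

void model_unlock(struct model_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/*
 * Trylock variant with a return value: false tells the caller that
 * need_resched was NOT set and that a later retry is needed.
 */
bool model_resched_cpu_try(struct model_rq *rq)
{
	if (!model_trylock(rq))
		return false;
	rq->need_resched = true;
	model_unlock(rq);
	return true;
}

/* Blocking variant: always sets need_resched once the lock is obtained. */
void model_resched_cpu_lock(struct model_rq *rq)
{
	model_lock(rq);
	rq->need_resched = true;
	model_unlock(rq);
}
```

In this model, if the lock is already held when model_resched_cpu_try() is
called, it returns false and need_resched stays clear, which mirrors the lost
reschedule described above. model_resched_cpu_lock() cannot lose the request,
but it would spin forever if the caller itself already held the lock, which is
the recursion concern behind the question.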
Thanks
Neeraj
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation