linux-kernel - Re: Query regarding synchronize_sched_expedited and resched

Message-ID: <ede28a3e-7708-f5e7-8480-f339c7c37fc5@codeaurora.org>
Date:   Sun, 17 Sep 2017 11:37:06 +0530
From:   Neeraj Upadhyay <neeraju@...eaurora.org>
To:     paulmck@...ux.vnet.ibm.com
Cc:     josh@...htriplett.org, rostedt@...dmis.org,
        mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
        linux-kernel@...r.kernel.org, sramana@...eaurora.org,
        prsood@...eaurora.org, pkondeti@...eaurora.org,
        markivx@...eaurora.org, peterz@...radead.org
Subject: Re: Query regarding synchronize_sched_expedited and resched_cpu



On 09/17/2017 06:30 AM, Paul E. McKenney wrote:
> On Fri, Sep 15, 2017 at 04:44:38PM +0530, Neeraj Upadhyay wrote:
>> Hi,
>>
>> We have a query regarding the behavior of RCU expedited grace periods
>> in the scenario where resched_cpu(), called from sync_sched_exp_handler(),
>> fails to acquire the rq lock and returns without setting need_resched.
>> In this case, how do we ensure that the CPU notifies RCU about the
>> end of the sched grace period (schedule() -> __schedule() ->
>> rcu_note_context_switch(cpu) -> rcu_sched_qs()) when the tick
>> is stopped on that CPU? Is it implied by the rq lock acquisition
>> failure that the owner of the rq lock will enforce a context switch?
>> In which scenarios in the RCU paths (the function is used only in RCU
>> code) do we need the trylock check in resched_cpu()?
>>
>> void resched_cpu(int cpu)
>> {
>>          struct rq *rq = cpu_rq(cpu);
>>          unsigned long flags;
>>
>>          if (!raw_spin_trylock_irqsave(&rq->lock, flags))
>>                  return;
>>          resched_curr(rq);
>>          raw_spin_unlock_irqrestore(&rq->lock, flags);
>> }
>>
>>
>> This issue was observed in the scenario below, where one of the CPUs (CPU1)
>> started synchronize_sched_expedited() and sent an IPI to CPU5, which was in
>> the idle path but handled the sync_sched_exp_handler() IPI before
>> rcu_idle_enter().
>> As resched_cpu() failed to acquire the rq lock, need_resched was not set
>> and the CPU went idle, resulting in an expedited stall being reported
>> by CPU1.
>>
>> Below is the scenario:
>>
>> •    CPU1 is waiting for expedited wait to complete:
>> sync_rcu_exp_select_cpus
>>      rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
>>      IPI sent to CPU5
>>
>> synchronize_sched_expedited_wait
>>      ret = swait_event_timeout(rsp->expedited_wq,
>>                                sync_rcu_preempt_exp_done(rnp_root),
>>                                jiffies_stall);
>>
>>      expmask = 0x20, and CPU5 is in the idle path (in cpuidle_enter())
>>
>>
>>
>> •    CPU5 handles IPI and fails to acquire rq lock.
>>
>> Handles IPI
>>      sync_sched_exp_handler
>>          resched_cpu
>>              returns early because the trylock on rq->lock fails
>>          need_resched is not set
>>
>> •    CPU5 calls rcu_idle_enter() and, as need_resched is not set, goes
>>      idle (schedule() is not called).
>>
>> •    CPU1 reports an RCU stall.
> Good catch and good detective work!!!
>
> I will be working on a fix this week, hopefully involving resched_cpu()
> getting a return value so that I can track who needs a later retry.
>
> 							Thanx, Paul
>
Hi Paul, how about replacing raw_spin_trylock_irqsave() with
raw_spin_lock_irqsave() in resched_cpu()? Are there any paths
in the RCU code that depend on the trylock check/spinlock recursion?
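
To make the options concrete, here is a rough userspace model, not kernel
code. It assumes a pthread mutex standing in for rq->lock and a plain flag
standing in for need_resched; struct rq_model and the three function names
are made up for illustration. It contrasts the current trylock behavior,
a variant that returns a value so the caller can track who needs a later
retry, and the unconditional-lock variant asked about above.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct rq: lock and resched flag only. */
struct rq_model {
	pthread_mutex_t lock;
	bool need_resched;
};

/* Models the current resched_cpu(): gives up silently if the lock is busy. */
static void resched_cpu_trylock(struct rq_model *rq)
{
	if (pthread_mutex_trylock(&rq->lock) != 0)
		return;	/* request is dropped and the caller cannot tell */
	rq->need_resched = true;
	pthread_mutex_unlock(&rq->lock);
}

/* Same trylock, but reports failure so the caller can retry later. */
static bool resched_cpu_report(struct rq_model *rq)
{
	if (pthread_mutex_trylock(&rq->lock) != 0)
		return false;
	rq->need_resched = true;
	pthread_mutex_unlock(&rq->lock);
	return true;
}

/* Unconditional lock: waits for the lock, so the flag is always set. */
static void resched_cpu_lock(struct rq_model *rq)
{
	pthread_mutex_lock(&rq->lock);
	rq->need_resched = true;
	pthread_mutex_unlock(&rq->lock);
}
```

With the model lock held, resched_cpu_trylock() leaves need_resched clear,
which corresponds to the CPU5 case in the scenario, while
resched_cpu_report() returns false and leaves the retry decision to the
caller. The model says nothing about whether taking rq->lock
unconditionally is safe from the IPI handler; that is the open question.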

Thanks
Neeraj

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation