[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <0101016e7f6441dc-8b48f47b-1771-4616-a198-6aeda4030807-000000@us-west-2.amazonses.com>
Date: Mon, 18 Nov 2019 16:41:47 +0000
From: Neeraj Upadhyay <neeraju@...eaurora.org>
To: paulmck@...nel.org
Cc: josh@...htriplett.org, joel@...lfernandes.org, rostedt@...dmis.org,
mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
pkondeti@...eaurora.org, prsood@...eaurora.org,
gkohli@...eaurora.org
Subject: Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters
Hi Paul,
On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
>> Hi Paul,
>>
>> On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
>>> On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
>>>> For the tasks waiting in exp_wq inside exp_funnel_lock(),
>>>> there is a chance that they might be indefinitely blocked
>>>> in below scenario:
>>>>
>>>> 1. There is a task waiting on exp sequence 0b'100' inside
>>>> exp_funnel_lock().
>>>>
>>>> _synchronize_rcu_expedited()
>>>
>>> This symbol went away a few versions back, but let's see how this
>>> plays out in current -rcu.
>>>
>>
>> Sorry; for us this problem is observed on 4.19 stable version; I had
>> checked against the -rcu code, and the relevant portions were present
>> there.
>>
>>>> s = 0b'100
>>>> exp_funnel_lock()
>>>> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
>>>
>>> All of the above could still happen if the expedited grace
>>> period number was zero (or a bit less) when that task invoked
>>
>> Yes
>>
>>> synchronize_rcu_expedited(). What is the relation, if any,
>>> between this task and "task1" below? Seems like you want them to
>>> be different tasks.
>>>
>>
>> This task is the one which is waiting for the expedited sequence, which
>> "task1" completes ("task1" holds the exp_mutex for it). "task1" would
>> wake up this task, on exp GP completion.
>>
>>> Does this task actually block, or is it just getting ready
>>> to block? Seems like you need it to have actually blocked.
>>>
>>
>> Yes, it actually blocked in wait queue.
>>
>>>> 2. The Exp GP completes and task (task1) holding exp_mutex queues
>>>> worker and schedules out.
>>>
>>> "The Exp GP" being the one that was initiated when the .expedited_sequence
>>> counter was zero, correct? (Looks that way below.)
>>>
>> Yes, correct.
>>
>>>> _synchronize_rcu_expedited()
>>>> s = 0b'100
>>>> queue_work(rcu_gp_wq, &rew.rew_work)
>>>> wake_up_worker()
>>>> schedule()
>>>>
>>>> 3. kworker A picks up the queued work and completes the exp gp
>>>> sequence.
>>>>
>>>> rcu_exp_wait_wake()
>>>> rcu_exp_wait_wake()
>>>> rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
>>>> // to 0b'100'
>>>>
>>>> 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
>>>> and releases exp_mutex.
>>>>
>>>> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>>> sync_exp_work_done(rsp, s));
>>>> mutex_unlock(&rsp->exp_mutex);
>>>
>>> So task1 is the one that initiated the expedited grace period that
>>> started when .expedited_sequence was zero, right?
>>>
>>
>> Yes, right.
>>
>>>> 5. Next exp GP completes, and sequence number is incremented:
>>>>
>>>> rcu_exp_wait_wake()
>>>> rcu_exp_wait_wake()
>>>> rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
>>>>
>>>> 6. As kworker A uses current expedited_sequence, it wakes up workers
>>>> from wrong wait queue index - it should have worken wait queue
>>>> corresponding to 0b'100' sequence, but wakes up the ones for
>>>> 0b'200' sequence. This results in task at step 1 indefinitely blocked.
>>>>
>>>> rcu_exp_wait_wake()
>>>> wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>>>
>>> So the issue is that the next expedited RCU grace period might
>>> have completed before the completion of the wakeups for the previous
>>> expedited RCU grace period, correct? Then expedited grace periods have
>>
>> Yes. Actually from the ftraces, I saw that next expedited RCU grace
>> period completed while kworker A was in D state, while waiting for
>> exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
>> its wake_up_all() call; so, task (point 1) was never woken up, as it was
>> waiting on wq index 1.
>>
>>> to have stopped to prevent any future wakeup from happening, correct?
>>> (Which would make it harder for rcutorture to trigger this, though it
>>> really does have code that attempts to trigger this sort of thing.)
>>>
>>> Is this theoretical in nature, or have you actually triggered it?
>>> If actually triggered, what did you do to make this happen?
>>
>> This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
>> another instance in Nov 2018 (on 4.14 kernel), in our customer reported
>> issues. Both instances were in downstream drivers and we didn't have RCU
>> traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
>> enabled, where it was observed in suspend scenario, where we are observing
>> "DPM device timeout" [1], as scsi device is stuck in
>> _synchronize_rcu_expedited().
>>
>> schedule+0x70/0x90
>> _synchronize_rcu_expedited+0x590/0x5f8
>> synchronize_rcu+0x50/0xa0
>> scsi_device_quiesce+0x50/0x120
>> scsi_bus_suspend+0x70/0xe8
>> dpm_run_callback+0x148/0x388
>> __device_suspend+0x430/0x8a8
>>
>> [1]
>> https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
>>
>>> What have you done to test the change?
>>>
>>
>> I have given this for testing; will share the results . Current analysis
>> and patch is based on going through ftrace and code review.
>
> OK, very good. Please include the failure information in the changelog
> of the next version of this patch.
>
> I prefer your original patch, that just uses "s", over the one below
> that moves the rcu_exp_gp_seq_end(). The big advantage of your original
> patch is that it allow more concurrency between a consecutive pair of
> expedited RCU grace periods. Plus it would not be easy to convince
> myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> is also conceptually simpler with a more manageable state space.
>
> Please also add the WARN_ON(), though at first glance your change seems
> to have lost the wakeup. (But it is early, so maybe it is just that I
> am not yet fully awake.)
My bad, I posted incomplete diff in previous mail:
static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
{
struct rcu_node *rnp;
+ unsigned long exp_low;
+ unsigned long s_low = rcu_seq_ctr(s) & 0x3;
synchronize_sched_expedited_wait(rsp);
rcu_exp_gp_seq_end(rsp);
@@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
unsigned long s)
spin_unlock(&rnp->exp_lock);
}
smp_mb(); /* All above changes before wakeup. */
- wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+ exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+ WARN_ON(s_low != exp_low);
+ wake_up_all(&rnp->exp_wq[exp_low]);
}
Thanks
Neeraj
>
> Thanx, Paul
>
>> I was thinking of another way of addressing this problem: Doing exp seq end
>> inside exp_wake_mutex. This will also ensure that, if we extend the current
>> scenario and there are multiple expedited GP sequence, which have completed,
>> before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
>> entries [2].
>>
>>
>> [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
>>
>>
>> @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>> struct rcu_node *rnp;
>>
>> synchronize_sched_expedited_wait(rsp);
>> - rcu_exp_gp_seq_end(rsp);
>> - trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>>
>> /*
>> * Switch over to wakeup mode, allowing the next GP, but -only- the
>> @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>> */
>> mutex_lock(&rsp->exp_wake_mutex);
>>
>> + rcu_exp_gp_seq_end(rsp);
>> + trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
>> +
>>
>>
>>
>>> (Using a WARN_ON() to check for the lower bits of the counter portion
>>> of rcu_state.expedited_sequence differing from the same bits of s
>>> would be one way to detect this problem.)
>>>
>>> Thanx, Paul
>>>
>>
>> I have also given the patch for this, for testing:
>>
>> static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
>> {
>> struct rcu_node *rnp;
>> + unsigned long exp_low;
>> + unsigned long s_low = rcu_seq_ctr(s) & 0x3;
>>
>> synchronize_sched_expedited_wait(rsp);
>> rcu_exp_gp_seq_end(rsp);
>> @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
>> unsigned long s)
>> spin_unlock(&rnp->exp_lock);
>> }
>> smp_mb(); /* All above changes before wakeup. */
>> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
>> + exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
>> + WARN_ON(s_low != exp_low);
>> +
>>
>> Thanks
>> Neeraj
>>
>>>> Fix this by using the correct sequence number for wake_up_all() inside
>>>> rcu_exp_wait_wake().
>>>>
>>>> Signed-off-by: Neeraj Upadhyay <neeraju@...eaurora.org>
>>>> ---
>>>> kernel/rcu/tree_exp.h | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>> index e4b77d3..28979d3 100644
>>>> --- a/kernel/rcu/tree_exp.h
>>>> +++ b/kernel/rcu/tree_exp.h
>>>> @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
>>>> spin_unlock(&rnp->exp_lock);
>>>> }
>>>> smp_mb(); /* All above changes before wakeup. */
>>>> - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
>>>> + wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
>>>> }
>>>> trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
>>>> mutex_unlock(&rcu_state.exp_wake_mutex);
>>>> --
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>
>>
>> --
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
>> the Code Aurora Forum, hosted by The Linux Foundation
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation
Powered by blists - more mailing lists