linux-kernel - Re: [PATCH] rcu: Fix missed wakeup of exp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191119040533.GS2889@paulmck-ThinkPad-P72>
Date:   Mon, 18 Nov 2019 20:05:33 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Neeraj Upadhyay <neeraju@...eaurora.org>
Cc:     josh@...htriplett.org, joel@...lfernandes.org, rostedt@...dmis.org,
        mathieu.desnoyers@...icios.com, jiangshanlai@...il.com,
        linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
        pkondeti@...eaurora.org, prsood@...eaurora.org,
        gkohli@...eaurora.org
Subject: Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters

On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
> Hi Paul,
> 
> On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
> > On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
> > > Hi Paul,
> > > 
> > > 
> > > On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
> > > > On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
> > > > > > On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
> > > > > > > For the tasks waiting in exp_wq inside exp_funnel_lock(),
> > > > > > > there is a chance that they might be indefinitely blocked
> > > > > > > in below scenario:
> > > > > > > 
> > > > > > > 1. There is a task waiting on exp sequence 0b'100' inside
> > > > > > >       exp_funnel_lock().
> > > > > > > 
> > > > > > >       _synchronize_rcu_expedited()
> > > > > > 
> > > > > > This symbol went away a few versions back, but let's see how this
> > > > > > plays out in current -rcu.
> > > > > > 
> > > > > 
> > > > > Sorry; for us this problem is observed on 4.19 stable version; I had
> > > > > checked against the -rcu code, and the relevant portions were present
> > > > > there.
> > > > > 
> > > > > > >         s = 0b'100
> > > > > > >         exp_funnel_lock()
> > > > > > >           wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]
> > > > > > 
> > > > > > All of the above could still happen if the expedited grace
> > > > > > period number was zero (or a bit less) when that task invoked
> > > > > 
> > > > > Yes
> > > > > 
> > > > > > synchronize_rcu_expedited().  What is the relation, if any,
> > > > > > between this task and "task1" below?  Seems like you want them to
> > > > > > be different tasks.
> > > > > > 
> > > > > 
> > > > > This task is the one which is waiting for the expedited sequence, which
> > > > > "task1" completes ("task1" holds the exp_mutex for it). "task1" would
> > > > > wake up this task, on exp GP completion.
> > > > > 
> > > > > > Does this task actually block, or is it just getting ready
> > > > > > to block?  Seems like you need it to have actually blocked.
> > > > > > 
> > > > > 
> > > > > Yes, it actually blocked in wait queue.
> > > > > 
> > > > > > > 2. The Exp GP completes and task (task1) holding exp_mutex queues
> > > > > > >       worker and schedules out.
> > > > > > 
> > > > > > "The Exp GP" being the one that was initiated when the .expedited_sequence
> > > > > > counter was zero, correct?  (Looks that way below.)
> > > > > > 
> > > > > Yes, correct.
> > > > > 
> > > > > > >       _synchronize_rcu_expedited()
> > > > > > >         s = 0b'100
> > > > > > >         queue_work(rcu_gp_wq, &rew.rew_work)
> > > > > > >           wake_up_worker()
> > > > > > >             schedule()
> > > > > > > 
> > > > > > > 3. kworker A picks up the queued work and completes the exp gp
> > > > > > >       sequence.
> > > > > > > 
> > > > > > >       rcu_exp_wait_wake()
> > > > > > >         rcu_exp_wait_wake()
> > > > > > >           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
> > > > > > >                                   // to 0b'100'
> > > > > > > 
> > > > > > > 4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
> > > > > > >       and releases exp_mutex.
> > > > > > > 
> > > > > > >       wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > > > >         sync_exp_work_done(rsp, s));
> > > > > > >       mutex_unlock(&rsp->exp_mutex);
> > > > > > 
> > > > > > So task1 is the one that initiated the expedited grace period that
> > > > > > started when .expedited_sequence was zero, right?
> > > > > > 
> > > > > 
> > > > > Yes, right.
> > > > > 
> > > > > > > 5. Next exp GP completes, and sequence number is incremented:
> > > > > > > 
> > > > > > >       rcu_exp_wait_wake()
> > > > > > >         rcu_exp_wait_wake()
> > > > > > >           rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'
> > > > > > > 
> > > > > > > 6. As kworker A uses current expedited_sequence, it wakes up workers
> > > > > > >       from wrong wait queue index - it should have worken wait queue
> > > > > > >       corresponding to 0b'100' sequence, but wakes up the ones for
> > > > > > >       0b'200' sequence. This results in task at step 1 indefinitely blocked.
> > > > > > > 
> > > > > > >       rcu_exp_wait_wake()
> > > > > > >         wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > > 
> > > > > > So the issue is that the next expedited RCU grace period might
> > > > > > have completed before the completion of the wakeups for the previous
> > > > > > expedited RCU grace period, correct?  Then expedited grace periods have
> > > > > 
> > > > > Yes. Actually from the ftraces, I saw that next expedited RCU grace
> > > > > period completed while kworker A was in D state, while waiting for
> > > > > exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
> > > > > its wake_up_all() call; so, task (point 1) was never woken up, as it was
> > > > > waiting on wq index 1.
> > > > > 
> > > > > > to have stopped to prevent any future wakeup from happening, correct?
> > > > > > (Which would make it harder for rcutorture to trigger this, though it
> > > > > > really does have code that attempts to trigger this sort of thing.)
> > > > > > 
> > > > > > Is this theoretical in nature, or have you actually triggered it?
> > > > > > If actually triggered, what did you do to make this happen?
> > > > > 
> > > > > This issue, we had seen previously - 1 instance in May 2018 (on 4.9 kernel),
> > > > > another instance in Nov 2018 (on 4.14 kernel), in our customer reported
> > > > > issues. Both instances were in downstream drivers and we didn't have RCU
> > > > > traces. Now 2 days back, it was reported on 4.19 kernel, with RCU traces
> > > > > enabled, where it was observed in suspend scenario, where we are observing
> > > > > "DPM device timeout" [1], as scsi device is stuck in
> > > > > _synchronize_rcu_expedited().
> > > > > 
> > > > > schedule+0x70/0x90
> > > > > _synchronize_rcu_expedited+0x590/0x5f8
> > > > > synchronize_rcu+0x50/0xa0
> > > > > scsi_device_quiesce+0x50/0x120
> > > > > scsi_bus_suspend+0x70/0xe8
> > > > > dpm_run_callback+0x148/0x388
> > > > > __device_suspend+0x430/0x8a8
> > > > > 
> > > > > [1]
> > > > > https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489
> > > > > 
> > > > > > What have you done to test the change?
> > > > > > 
> > > > > 
> > > > > I have given this for testing; will share the results . Current analysis
> > > > > and patch is based on going through ftrace and code review.
> > > > 
> > > > OK, very good.  Please include the failure information in the changelog
> > > > of the next version of this patch.
> 
> Done.
> 
> > > > 
> > > > I prefer your original patch, that just uses "s", over the one below
> > > > that moves the rcu_exp_gp_seq_end().  The big advantage of your original
> > > > patch is that it allow more concurrency between a consecutive pair of
> > > > expedited RCU grace periods.  Plus it would not be easy to convince
> > > > myself that moving rcu_exp_gp_seq_end() down is safe, so your original
> > > > is also conceptually simpler with a more manageable state space.
> 
> The reason for highlighting the alternate approach of doing gp end inside
> exp_wake_mutex is the requirement of 3 wqs. Now, this is a theoretical case;
> please correct me if I am wrong here:
> 
> 1. task0 holds exp_wake_mutex, and is preempted.

Presumably after it has awakened the kthread that initiated the prior
expedited grace period (the one with seq number = -4).

> 2. task1 initiates new GP (current seq number = 0).

Yes, this can happen.

> 3. task1 queues worker kworker1 and schedules out.

And thus still holds .exp_mutex, but yes.

> 4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex

And thus cannot yet have awakened task1.

> 5. task1 releases exp mutex, w/o entering waitq.

So I do not believe that we can get to #5.  What am I missing here?

> 6. task2 initiates new GP (current seq number = 1).
> 7. task2 queues worker kworker1 and schedules out.
> 8. kworker 2 sets exp GP to 2 and waits on exp_wake_mutex.
> ...
> 
> This sequence would break the requirement of max 3 wqs.
> 
> If we hold the exp_wake_mutex before exp seq end, there will be atmost three
> pending GPs - 1. GP which current owner is doing wakeups for,
> 2. GP which the waiter of exp_wake_mutex would complete, 3. Next GP
> waiters, which started after GP @ point 2 had started. This also is inline
> with the statement in [1]:
> 
> "he key point is that the ->exp_mutex is not released until
> the first wakeup is complete, which means that the ->exp_wake_mutex
> has already been acquired at that point"
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev

And I believe that this still holds.  The task holding .exp_mutex cannot
release it until it has been awakened, and it won't be awakened until after
the .exp_wake_mutex has been acquired.  Again, what am I missing here?

> > > > Please also add the WARN_ON(), though at first glance your change seems
> > > > to have lost the wakeup.  (But it is early, so maybe it is just that I
> > > > am not yet fully awake.)
> > > 
> > > My bad, I posted incomplete diff in previous mail:
> > > 
> > >   static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > >   {
> > >   	struct rcu_node *rnp;
> > > +	unsigned long exp_low;
> > > +	unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > 
> > >   	synchronize_sched_expedited_wait(rsp);
> > >   	rcu_exp_gp_seq_end(rsp);
> > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > unsigned long s)
> > >   			spin_unlock(&rnp->exp_lock);
> > >   		}
> > >   		smp_mb(); /* All above changes before wakeup. */
> > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > +		exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > +		WARN_ON(s_low != exp_low);
> > > +		wake_up_all(&rnp->exp_wq[exp_low]);
> > 
> > Much better!
> > 
> > But I suggest using s_low in the wake_up_all.  This hunk is of course
> > only for testing purposes, not for actual inclusion.  (My earlier email
> > didn't make that clear.)
> > 
> > 							Thanx, Paul
> > 
> 
> Ok sure, got it. I will share the results, once the issue is reproduced.

Sounds good!

							Thanx, Paul

> Thanks
> Neeraj
> 
> > >   	}
> > > 
> > > 
> > > Thanks
> > > Neeraj
> > > 
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > I was thinking of another way of addressing this problem: Doing exp seq end
> > > > > inside exp_wake_mutex. This will also ensure that, if we extend the current
> > > > > scenario and there are multiple expedited GP sequence, which have completed,
> > > > > before exp_wake_mutex is held, we need to preserve the requirement of 3 wq
> > > > > entries [2].
> > > > > 
> > > > > 
> > > > > [2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev
> > > > > 
> > > > > 
> > > > > @@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >           struct rcu_node *rnp;
> > > > > 
> > > > >           synchronize_sched_expedited_wait(rsp);
> > > > > -       rcu_exp_gp_seq_end(rsp);
> > > > > -       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > > > 
> > > > >           /*
> > > > >            * Switch over to wakeup mode, allowing the next GP, but -only- the
> > > > > @@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >            */
> > > > >           mutex_lock(&rsp->exp_wake_mutex);
> > > > > 
> > > > > +       rcu_exp_gp_seq_end(rsp);
> > > > > +       trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
> > > > > +
> > > > > 
> > > > > 
> > > > > 
> > > > > > (Using a WARN_ON() to check for the lower bits of the counter portion
> > > > > > of rcu_state.expedited_sequence differing from the same bits of s
> > > > > > would be one way to detect this problem.)
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > 
> > > > > I have also given the patch for this, for testing:
> > > > > 
> > > > >    static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
> > > > >    {
> > > > >           struct rcu_node *rnp;
> > > > > +       unsigned long exp_low;
> > > > > +       unsigned long s_low = rcu_seq_ctr(s) & 0x3;
> > > > > 
> > > > >           synchronize_sched_expedited_wait(rsp);
> > > > >           rcu_exp_gp_seq_end(rsp);
> > > > > @@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp,
> > > > > unsigned long s)
> > > > >                           spin_unlock(&rnp->exp_lock);
> > > > >                   }
> > > > >                   smp_mb(); /* All above changes before wakeup. */
> > > > > - wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > +               exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
> > > > > +               WARN_ON(s_low != exp_low);
> > > > > +
> > > > > 
> > > > > Thanks
> > > > > Neeraj
> > > > > 
> > > > > > > Fix this by using the correct sequence number for wake_up_all() inside
> > > > > > > rcu_exp_wait_wake().
> > > > > > > 
> > > > > > > Signed-off-by: Neeraj Upadhyay <neeraju@...eaurora.org>
> > > > > > > ---
> > > > > > >     kernel/rcu/tree_exp.h | 2 +-
> > > > > > >     1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > > > index e4b77d3..28979d3 100644
> > > > > > > --- a/kernel/rcu/tree_exp.h
> > > > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > > > @@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
> > > > > > >     			spin_unlock(&rnp->exp_lock);
> > > > > > >     		}
> > > > > > >     		smp_mb(); /* All above changes before wakeup. */
> > > > > > > -		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
> > > > > > > +		wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > > > > > >     	}
> > > > > > >     	trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
> > > > > > >     	mutex_unlock(&rcu_state.exp_wake_mutex);
> > > > > > > -- 
> > > > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> > > > > > > member of the Code Aurora Forum, hosted by The Linux Foundation
> > > > > > > 
> > > > > 
> > > > > -- 
> > > > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > > > the Code Aurora Forum, hosted by The Linux Foundation
> > > 
> > > -- 
> > > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> > > the Code Aurora Forum, hosted by The Linux Foundation
> 
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
> the Code Aurora Forum, hosted by The Linux Foundation