linux-kernel - Re: [PATCH 01/11] rcu: avoid leaking exp_deferred

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2cf71e70-4cb3-57f8-f542-69ddf04106dd@linux.alibaba.com>
Date:   Fri, 1 Nov 2019 02:19:13 +0800
From:   Lai Jiangshan <laijs@...ux.alibaba.com>
To:     paulmck@...nel.org
Cc:     linux-kernel@...r.kernel.org,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Joel Fernandes <joel@...lfernandes.org>, rcu@...r.kernel.org
Subject: Re: [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP



On 2019/10/31 9:43 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:07:56AM +0000, Lai Jiangshan wrote:
>> If exp_deferred_qs is incorrectly set and leaked to the next
>> exp GP, it may cause the next GP to be incorrectly prematurely
>> completed.
> 
> Could you please provide the sequence of events leading to a such a
> failure?

I just felt nervous with "leaking" exp_deferred_qs.
I didn't careful consider the sequence of events.

Now it proves that I must have misunderstood the exp_deferred_qs.
So call "leaking" is wrong concept, preempt_disable()
is considered as rcu_read_lock() and exp_deferred_qs
needs to be set.

Thanks
Lai

============don't need to read:

read_read_lock()
// other cpu start exp GP_A
preempt_schedule() // queue itself
read_read_unlock() //report qs, other cpu is sending ipi to me
preempt_disable
   rcu_exp_handler() interrupt for GP_A and leave a exp_deferred_qs
   // exp GP_A finished
   ---------------above is one possible way to leave a exp_deferred_qs
preempt_enable()
  interrupt before preempt_schedule()
   read_read_lock()
   read_read_unlock()
    NESTED interrupt when nagative rcu_read_lock_nesting
     read_read_lock()
     // other cpu start exp GP_B
     NESTED interrupt for rcu_flavor_sched_clock_irq()
      report exq qs since rcu_read_lock_nesting <0 and \
      exp_deferred_qs is true
     // exp GP_B complete
     read_read_unlock()

This plausible sequence relies on NESTED interrupt too,
and can be avoided by patch2 if NESTED interrupt were allowed.

> 
> Also, did you provoke such a failure in testing?  If so, an upgrade
> to rcutorture would be good, so please tell me what you did to make
> the failure happen.
> 
> I do like the reduction in state space, but I am a bit concerned about
> the potential increase in contention on rnp->lock.  Thoughts?
> 
> 							Thanx, Paul
> 
>> Signed-off-by: Lai Jiangshan <laijs@...ux.alibaba.com>
>> ---
>>   kernel/rcu/tree_exp.h | 23 ++++++++++++++---------
>>   1 file changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index a0e1e51c51c2..6dec21909b30 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -603,6 +603,18 @@ static void rcu_exp_handler(void *unused)
>>   	struct rcu_node *rnp = rdp->mynode;
>>   	struct task_struct *t = current;
>>   
>> +	/*
>> +	 * Note that there is a large group of race conditions that
>> +	 * can have caused this quiescent state to already have been
>> +	 * reported, so we really do need to check ->expmask first.
>> +	 */
>> +	raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> +	if (!(rnp->expmask & rdp->grpmask)) {
>> +		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> +		return;
>> +	}
>> +	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> +
>>   	/*
>>   	 * First, the common case of not being in an RCU read-side
>>   	 * critical section.  If also enabled or idle, immediately
>> @@ -628,17 +640,10 @@ static void rcu_exp_handler(void *unused)
>>   	 * a future context switch.  Either way, if the expedited
>>   	 * grace period is still waiting on this CPU, set ->deferred_qs
>>   	 * so that the eventual quiescent state will be reported.
>> -	 * Note that there is a large group of race conditions that
>> -	 * can have caused this quiescent state to already have been
>> -	 * reported, so we really do need to check ->expmask.
>>   	 */
>>   	if (t->rcu_read_lock_nesting > 0) {
>> -		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> -		if (rnp->expmask & rdp->grpmask) {
>> -			rdp->exp_deferred_qs = true;
>> -			t->rcu_read_unlock_special.b.exp_hint = true;
>> -		}
>> -		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> +		rdp->exp_deferred_qs = true;
>> +		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
>>   		return;
>>   	}
>>   
>> -- 
>> 2.20.1
>>