linux-kernel - Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4F4DF8E4.1070100@cn.fujitsu.com>
Date:	Wed, 29 Feb 2012 18:07:32 +0800
From:	Lai Jiangshan <laijs@...fujitsu.com>
To:	paulmck@...ux.vnet.ibm.com
CC:	linux-kernel@...r.kernel.org, mingo@...e.hu, dipankar@...ibm.com,
	akpm@...ux-foundation.org, mathieu.desnoyers@...ymtl.ca,
	josh@...htriplett.org, niv@...ibm.com, tglx@...utronix.de,
	peterz@...radead.org, rostedt@...dmis.org, Valdis.Kletnieks@...edu,
	dhowells@...hat.com, eric.dumazet@...il.com, darren@...art.com,
	fweisbec@...il.com, patches@...aro.org
Subject: Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
>> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
>>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
>>>> From: Lai Jiangshan <laijs@...fujitsu.com>
>>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
>>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
>>>>
>>>> This patch implement the algorithm as Peter's:
>>>> https://lkml.org/lkml/2012/2/1/119
>>>>
>>>> o	Make the checking lock-free and we can perform parallel checking,
>>>> 	Although almost parallel checking makes no sense, but we need it
>>>> 	when 1) the original checking task is preempted for long, 2)
>>>> 	sychronize_srcu_expedited(), 3) avoid lock(see next)
>>>>
>>>> o	Since it is lock-free, we save a mutex in state machine for
>>>> 	call_srcu().
>>>>
>>>> o	Remove the SRCU_REF_MASK and remove the coupling with the flipping.
>>>> 	(so we can remove the preempt_disable() in future, but use
>>>> 	 __this_cpu_inc() instead.)
>>>>
>>>> o	reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
>>>> 	more intuitive.
>>>
>>> Hello, Lai,
>>>
>>> Interesting approach!
>>>
>>> What happens given the following sequence of events?
>>>
>>> o	CPU 0 in srcu_readers_active_idx_check() invokes
>>> 	srcu_readers_seq_idx(), getting some number back.
>>>
>>> o	CPU 0 invokes srcu_readers_active_idx(), summing the
>>> 	->c[] array up through CPU 3.
>>>
>>> o	CPU 1 invokes __srcu_read_lock(), and increments its counter
>>> 	but not yet its ->seq[] element.
>>
>>
>> Any __srcu_read_lock() whose increment of active counter is not seen
>> by srcu_readers_active_idx() is considerred as
>> "reader-started-after-this-srcu_readers_active_idx_check()",
>> We don't need to wait.
>>
>> As you said, this srcu C.S 's increment seq is not seen by above
>> srcu_readers_seq_idx().
>>
>>>
>>> o	CPU 0 completes its summing of the ->c[] array, incorrectly
>>> 	obtaining zero.
>>>
>>> o	CPU 0 invokes srcu_readers_seq_idx(), getting the same
>>> 	number back that it got last time.
>>
>> If it incorrectly get zero, it means __srcu_read_unlock() is seen
>> in srcu_readers_active_idx(), and it means the increment of
>> seq is seen in this srcu_readers_seq_idx(), it is different
>> from the above seq that it got last time.
>>
>> increment of seq is not seen by above srcu_readers_seq_idx(),
>> but is seen by later one, so the two returned seq is different,
>> this is the core of Peter's algorithm, and this was written
>> in the comments(Sorry for my bad English). Or maybe I miss
>> your means in this mail.
> 
> OK, good, this analysis agrees with what I was thinking.
> 
> So my next question is about the lock freedom.  This lock freedom has to
> be limited in nature and carefully implemented.  The reasons for this are:
> 
> 1.	Readers can block in any case, which can of course block both
> 	synchronize_srcu_expedited() and synchronize_srcu().
> 
> 2.	Because only one CPU at a time can be incrementing ->completed,
> 	some sort of lock with preemption disabling will of course be
> 	needed.  Alternatively, an rt_mutex could be used for its
> 	priority-inheritance properties.
> 
> 3.	Once some CPU has incremented ->completed, all CPUs that might
> 	still be summing up the old indexes must stop.  If they don't,
> 	they might incorrectly call a too-short grace period in case of
> 	->seq[]-sum overflow on 32-bit systems.
> 
> Or did you have something else in mind?

When flip happens when check_zero, this check_zero will no be
committed even it is success.

I play too much with lock-free for call_srcu(), the code becomes complicated,
I just give up lock-free for call_srcu(), the main aim of call_srcu() is simple.

(But I still like Peter's approach, it has some other good thing
besides lock-free-checking, if you don't like it, I will send
another patch to fix srcu_readers_active())

Thanks,
Lai

> 
> 							Thanx, Paul
> 
>> Thanks,
>> Lai
>>
>>>
>>> o	In parallel with the previous step, CPU 1 executes out of order
>>> 	(as permitted by the lack of a second memory barrier in
>>> 	__srcu_read_lock()), starting up the critical section before
>>> 	incrementing its ->seq[] element.
>>>
>>> o	Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
>>> 	completes the SRCU grace period before CPU 1 completes its
>>> 	SRCU read-side critical section.
>>>
>>> This actually might be safe, but I need to think more about it.  In the
>>> meantime, I figured I should ask your thoughts.
>>>
>>> 							Thanx, Paul
>>>
>>>> Inspired-by: Peter Zijlstra <peterz@...radead.org>
>>>> Signed-off-by: Lai Jiangshan <laijs@...fujitsu.com>
>>>> ---
>>>>  include/linux/srcu.h |    7 +--
>>>>  kernel/srcu.c        |  137 ++++++++++++++++++++-----------------------------
>>>>  2 files changed, 57 insertions(+), 87 deletions(-)
>>>>
>>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>>>> index 5b49d41..15354db 100644
>>>> --- a/include/linux/srcu.h
>>>> +++ b/include/linux/srcu.h
>>>> @@ -32,18 +32,13 @@
>>>>
>>>>  struct srcu_struct_array {
>>>>  	unsigned long c[2];
>>>> +	unsigned long seq[2];
>>>>  };
>>>>
>>>> -/* Bit definitions for field ->c above and ->snap below. */
>>>> -#define SRCU_USAGE_BITS		1
>>>> -#define SRCU_REF_MASK		(ULONG_MAX >> SRCU_USAGE_BITS)
>>>> -#define SRCU_USAGE_COUNT	(SRCU_REF_MASK + 1)
>>>> -
>>>>  struct srcu_struct {
>>>>  	unsigned completed;
>>>>  	struct srcu_struct_array __percpu *per_cpu_ref;
>>>>  	struct mutex mutex;
>>>> -	unsigned long snap[NR_CPUS];
>>>>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>>>  	struct lockdep_map dep_map;
>>>>  #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>>> diff --git a/kernel/srcu.c b/kernel/srcu.c
>>>> index 47ee35d..376b583 100644
>>>> --- a/kernel/srcu.c
>>>> +++ b/kernel/srcu.c
>>>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>>>>  #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
>>>>
>>>>  /*
>>>> + * Returns approximate total sequence of readers on the specified rank
>>>> + * of per-CPU counters.
>>>> + */
>>>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
>>>> +{
>>>> +	int cpu;
>>>> +	unsigned long sum = 0;
>>>> +	unsigned long t;
>>>> +
>>>> +	for_each_possible_cpu(cpu) {
>>>> +		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
>>>> +		sum += t;
>>>> +	}
>>>> +	return sum;
>>>> +}
>>>> +
>>>> +/*
>>>>   * Returns approximate number of readers active on the specified rank
>>>> - * of per-CPU counters.  Also snapshots each counter's value in the
>>>> - * corresponding element of sp->snap[] for later use validating
>>>> - * the sum.
>>>> + * of per-CPU counters.
>>>>   */
>>>>  static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>>>  {
>>>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
>>>>  	for_each_possible_cpu(cpu) {
>>>>  		t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
>>>>  		sum += t;
>>>> -		sp->snap[cpu] = t;
>>>>  	}
>>>> -	return sum & SRCU_REF_MASK;
>>>> +	return sum;
>>>>  }
>>>>
>>>> -/*
>>>> - * To be called from the update side after an index flip.  Returns true
>>>> - * if the modulo sum of the counters is stably zero, false if there is
>>>> - * some possibility of non-zero.
>>>> - */
>>>>  static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>>>  {
>>>>  	int cpu;
>>>> +	unsigned long seq;
>>>> +
>>>> +	seq = srcu_readers_seq_idx(sp, idx);
>>>> +
>>>> +	/*
>>>> +	 * smp_mb() A pairs with smp_mb() B for critical section.
>>>> +	 * It ensures that the SRCU read-side critical section whose
>>>> +	 * read-lock is not seen by the following srcu_readers_active_idx()
>>>> +	 * will see any updates that before the current task performed before.
>>>> +	 * (So we don't need to care these readers this time)
>>>> +	 *
>>>> +	 * Also, if we see the increment of the seq, we must see the
>>>> +	 * increment of the active counter in the following
>>>> +	 * srcu_readers_active_idx().
>>>> +	 */
>>>> +	smp_mb(); /* A */
>>>>
>>>>  	/*
>>>>  	 * Note that srcu_readers_active_idx() can incorrectly return
>>>>  	 * zero even though there is a pre-existing reader throughout.
>>>>  	 * To see this, suppose that task A is in a very long SRCU
>>>>  	 * read-side critical section that started on CPU 0, and that
>>>> -	 * no other reader exists, so that the modulo sum of the counters
>>>> +	 * no other reader exists, so that the sum of the counters
>>>>  	 * is equal to one.  Then suppose that task B starts executing
>>>>  	 * srcu_readers_active_idx(), summing up to CPU 1, and then that
>>>>  	 * task C starts reading on CPU 0, so that its increment is not
>>>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
>>>>  		return false;
>>>>
>>>>  	/*
>>>> -	 * Since the caller recently flipped ->completed, we can see at
>>>> -	 * most one increment of each CPU's counter from this point
>>>> -	 * forward.  The reason for this is that the reader CPU must have
>>>> -	 * fetched the index before srcu_readers_active_idx checked
>>>> -	 * that CPU's counter, but not yet incremented its counter.
>>>> -	 * Its eventual counter increment will follow the read in
>>>> -	 * srcu_readers_active_idx(), and that increment is immediately
>>>> -	 * followed by smp_mb() B.  Because smp_mb() D is between
>>>> -	 * the ->completed flip and srcu_readers_active_idx()'s read,
>>>> -	 * that CPU's subsequent load of ->completed must see the new
>>>> -	 * value, and therefore increment the counter in the other rank.
>>>> -	 */
>>>> -	smp_mb(); /* A */
>>>> -
>>>> -	/*
>>>> -	 * Now, we check the ->snap array that srcu_readers_active_idx()
>>>> -	 * filled in from the per-CPU counter values. Since
>>>> -	 * __srcu_read_lock() increments the upper bits of the per-CPU
>>>> -	 * counter, an increment/decrement pair will change the value
>>>> -	 * of the counter.  Since there is only one possible increment,
>>>> -	 * the only way to wrap the counter is to have a huge number of
>>>> -	 * counter decrements, which requires a huge number of tasks and
>>>> -	 * huge SRCU read-side critical-section nesting levels, even on
>>>> -	 * 32-bit systems.
>>>> -	 *
>>>> -	 * All of the ways of confusing the readings require that the scan
>>>> -	 * in srcu_readers_active_idx() see the read-side task's decrement,
>>>> -	 * but not its increment.  However, between that decrement and
>>>> -	 * increment are smb_mb() B and C.  Either or both of these pair
>>>> -	 * with smp_mb() A above to ensure that the scan below will see
>>>> -	 * the read-side tasks's increment, thus noting a difference in
>>>> -	 * the counter values between the two passes.
>>>> +	 * Validation step, smp_mb() D pairs with smp_mb() C. If the above
>>>> +	 * srcu_readers_active_idx() see a decrement of the active counter
>>>> +	 * in srcu_read_unlock(), it should see one of these for corresponding
>>>> +	 * srcu_read_lock():
>>>> +	 * 	See the increment of the active counter,
>>>> +	 * 	Failed to see the increment of the active counter.
>>>> +	 * The second one can cause srcu_readers_active_idx() incorrectly
>>>> +	 * return zero, but it means the above srcu_readers_seq_idx() does not
>>>> +	 * see the increment of the seq(ref: comments of smp_mb() A),
>>>> +	 * and the following srcu_readers_seq_idx() sees the increment of
>>>> +	 * the seq. The seq is changed.
>>>>  	 *
>>>> -	 * Therefore, if srcu_readers_active_idx() returned zero, and
>>>> -	 * none of the counters changed, we know that the zero was the
>>>> -	 * correct sum.
>>>> -	 *
>>>> -	 * Of course, it is possible that a task might be delayed
>>>> -	 * for a very long time in __srcu_read_lock() after fetching
>>>> -	 * the index but before incrementing its counter.  This
>>>> -	 * possibility will be dealt with in __synchronize_srcu().
>>>> +	 * This smp_mb() D pairs with smp_mb() C for critical section.
>>>> +	 * then any of the current task's subsequent code will happen after
>>>> +	 * that SRCU read-side critical section whose read-unlock is seen in
>>>> +	 * srcu_readers_active_idx().
>>>>  	 */
>>>> -	for_each_possible_cpu(cpu)
>>>> -		if (sp->snap[cpu] !=
>>>> -		    ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
>>>> -			return false;  /* False zero reading! */
>>>> -	return true;
>>>> +	smp_mb(); /* D */
>>>> +
>>>> +	return srcu_readers_seq_idx(sp, idx) == seq;
>>>>  }
>>>>
>>>>  /**
>>>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
>>>>  	preempt_disable();
>>>>  	idx = rcu_dereference_index_check(sp->completed,
>>>>  					  rcu_read_lock_sched_held()) & 0x1;
>>>> -	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
>>>> -		SRCU_USAGE_COUNT + 1;
>>>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
>>>>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>>>> +	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
>>>>  	preempt_enable();
>>>>  	return idx;
>>>>  }
>>>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>>>  	int trycount = 0;
>>>>
>>>>  	/*
>>>> -	 * If a reader fetches the index before the ->completed increment,
>>>> -	 * but increments its counter after srcu_readers_active_idx_check()
>>>> -	 * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
>>>> -	 * smp_mb() B to ensure that the SRCU read-side critical section
>>>> -	 * will see any updates that the current task performed before its
>>>> -	 * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
>>>> -	 * as the case may be.
>>>> -	 */
>>>> -	smp_mb(); /* D */
>>>> -
>>>> -	/*
>>>>  	 * SRCU read-side critical sections are normally short, so wait
>>>>  	 * a small amount of time before possibly blocking.
>>>>  	 */
>>>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
>>>>  				schedule_timeout_interruptible(1);
>>>>  		}
>>>>  	}
>>>> -
>>>> -	/*
>>>> -	 * The following smp_mb() E pairs with srcu_read_unlock()'s
>>>> -	 * smp_mb C to ensure that if srcu_readers_active_idx_check()
>>>> -	 * sees srcu_read_unlock()'s counter decrement, then any
>>>> -	 * of the current task's subsequent code will happen after
>>>> -	 * that SRCU read-side critical section.
>>>> -	 *
>>>> -	 * It also ensures the order between the above waiting and
>>>> -	 * the next flipping.
>>>> -	 */
>>>> -	smp_mb(); /* E */
>>>>  }
>>>>
>>>>  static void srcu_flip(struct srcu_struct *sp)
>>>> -- 
>>>> 1.7.4.4
>>>>
>>>
>>>
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/