netdev - Re: [PATCH 4/3] rcu: documents rculist

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081119170703.GB6753@linux.vnet.ibm.com>
Date:	Wed, 19 Nov 2008 09:07:03 -0800
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	David Miller <davem@...emloft.net>,
	Corey Minyard <minyard@....org>,
	Stephen Hemminger <shemminger@...tta.com>,
	benny+usenet@...rsen.dk,
	Linux Netdev List <netdev@...r.kernel.org>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Evgeniy Polyakov <zbr@...emap.net>,
	Christian Bell <christian@...i.com>
Subject: Re: [PATCH 4/3] rcu: documents rculist_nulls

On Thu, Nov 13, 2008 at 05:02:48PM +0100, Eric Dumazet wrote:
> Eric Dumazet a écrit :
>> Peter Zijlstra a écrit :
>>> So by not using some memory barriers (would be nice to have it
>>> illustrated which ones), we can race and end up on the wrong chain, in
>>> case that happens we detect this by using this per-chain terminator and
>>> try again.
>>>
>>> It would be really good to have it explained in the rculist_nulls.h
>>> comments what memory barriers are missing, what races they open, and how
>>> the this special terminator trick closes that race.
>> OK, maybe I should add a Documentation/RCU/rculist_nulls.txt file with
>> appropriate examples and documentation.
>> (Say the lookup/insert algorithms, with standard hlist and memory 
>> barriers,
>> and with hlist_nulls without those two memory barriers.
>
> [PATCH 4/3] rcu: documents rculist_nulls

Very good -- only one small suggestion below.

Reviewed-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>

> Adds Documentation/RCU/rculist_nulls.txt file to describe how 'nulls'
> end-of-list can help in some RCU algos.
>
>
> Signed-off-by: Eric Dumazet <dada1@...mosbay.com>
> ---
> Documentation/RCU/rculist_nulls.txt |  167 ++++++++++++++++++++++++++
> 1 files changed, 167 insertions(+)

> diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
> new file mode 100644
> index 0000000..5db5549
> --- /dev/null
> +++ b/Documentation/RCU/rculist_nulls.txt
> @@ -0,0 +1,167 @@
> +Using hlist_nulls to protect read-mostly linked lists and
> +objects using SLAB_DESTROY_BY_RCU allocations.
> +
> +Please read the basics in Documentation/RCU/listRCU.txt
> +
> +Using special makers (called 'nulls') is a convenient way
> +to solve following problem :
> +
> +A typical RCU linked list managing objects which are
> +allocated with SLAB_DESTROY_BY_RCU kmem_cache can
> +use following algos :
> +
> +1) Lookup algo
> +--------------
> +rcu_read_lock()
> +begin:
> +obj = lockless_lookup(key);
> +if (obj) {
> +  if (!try_get_ref(obj)) // might fail for free objects
> +    goto begin;
> +  /*
> +   * Because a writer could delete object, and a writer could
> +   * reuse these object before the RCU grace period, we
> +   * must check key after geting the reference on object
> +   */
> +  if (obj->key != key) { // not the object we expected

In some cases, a "generation number" will be needed.  For example, there
are algorithms that must detect when an object has been removed and then
re-inserted with the same key.  One increments the generation number
on each free and sometimes also on each allocation.

> +     put_ref(obj);
> +     goto begin;
> +   }
> +}
> +rcu_read_unlock();
> +
> +Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
> +but a version with an additional memory barrier (smp_rmb())
> +
> +lockless_lookup(key)
> +{
> +   struct hlist_node *node, *next;
> +   for (pos = rcu_dereference((head)->first);
> +          pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
> +          ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); 
> +          pos = rcu_dereference(next))
> +      if (obj->key == key)
> +         return obj;
> +   return NULL;
> +
> +And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() :
> +
> +   struct hlist_node *node;
> +   for (pos = rcu_dereference((head)->first);
> +		pos && ({ prefetch(pos->next); 1; }) &&
> +		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
> +		pos = rcu_dereference(pos->next))
> +      if (obj->key == key)
> +         return obj;
> +   return NULL;
> +}
> +
> +Quoting Corey Minyard :
> +
> +"If the object is moved from one list to another list in-between the
> + time the hash is calculated and the next field is accessed, and the
> + object has moved to the end of a new list, the traversal will not
> + complete properly on the list it should have, since the object will
> + be on the end of the new list and there's not a way to tell it's on a
> + new list and restart the list traversal.  I think that this can be
> + solved by pre-fetching the "next" field (with proper barriers) before
> + checking the key."
> +
> +2) Insert algo :
> +----------------
> +
> +We need to make sure a reader cannot read the new 'obj->obj_next' value
> +and previous value of 'obj->key'. Or else, an item could be deleted
> +from a chain, and inserted into another chain. If new chain was empty
> +before the move, 'next' pointer is NULL, and lockless reader can
> +not detect it missed following items in original chain.
> +
> +/*
> + * Please note that new inserts are done at the head of list,
> + * not in the middle or end.
> + */
> +obj = kmem_cache_alloc(...);
> +lock_chain(); // typically a spin_lock()
> +obj->key = key;
> +atomic_inc(&obj->refcnt);
> +/*
> + * we need to make sure obj->key is updated before obj->next
> + */
> +smp_wmb();
> +hlist_add_head_rcu(&obj->obj_node, list);
> +unlock_chain(); // typically a spin_unlock()
> +
> +
> +3) Remove algo
> +--------------
> +Nothing special here, we can use a standard RCU hlist deletion.
> +But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused 
> +very very fast (before the end of RCU grace period)
> +
> +if (put_last_reference_on(obj) {
> +   lock_chain(); // typically a spin_lock()
> +   hlist_del_init_rcu(&obj->obj_node);
> +   unlock_chain(); // typically a spin_unlock()
> +   kmem_cache_free(cachep, obj);
> +}
> +
> +
> +
> +--------------------------------------------------------------------------
> +With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
> +and extra smp_wmb() in insert function.
> +
> +For example, if we choose to store the slot number as the 'nulls'
> +end-of-list marker for each slot of the hash table, we can detect
> +a race (some writer did a delete and/or a move of an object
> +to another chain) checking the final 'nulls' value if
> +the lookup met the end of chain. If final 'nulls' value
> +is not the slot number, then we must restart the lookup at
> +the begining. If the object was moved to same chain,
> +then the reader doesnt care : It might eventually
> +scan the list again without harm.
> +
> +
> +1) lookup algo
> +
> + head = &table[slot];
> + rcu_read_lock();
> +begin:
> + hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
> +   if (obj->key == key) {
> +      if (!try_get_ref(obj)) // might fail for free objects
> +         goto begin;
> +      if (obj->key != key) { // not the object we expected
> +         put_ref(obj);
> +         goto begin;
> +      }
> +  goto out;
> + }
> +/*
> + * if the nulls value we got at the end of this lookup is
> + * not the expected one, we must restart lookup.
> + * We probably met an item that was moved to another chain.
> + */
> + if (get_nulls_value(node) != slot)
> +   goto begin;
> + obj = NULL;
> +
> +out:
> + rcu_read_unlock();
> +
> +2) Insert function :
> +--------------------
> +
> +/*
> + * Please note that new inserts are done at the head of list,
> + * not in the middle or end.
> + */
> +obj = kmem_cache_alloc(cachep);
> +lock_chain(); // typically a spin_lock()
> +obj->key = key;
> +atomic_set(&obj->refcnt, 1);
> +/*
> + * insert obj in RCU way (readers might be traversing chain)
> + */
> +hlist_nulls_add_head_rcu(&obj->obj_node, list);
> +unlock_chain(); // typically a spin_unlock()

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html