[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <4909E0A4.7060009@acm.org>
Date: Thu, 30 Oct 2008 11:28:20 -0500
From: Corey Minyard <minyard@....org>
To: Stephen Hemminger <shemminger@...tta.com>
Cc: Eric Dumazet <dada1@...mosbay.com>, paulmck@...ux.vnet.ibm.com,
David Miller <davem@...emloft.net>, benny+usenet@...rsen.dk,
netdev@...r.kernel.org,
Christoph Lameter <cl@...ux-foundation.org>,
a.p.zijlstra@...llo.nl, johnpol@....mipt.ru,
Christian Bell <christian@...i.com>
Subject: Re: [PATCH] udp: Introduce special NULL pointers for hlist termination
Stephen Hemminger wrote:
> On Thu, 30 Oct 2008 16:40:01 +0100
> Eric Dumazet <dada1@...mosbay.com> wrote:
>
>
>> Eric Dumazet a écrit :
>>
>>> Paul E. McKenney a écrit :
>>>
>>>> On Wed, Oct 29, 2008 at 09:00:13PM +0100, Eric Dumazet wrote:
>>>>
>>>>> Hum... Another way of handling all those cases and avoid memory barriers
>>>>> would be to have different "NULL" pointers.
>>>>>
>>>>> Each hash chain should have a unique "NULL" pointer (in the case of
>>>>> UDP, it
>>>>> can be the 128 values : [ (void*)0 .. (void *)127 ]
>>>>>
>>>>> Then, when performing a lookup, a reader should check the "NULL" pointer
>>>>> it get at the end of its lookup has is the "hash" value of its chain.
>>>>>
>>>>> If not -> restart the loop, aka "goto begin;" :)
>>>>>
>>>>> We could avoid memory barriers then.
>>>>>
>>>>> In the two cases Corey mentioned, this trick could let us avoid
>>>>> memory barriers.
>>>>> (existing one in sk_add_node_rcu(sk, &hslot->head); should be enough)
>>>>>
>>>>> What do you think ?
>>>>>
>>>> Kinky!!! ;-)
>>>>
>>>> Then the rcu_dereference() would be supplying the needed memory barriers.
>>>>
>>>> Hmmm... I guess that the only confusion would be if the element got
>>>> removed and then added to the same list. But then if its pointer was
>>>> pseudo-NULL, then that would mean that all subsequent elements had been
>>>> removed, and all preceding ones added after the scan started.
>>>>
>>>> Which might well be harmless, but I must defer to you on this one at
>>>> the moment.
>>>>
>>>> If you need a larger hash table, another approach would be to set the
>>>> pointer's low-order bit, allowing the upper bits to be a full-sized
>>>> index -- or even a pointer to the list header. Just make very sure
>>>> to clear the pointer when freeing, or an element on the freelist
>>>> could end up looking like a legitimate end of list... Which again
>>>> might well be safe, but why inflict this on oneself?
>>>>
>> Ok, here is an updated and tested patch.
>>
>> Thanks everybody
>>
>> [PATCH] udp: Introduce special NULL pointers for hlist termination
>>
>> In order to safely detect changes in chains, we would like to have different
>> 'NULL' pointers. Each chain in hash table is terminated by an unique 'NULL'
>> value, so that the lockless readers can detect their lookups evaded from
>> their starting chain.
>>
>> We introduce a new type of hlist implementation, named hlist_nulls, were
>> we use the least significant bit of the 'ptr' to tell if its a "NULL" value
>> or a pointer to an object. We expect to use this new hlist variant for TCP
>> as well.
>>
>> For UDP/UDP-Lite hash table, we use 128 different "NULL" values,
>> (UDP_HTABLE_SIZE=128)
>>
>> Using hlist_nulls saves memory barriers (a read barrier to fetch 'next'
>> pointers *before* checking key values) we added in commit
>> 96631ed16c514cf8b28fab991a076985ce378c26
>> (udp: introduce sk_for_each_rcu_safenext())
>>
>> This also saves a write memory barrier in udp_lib_get_port(), between
>> sk->sk_hash update and sk->next update)
>>
>> Signed-off-by: Eric Dumazet <dada1@...mosbay.com>
>> ---
>>
>
> IMHO this goes over the edge into tricky hack. Is it really worth it?
> Is there a better simpler way?
>
The only think I've thought of is to do a single smp_rmb() after the
loop scanning the list and check the sk_hash value again. That's better
than the read barrier for every list element, but still not as good as
this list from a performance point of view.
IMHO, this is a tricky hack, but it if is well abstracted and documented
I think it's ok. I'd guess something like this will become more often
used as we get larger numbers of processors on systems.
It is annoying that it doesn't help the performance for multicast.
However, I think the current patch will solve the DOS issue for
multicast, since it switches to a normal spinlock and has a per-list lock.
-corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists