[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081029185200.GE6732@linux.vnet.ibm.com>
Date: Wed, 29 Oct 2008 11:52:00 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Corey Minyard <minyard@....org>
Cc: Eric Dumazet <dada1@...mosbay.com>,
David Miller <davem@...emloft.net>, shemminger@...tta.com,
benny+usenet@...rsen.dk, netdev@...r.kernel.org,
Christoph Lameter <cl@...ux-foundation.org>,
a.p.zijlstra@...llo.nl, johnpol@....mipt.ru,
Christian Bell <christian@...i.com>
Subject: Re: [PATCH 2/2] udp: RCU handling for Unicast packets.
On Wed, Oct 29, 2008 at 01:28:15PM -0500, Corey Minyard wrote:
> Eric Dumazet wrote:
>> Corey Minyard a écrit :
>>> Paul E. McKenney wrote:
>>>> On Wed, Oct 29, 2008 at 05:09:53PM +0100, Eric Dumazet wrote:
>>>>
>>>>> Corey Minyard a écrit :
>>>>>
>>>>>> Eric Dumazet wrote:
>>>>>>
>>>>>>> Corey Minyard found a race added in commit
>>>>>>> 271b72c7fa82c2c7a795bc16896149933110672d
>>>>>>> (udp: RCU handling for Unicast packets.)
>>>>>>>
>>>>>>> "If the socket is moved from one list to another list in-between the
>>>>>>> time the hash is calculated and the next field is accessed, and the
>>>>>>> socket has moved to the end of the new list, the traversal will not
>>>>>>> complete properly on the list it should have, since the socket will
>>>>>>> be on the end of the new list and there's not a way to tell it's on
>>>>>>> a new list and restart the list traversal. I think that this can be
>>>>>>> solved by pre-fetching the "next" field (with proper barriers)
>>>>>>> before checking the hash."
>>>>>>>
>>>>>>> This patch corrects this problem, introducing a new
>>>>>>> sk_for_each_rcu_safenext()
>>>>>>> macro.
>>>>>>>
>>>>>> You also need the appropriate smp_wmb() in udp_lib_get_port() after
>>>>>> sk_hash is set, I think, so the next field is guaranteed to be changed
>>>>>> after the hash value is changed.
>>>>>>
>>>>> Not sure about this one Corey.
>>>>>
>>>>> If a reader catches previous value of item->sk_hash, two cases are to
>>>>> be taken into :
>>>>>
>>>>> 1) its udp_hashfn(net, sk->sk_hash) is != hash -> goto begin : Reader
>>>>> will redo its scan
>>>>>
>>>>> 2) its udp_hashfn(net, sk->sk_hash) is == hash
>>>>> -> next pointer is good enough : it points to next item in same hash
>>>>> chain.
>>>>> No need to rescan the chain at this point.
>>>>> Yes we could miss the fact that a new port was bound and this UDP
>>>>> message could be lost.
>>>>>
>>>>
>>>> 3) its udp_hashfn(net, sk-sk_hash) is == hash, but only because it was
>>>> removed, freed, reallocated, and then readded with the same hash value,
>>>> possibly carrying the reader to a new position in the same list.
>>>>
>>> If I understand this, without the smp_wmb(), it is possible that the next
>>> field can be written to main memory before the hash value is written. If
>>> that happens, the following can occur:
>>>
>>> CPU1 CPU2
>>> next is set to NULL (end of new list)
>>
>> Well, if this item is injected to the same chain, next wont be set to
>> NULL.
>>
>> That would mean previous writers deleted all items from the chain.
> I put my comment in the wrong place, I wasn't talking about being injected
> into the same chain.
>
>>
>> In this case, readers can see NULL, it is not a problem at all.
>> List is/was empty.
>> An application cannot complain a packet is not
>> handled if its bind() syscall is not yet completed :)
>>
>> If item is injected on another chain, we will detect hash mismatch and
>> redo full scan.
> If the item is injected onto the end of another chain, the next field will
> be NULL and you won't detect a hash mismatch. It's basically the same
> issue as the previous race, though a lot more subtle and unlikely. If you
> get (from the previous socket) an old value of "sk_hash" and (from the new
> socket) a new value of "next" that is NULL, you will terminate the loop
> when you should have restarted it. I'm pretty sure that can occur without
> the write barrier.
One way of dealing with this is to keep a tail pointer. Then, if the
element containing the NULL pointer doesn't match the tail pointer seen
at the start of the search, or if the tail pointer has changed,
restart the search. Memory barriers will be needed. ;-)
Thanx, Paul
>>> fetch next
>>> calculate hash and compare to sk_hash
>>> sk_hash is set to new value
>>>
>>> So I think in the above cases, your case #2 is not necessarily valid
>>> without the barrier.
>>>
>>> And another possible issue. If sk_hash is written before next, and CPU1
>>> is interrupted before CPU2, CPU2 will continually spin on the list until
>>> CPU1 comes back and moves it to the new list. Note sure if that is an
>>> issue.
>>
>> Probably not. Previously, readers were spining on read_lock(), when a
>> writer was inside its critical section (write_lock()/write_unlock()).
>> So instead of spining inside read_unlock(), issuing stupid memory
>> transactions, the readers can now spin reading hash chain and populate
>> cpu cache :)
> Yes, I thought about that and thought I would point it out, but I agree,
> what you have is certainly better than spinning on a lock :).
>
>
> -corey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists