Message-ID: <49B4B909.7050002@cosmosbay.com>
Date: Mon, 09 Mar 2009 07:36:57 +0100
From: Eric Dumazet <dada1@...mosbay.com>
To: David Miller <davem@...emloft.net>
CC: kchang@...enacr.com, netdev@...r.kernel.org,
cl@...ux-foundation.org, bmb@...enacr.com
Subject: Re: Multicast packet loss
David Miller wrote:
> From: Eric Dumazet <dada1@...mosbay.com>
> Date: Sun, 08 Mar 2009 17:46:13 +0100
>
>> + if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
>> + if (in_softirq()) {
>> + if (!softirq_del(&sk->sk_del, sock_readable_defer))
>> + goto unlock;
>> + return;
>> + }
>
> This is interesting.
>
> I think you should make softirq_del() more flexible. Make it the
> socket's job to make sure it doesn't try to defer different
> functions, and put the onus on locking there as well.
>
> The cmpxchg() and all of this checking is just wasted work.
>
> I'd really like to get rid of that callback lock too, then we'd
> really be in business. :-)
First, thanks for your review David.

I chose cmpxchg() because I needed some form of exclusion here.
I first added a spinlock inside "struct softirq_del", then I realized
I could use cmpxchg() instead and keep the structure small. As the
synchronization is only needed at queueing time, we could pass
the address of a spinlock XXX to the softirq_del() call.

Also, when an event was queued for later invocation, I needed to keep
a reference on "struct socket" to make sure it doesn't disappear before
the invocation. Not all sockets are RCU guarded (we added RCU only for
some protocols: TCP, UDP, ...). So I found keeping a read_lock
on the callback was the easiest thing to do. I now realize we might
overflow preempt_count, so special care is needed.
About your first point, maybe we should make softirq_del() (poor name, I confess)
take only one argument (a pointer to struct softirq_del), and initialize
the function pointer at socket init time. That would ensure "struct softirq_del"
is associated with one callback only. The cmpxchg() test would then have to be
done on the "next" field (or use the spinlock XXX).
I am not sure the output path needs such tricks, since threads rarely
block on output: we don't trigger 400,000 wakeups per second there, do we?
Another point: I did a tbench test and got 2517 MB/s with the patch,
instead of 2538 MB/s (using Linus' 2.6 git tree); that's a ~0.8% regression
for this workload.