Message-ID: <536C4733.9020704@redhat.com>
Date: Fri, 09 May 2014 11:10:43 +0800
From: Jason Wang <jasowang@...hat.com>
To: Xi Wang <xii@...gle.com>
CC: "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
Maxim Krasnyansky <maxk@....qualcomm.com>,
Neal Cardwell <ncardwell@...gle.com>,
Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH] net-tun: restructure tun_do_read for better sleep/wakeup efficiency
On 05/09/2014 02:22 AM, Xi Wang wrote:
> On Tue, May 6, 2014 at 8:40 PM, Jason Wang <jasowang@...hat.com> wrote:
>> On 05/07/2014 08:24 AM, Xi Wang wrote:
>>> tun_do_read always adds the current thread to the wait queue, even if a
>>> packet is ready to be read. This is inefficient because both the sleeper
>>> and the waker contend for the wait queue spin lock when the packet rate is high.
>> After commit 61a5ff15ebdab87887861a6b128b108404e4706d, this will only
>> help blocking reads. It looks like performance-critical userspace will
>> use non-blocking reads anyway.
>>> We restructure the read function and use the common kernel networking
>>> routines to handle receive, sleep and wakeup. With this change,
>>> available packets are checked first, before the reading thread is added
>>> to the wait queue.
>> This is interesting, since it may help if we want to add an rx busy loop
>> to tun. (In fact I worked on a similar patch.)
>
> Yes, this should be a good side effect and I am also interested in trying it.
> Busy polling in user space is not ideal, as it doesn't give the lowest latency.
> Besides differences in interrupt latency etc., there is a bad case in
> non-blocking mode: when a packet arrives right before the polling thread
> returns to userspace, the control flow has to cross the kernel/userspace
> boundary 3 times before the packet can be processed, while kernel-side
> blocking or busy polling needs only 1 boundary crossing.
So if we want to implement this, we need a feature bit to turn it on.
Then vhost may benefit from this.
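For reference, the userspace busy polling Xi describes is roughly the pattern
below. This is only a sketch: tap_fd, buf and handle_packet() are made-up
names, and the fd is assumed to be opened with O_NONBLOCK.

	#include <errno.h>
	#include <unistd.h>

	/* Spin on a non-blocking tap fd; handle_packet() is a hypothetical consumer. */
	static void busy_poll_loop(int tap_fd)
	{
		char buf[2048];

		for (;;) {
			ssize_t n = read(tap_fd, buf, sizeof(buf));
			if (n > 0) {
				handle_packet(buf, n);
				continue;
			}
			if (n < 0 && errno != EAGAIN)
				break;	/* real error */
			/* queue empty: every retry is another kernel/user round trip */
		}
	}

If a packet lands just after read() has failed with EAGAIN, it still takes the
return to userspace, the next read() entry and its return before the packet is
handled, i.e. the three boundary crossings mentioned above. A kernel-side busy
loop, gated by such a feature bit, could keep spinning on the receive queue
without leaving the kernel.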
>
>
>>> Ran performance tests with the following configuration:
>>>
>>> - my packet generator -> tap1 -> br0 -> tap0 -> my packet consumer
>>> - sender pinned to one core and receiver pinned to another core
>>> - sender sends small UDP packets (64 bytes total) as fast as it can
>>> - Sandy Bridge cores
>>> - throughput numbers are receiver-side goodput
>>>
>>> The results are
>>>
>>> baseline: 757k pkts/sec, cpu utilization at 1.54 cpus
>>> changed: 804k pkts/sec, cpu utilization at 1.57 cpus
>>>
>>> The performance difference is largely determined by packet rate and
>>> inter-cpu communication cost. For example, if the sender and
>>> receiver are pinned to different cpu sockets, the results are
>>>
>>> baseline: 558k pkts/sec, cpu utilization at 1.71 cpus
>>> changed: 690k pkts/sec, cpu utilization at 1.67 cpus
>> So I believe your consumer is using blocking reads. How about re-testing
>> with non-blocking reads to make sure there is no regression?
>
> I tested non-blocking reads and found no regression. However, the sender
> is the bottleneck in my case, so packet blasting is not a good test for
> non-blocking mode. I switched to RR / ping-pong type traffic through the tap.
> The packet rates for both cases are ~477k and the difference is well
> below the noise.
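(For context, an RR test over the tap path is presumably something along the
lines of the sketch below: a single outstanding request bounced back and forth,
so the rate is dominated by per-packet latency rather than throughput. The
socket setup, peer address and payload size here are assumptions, not Xi's
actual tool.)

	#include <netinet/in.h>
	#include <sys/socket.h>

	/* One round trip per iteration: send a small UDP packet, wait for the echo. */
	static void rr_loop(int sock, const struct sockaddr_in *peer, long rounds)
	{
		char pkt[18] = "x";	/* small payload; the quoted test used 64-byte packets */
		char rcv[2048];
		long i;

		for (i = 0; i < rounds; i++) {
			sendto(sock, pkt, sizeof(pkt), 0,
			       (const struct sockaddr *)peer, sizeof(*peer));
			recv(sock, rcv, sizeof(rcv), 0);	/* block until the echo comes back */
		}
	}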
>
>
>>> Co-authored-by: Eric Dumazet <edumazet@...gle.com>
>>> Signed-off-by: Xi Wang <xii@...gle.com>
>>> ---
>>> drivers/net/tun.c | 68 +++++++++++++++++++++----------------------------------
>>> 1 file changed, 26 insertions(+), 42 deletions(-)
>>>
>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>> index ee328ba..cb25385 100644
>>> --- a/drivers/net/tun.c
>>> +++ b/drivers/net/tun.c
>>> @@ -133,8 +133,7 @@ struct tap_filter {
>>> struct tun_file {
>>> struct sock sk;
>>> struct socket socket;
>>> - struct socket_wq wq;
>>> - struct tun_struct __rcu *tun;
>>> + struct tun_struct __rcu *tun ____cacheline_aligned_in_smp;
>> This seems to be an optimization unrelated to the topic. It could be sent as
>> a separate patch, but did you really see an improvement from it?
>
> There is a ~1% difference (not as reliable as the other data, since the
> difference is small). This is not a major performance contributor.
>
>
>>> struct net *net;
>>> struct fasync_struct *fasync;
>>> /* only used for fasnyc */
>>> @@ -498,12 +497,12 @@ static void tun_detach_all(struct net_device *dev)
>>> for (i = 0; i < n; i++) {
>>> tfile = rtnl_dereference(tun->tfiles[i]);
>>> BUG_ON(!tfile);
>>> - wake_up_all(&tfile->wq.wait);
>>> + tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>> RCU_INIT_POINTER(tfile->tun, NULL);
>>> --tun->numqueues;
>>> }
>>> list_for_each_entry(tfile, &tun->disabled, next) {
>>> - wake_up_all(&tfile->wq.wait);
>>> + tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>> RCU_INIT_POINTER(tfile->tun, NULL);
>>> }
>>> BUG_ON(tun->numqueues != 0);
>>> @@ -807,8 +806,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>> /* Notify and wake up reader process */
>>> if (tfile->flags & TUN_FASYNC)
>>> kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>> - wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
>>> - POLLRDNORM | POLLRDBAND);
>>> + tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>>
>>> rcu_read_unlock();
>>> return NETDEV_TX_OK;
>>> @@ -965,7 +963,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
>>>
>>> tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
>>>
>>> - poll_wait(file, &tfile->wq.wait, wait);
>>> + poll_wait(file, sk_sleep(sk), wait);
>>>
>>> if (!skb_queue_empty(&sk->sk_receive_queue))
>>> mask |= POLLIN | POLLRDNORM;
>>> @@ -1330,46 +1328,21 @@ done:
>>> static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>> const struct iovec *iv, ssize_t len, int noblock)
>>> {
>>> - DECLARE_WAITQUEUE(wait, current);
>>> struct sk_buff *skb;
>>> ssize_t ret = 0;
>>> + int peeked, err, off = 0;
>>>
>>> tun_debug(KERN_INFO, tun, "tun_do_read\n");
>>>
>>> - if (unlikely(!noblock))
>>> - add_wait_queue(&tfile->wq.wait, &wait);
>>> - while (len) {
>>> - if (unlikely(!noblock))
>>> - current->state = TASK_INTERRUPTIBLE;
>>> -
>>> - /* Read frames from the queue */
>>> - if (!(skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue))) {
>>> - if (noblock) {
>>> - ret = -EAGAIN;
>>> - break;
>>> - }
>>> - if (signal_pending(current)) {
>>> - ret = -ERESTARTSYS;
>>> - break;
>>> - }
>>> - if (tun->dev->reg_state != NETREG_REGISTERED) {
>>> - ret = -EIO;
>>> - break;
>>> - }
>>> -
>>> - /* Nothing to read, let's sleep */
>>> - schedule();
>>> - continue;
>>> - }
>>> + if (!len)
>>> + return ret;
>>>
>>> + /* Read frames from queue */
>>> + skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
>>> + &peeked, &off, &err);
>>> + if (skb) {
>> This changes the userspace ABI a little bit. Originally, userspace could
>> see different error codes and respond to them, but here it can only see zero.
>
> Thanks for catching this! It seems forwarding the &err parameter of
> __skb_recv_datagram should get most of the error code compatibility back?
Seems not; -ERESTARTSYS and -EIO were missed.
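To make it concrete, forwarding the error would look something like the sketch
below (untested, just to illustrate). It returns whatever __skb_recv_datagram
reports in err, e.g. -EAGAIN when the queue is empty in non-blocking mode, but
the old -EIO return for an unregistered device has no equivalent in the
datagram helper and would still need an explicit check:

	skb = __skb_recv_datagram(tfile->socket.sk,
				  noblock ? MSG_DONTWAIT : 0,
				  &peeked, &off, &err);
	if (!skb)
		return err;
	ret = tun_put_user(tun, tfile, skb, iv, len);
	kfree_skb(skb);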
> I'll check
> related code.
>
>
>>> ret = tun_put_user(tun, tfile, skb, iv, len);
>>> kfree_skb(skb);
>>> - break;
>>> - }
>>> -
>>> - if (unlikely(!noblock)) {
>>> - current->state = TASK_RUNNING;
>>> - remove_wait_queue(&tfile->wq.wait, &wait);
>>> }
>>>
>>> return ret;
>>> @@ -2187,20 +2160,28 @@ out:
>>> static int tun_chr_open(struct inode *inode, struct file * file)
>>> {
>>> struct tun_file *tfile;
>>> + struct socket_wq *wq;
>>>
>>> DBG1(KERN_INFO, "tunX: tun_chr_open\n");
>>>
>>> + wq = kzalloc(sizeof(*wq), GFP_KERNEL);
>>> + if (!wq)
>>> + return -ENOMEM;
>>> +
>> Why not just reuse the socket_wq structure inside the tun_file structure,
>> as we did in the past?
>
> There is no strong reason for going either way. Changing to dynamic allocation
> is based on two things: less chance of cacheline contention, and matching the
> code pattern used by the core stack.
This seems like another possible optimization unrelated to the topic; it would
be better to send it as a separate patch. But I doubt how much it will help
performance.
Checking other socket implementations such as the af_unix socket, the
socket_wq structure is also embedded in the parent socket structure.
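To spell out the two patterns (rough sketches of only the relevant lines): the
pre-patch tun code embedded the wait queue in tun_file and pointed the socket
at it, roughly

	struct tun_file {
		struct sock sk;
		struct socket socket;
		struct socket_wq wq;
		struct tun_struct __rcu *tun;
		...
	};

	/* in tun_chr_open() */
	init_waitqueue_head(&tfile->wq.wait);
	RCU_INIT_POINTER(tfile->socket.wq, &tfile->wq);

while the patch switches to a separately allocated socket_wq, as the core
stack does in sock_alloc_inode():

	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
	if (!wq)
		return -ENOMEM;
	/* presumably followed by pointing tfile->socket.wq at wq and
	 * initializing wq->wait, as net/socket.c does */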
>
>
> -Xi