Date:	Thu, 8 May 2014 11:22:46 -0700
From:	Xi Wang <xii@...gle.com>
To:	Jason Wang <jasowang@...hat.com>
Cc:	"David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
	Maxim Krasnyansky <maxk@....qualcomm.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH] net-tun: restructure tun_do_read for better sleep/wakeup efficiency

On Tue, May 6, 2014 at 8:40 PM, Jason Wang <jasowang@...hat.com> wrote:
>
> On 05/07/2014 08:24 AM, Xi Wang wrote:
> > tun_do_read always adds current thread to wait queue, even if a packet
> > is ready to read. This is inefficient because both sleeper and waker
> > want to acquire the wait queue spin lock when packet rate is high.
>
> After commit 61a5ff15ebdab87887861a6b128b108404e4706d, this will only
> help for blocking read. Looks like for performance critical userspaces,
> they will use non blocking reads.
> >
> > We restructure the read function and use common kernel networking
> > routines to handle receive, sleep and wakeup. With the change
> > available packets are checked first before the reading thread is added
> > to the wait queue.
>
> This is interesting, since it may help if we want to add rx busy loop
> for tun. (In fact I worked on a similar patch.)


Yes, this should be a good side effect, and I am also interested in trying it.
Busy polling in user space is not ideal because it does not give the lowest
latency. Besides differences in interrupt latency etc., there is a bad case for
non blocking mode: when a packet arrives right before the polling thread
returns to userspace, the control flow has to cross the kernel/userspace
boundary 3 times before the packet can be processed, while kernel
blocking or busy polling needs only 1 boundary crossing.
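
To make the bad case concrete, here is a minimal userspace sketch. It is an
illustration under assumptions, not the test program used for the numbers in
this thread: a pipe stands in for the tap fd, the function names are made up,
and the "packet arrives right before return" timing is simulated by writing
just after the first empty read() comes back. It counts read() calls on the
consumer side. The busy-poll consumer needs two: the return from the empty
read, the entry into the next read and its return are the 3 crossings after
the packet arrived. The blocking consumer is already asleep in the kernel and
needs only the single return.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Userspace busy polling on a non-blocking fd (a pipe stands in for tap).
 * Returns the number of read() calls needed to get the packet, -1 on error. */
int busy_poll_reads(void)
{
    int fds[2], reads = 0, sent = 0;
    char buf[64];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    for (;;) {
        n = read(fds[0], buf, sizeof(buf));
        reads++;
        if (n > 0) {
            assert(n == 3);         /* the whole simulated packet */
            break;
        }
        if (n == 0 || errno != EAGAIN) {
            reads = -1;             /* unexpected EOF or error */
            break;
        }
        /* Back in userspace empty-handed; the packet "arrives" now. */
        if (!sent) {
            if (write(fds[1], "pkt", 3) != 3) {
                reads = -1;
                break;
            }
            sent = 1;
        }
    }
    close(fds[0]);
    close(fds[1]);
    return reads;
}

/* Blocking consumer: sleeps inside read() until a child process sends.
 * Returns the number of read() calls needed, -1 on error. */
int blocking_reads(void)
{
    int fds[2];
    char buf[64];
    struct timespec delay = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    pid_t pid;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: the simulated sender */
        nanosleep(&delay, NULL);
        n = write(fds[1], "pkt", 3);
        _exit(n == 3 ? 0 : 1);
    }
    close(fds[1]);                  /* parent only reads */
    n = read(fds[0], buf, sizeof(buf));
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 3 ? 1 : -1;
}
```

Calling busy_poll_reads() and blocking_reads() from a small main() shows the
extra round trip in the non-blocking case. The sketch only counts calls and
says nothing about the actual latency of either path.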


>
> >
> > Ran performance tests with the following configuration:
> >
> >  - my packet generator -> tap1 -> br0 -> tap0 -> my packet consumer
> >  - sender pinned to one core and receiver pinned to another core
> >  - sender sends small UDP packets (64 bytes total) as fast as it can
> >  - sandy bridge cores
> >  - throughput numbers are receiver side goodput
> >
> > The results are
> >
> > baseline: 757k pkts/sec, cpu utilization at 1.54 cpus
> >  changed: 804k pkts/sec, cpu utilization at 1.57 cpus
> >
> > The performance difference is largely determined by packet rate and
> > inter-cpu communication cost. For example, if the sender and
> > receiver are pinned to different cpu sockets, the results are
> >
> > baseline: 558k pkts/sec, cpu utilization at 1.71 cpus
> >  changed: 690k pkts/sec, cpu utilization at 1.67 cpus
>
> So I believe your consumer is using blocking reads. How about re-testing
> with non blocking reads to make sure there is no regression?


I tested non blocking reads and found no regression. However, the sender
is the bottleneck in my setup, so packet blasting is not a good test for
non blocking mode. I switched to RR / ping pong type traffic through tap.
The packet rate in both cases is ~477k pkts/sec and the difference is well
below the noise.


>
> >
> > Co-authored-by: Eric Dumazet <edumazet@...gle.com>
> > Signed-off-by: Xi Wang <xii@...gle.com>
> > ---
> >  drivers/net/tun.c | 68 +++++++++++++++++++++----------------------------------
> >  1 file changed, 26 insertions(+), 42 deletions(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index ee328ba..cb25385 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -133,8 +133,7 @@ struct tap_filter {
> >  struct tun_file {
> >       struct sock sk;
> >       struct socket socket;
> > -     struct socket_wq wq;
> > -     struct tun_struct __rcu *tun;
> > +     struct tun_struct __rcu *tun ____cacheline_aligned_in_smp;
>
> This seems an optimization which is unrelated to the topic. It may be sent as
> another patch, but did you really see an improvement from this?


There is a ~1% difference (less reliable than the other data, since the
difference is small). This is not a major performance contributor.
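
For readers unfamiliar with the annotation, the effect on struct layout can be
sketched in plain C11. This is only an analogy under assumptions: the 64-byte
line size, the struct names and the placeholder field sizes are all made up,
and alignas() is used here as a userspace stand-in for what
____cacheline_aligned_in_smp does on SMP builds. The aligned member is pushed
to the start of its own cache line, so writes to the fields before it do not
share a line with it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE_BYTES 64          /* assumed line size, not queried */

/* Placeholder layouts; the char arrays stand in for the real members. */
struct file_plain {
    char sk[100];
    char socket[40];
    void *tun;                      /* lands wherever the previous fields end */
};

struct file_aligned {
    char sk[100];
    char socket[40];
    alignas(CACHELINE_BYTES) void *tun;     /* starts a fresh cache line */
};

/* Compile-time check that the aligned member sits on a line boundary. */
static_assert(offsetof(struct file_aligned, tun) % CACHELINE_BYTES == 0,
              "tun must start on a cache line boundary");

size_t tun_offset_plain(void)
{
    return offsetof(struct file_plain, tun);
}

size_t tun_offset_aligned(void)
{
    return offsetof(struct file_aligned, tun);
}
```

The price is padding: the aligned struct is larger, which is one reason a
change like this is worth measuring separately.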


>
> >       struct net *net;
> >       struct fasync_struct *fasync;
> >       /* only used for fasnyc */
> > @@ -498,12 +497,12 @@ static void tun_detach_all(struct net_device *dev)
> >       for (i = 0; i < n; i++) {
> >               tfile = rtnl_dereference(tun->tfiles[i]);
> >               BUG_ON(!tfile);
> > -             wake_up_all(&tfile->wq.wait);
> > +             tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> >               RCU_INIT_POINTER(tfile->tun, NULL);
> >               --tun->numqueues;
> >       }
> >       list_for_each_entry(tfile, &tun->disabled, next) {
> > -             wake_up_all(&tfile->wq.wait);
> > +             tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> >               RCU_INIT_POINTER(tfile->tun, NULL);
> >       }
> >       BUG_ON(tun->numqueues != 0);
> > @@ -807,8 +806,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >       /* Notify and wake up reader process */
> >       if (tfile->flags & TUN_FASYNC)
> >               kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
> > -     wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
> > -                                POLLRDNORM | POLLRDBAND);
> > +     tfile->socket.sk->sk_data_ready(tfile->socket.sk);
> >
> >       rcu_read_unlock();
> >       return NETDEV_TX_OK;
> > @@ -965,7 +963,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait)
> >
> >       tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
> >
> > -     poll_wait(file, &tfile->wq.wait, wait);
> > +     poll_wait(file, sk_sleep(sk), wait);
> >
> >       if (!skb_queue_empty(&sk->sk_receive_queue))
> >               mask |= POLLIN | POLLRDNORM;
> > @@ -1330,46 +1328,21 @@ done:
> >  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> >                          const struct iovec *iv, ssize_t len, int noblock)
> >  {
> > -     DECLARE_WAITQUEUE(wait, current);
> >       struct sk_buff *skb;
> >       ssize_t ret = 0;
> > +     int peeked, err, off = 0;
> >
> >       tun_debug(KERN_INFO, tun, "tun_do_read\n");
> >
> > -     if (unlikely(!noblock))
> > -             add_wait_queue(&tfile->wq.wait, &wait);
> > -     while (len) {
> > -             if (unlikely(!noblock))
> > -                     current->state = TASK_INTERRUPTIBLE;
> > -
> > -             /* Read frames from the queue */
> > -             if (!(skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue))) {
> > -                     if (noblock) {
> > -                             ret = -EAGAIN;
> > -                             break;
> > -                     }
> > -                     if (signal_pending(current)) {
> > -                             ret = -ERESTARTSYS;
> > -                             break;
> > -                     }
> > -                     if (tun->dev->reg_state != NETREG_REGISTERED) {
> > -                             ret = -EIO;
> > -                             break;
> > -                     }
> > -
> > -                     /* Nothing to read, let's sleep */
> > -                     schedule();
> > -                     continue;
> > -             }
> > +     if (!len)
> > +             return ret;
> >
> > +     /* Read frames from queue */
> > +     skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
> > +                               &peeked, &off, &err);
> > +     if (skb) {
>
> This changes the userspace ABI a little bit. Originally, userspace could
> see different error codes and respond accordingly, but here it can only see zero.


Thanks for catching this! Forwarding the &err value from __skb_recv_datagram
should restore most of the error code compatibility. I'll check the related
code.
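
The idea can be sketched with a small userspace model. Everything here is
hypothetical and is not the kernel code or the next version of the patch:
mock_recv() merely imitates the "return NULL and set *err" convention, and the
choice of -EINTR for the blocking case is a placeholder for whatever the real
routine reports. The first function mirrors the hunk as posted, where an empty
result leaves ret at 0. The second forwards err instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mock_skb {
    long len;                       /* payload length of the fake packet */
};

/* Hypothetical stand-in for the receive helper: hands back the queued
 * packet, or NULL with *err set to a negative errno value. */
static struct mock_skb *mock_recv(struct mock_skb *queued, int noblock, int *err)
{
    assert(err != NULL);
    if (queued)
        return queued;
    *err = noblock ? -EAGAIN : -EINTR;      /* placeholder error choice */
    return NULL;
}

/* Mirrors the posted hunk: with no packet, the caller sees 0. */
long do_read_as_posted(struct mock_skb *queued, int noblock)
{
    long ret = 0;
    int err = 0;
    struct mock_skb *skb = mock_recv(queued, noblock, &err);

    if (skb)
        ret = skb->len;
    return ret;
}

/* Forwards the error from the receive helper when no packet was returned. */
long do_read_forwarding_err(struct mock_skb *queued, int noblock)
{
    long ret;
    int err = 0;
    struct mock_skb *skb = mock_recv(queued, noblock, &err);

    if (skb)
        ret = skb->len;
    else
        ret = err;
    return ret;
}
```

With an empty queue and noblock set, the first variant returns 0 and the
second returns -EAGAIN, which is what a non blocking reader saw before the
patch. Whether the other original codes (-ERESTARTSYS, -EIO) come back the
same way is exactly what still needs checking.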


>
> >               ret = tun_put_user(tun, tfile, skb, iv, len);
> >               kfree_skb(skb);
> > -             break;
> > -     }
> > -
> > -     if (unlikely(!noblock)) {
> > -             current->state = TASK_RUNNING;
> > -             remove_wait_queue(&tfile->wq.wait, &wait);
> >       }
> >
> >       return ret;
> > @@ -2187,20 +2160,28 @@ out:
> >  static int tun_chr_open(struct inode *inode, struct file * file)
> >  {
> >       struct tun_file *tfile;
> > +     struct socket_wq *wq;
> >
> >       DBG1(KERN_INFO, "tunX: tun_chr_open\n");
> >
> > +     wq = kzalloc(sizeof(*wq), GFP_KERNEL);
> > +     if (!wq)
> > +             return -ENOMEM;
> > +
>
> Why not just reuse the socket_wq structure inside the tun_file structure
> like what we did in the past?


There is no strong reason to go either way. I changed to dynamic allocation
for two reasons: less chance of cacheline contention, and keeping the code
pattern in sync with the core stack.


-Xi