lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <53957C85.4040700@pobox.com>
Date:	Mon, 09 Jun 2014 10:21:09 +0100
From:	Brian Candler <b.candler@...ox.com>
To:	linux-kernel@...r.kernel.org
Subject: tap interface apparently rate-limited to 1 packet per 12ms

[Aside: I posted this to netdev, no reply. I apologise if you have seen 
this before]

I have a performance problem which I've narrowed down to some bizarre 
behaviour of tap interfaces, and I've made an easily reproducible test 
case.

It seems that no matter how fast you queue packets for a tap interface, 
they are only being forwarded at one packet every 10 to 12 ms, limiting 
you to around 80-100 packets per second. This is as shown by tcpdump on 
the tap interface - not even on the userland process listening on the tap.

This also happens with tun interfaces, which I'd have thought would 
affect a lot of VPN software, so quite possibly there's something under 
the surface which I don't understand.

To reproduce, open three root windows.

(1)

gcc -Wall -o use_tap use_tap.c     # see source code below
./use_tap # just creates tap interface and keeps it open

(2)

tcpdump -i tap99 -nn

(3)

ifconfig tap99 1.1.1.2/30 up
arp -i tap99 -s 1.1.1.1 11:11:11:11:11:11
ping -fq -c20 1.1.1.1

You will see on the tcpdump screen something like this:

# tcpdump -i tap99 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap99, link-type EN10MB (Ethernet), capture size 65535 bytes
12:17:36.157450 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 1, 
length 64
12:17:36.169374 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 2, 
length 64
12:17:36.181378 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 3, 
length 64
12:17:36.193373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 4, 
length 64
12:17:36.205373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 5, 
length 64
12:17:36.217381 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 6, 
length 64
12:17:36.229373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 7, 
length 64
12:17:36.241372 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 8, 
length 64
12:17:36.253373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 9, 
length 64
12:17:36.265373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
10, length 64
12:17:36.277373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
11, length 64
12:17:36.289372 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
12, length 64
12:17:36.301373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
13, length 64
12:17:36.313373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
14, length 64
12:17:36.325372 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
15, length 64
12:17:36.337372 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
16, length 64
12:17:36.349373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
17, length 64
12:17:36.361373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
18, length 64
12:17:36.373374 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
19, length 64
12:17:36.385373 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 6417, seq 
20, length 64

Look at the timetamps of the packets. They are 0.012 seconds apart exactly.

The above captures were made on an Intel DN2820 NUC running Debian 
Wheezy with kernel 3.2.0-4-amd64

I can find no evidence of configured rate limiting:

# tc qdisc show dev tap99
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 
1 1 1 1 1

Now: I'm only using ping -fq as a simple way to generate packets. You 
can easily demonstrate to yourself that this is not the source of the 
delay (e.g. by sending the same ping -fq command to a random device on 
your network, and looking at the timestamps of the outgoing pings; 
expect to see 10 or more pings within the same millisecond).

The real-world problem I am trying to fix is TCP throughput with 
dynamips. Here's the setup:

<KVM vnet0> ---- <br1> --- <tap1 DYNAMIPS1 tap11> --- <br-lan> --- host
<KVM vnet1> ---'

Here the platform is a Mac Mini i7 running Ubuntu 12.04.4 with kernel 
3.8.0-41, and dynamips 0.2.12.

When using wget on a KVM instance to fetch a large file form the host, 
the total TCP throughput I was observing through the dynamips router was 
around 130KB/s. Furthermore, if I ran a wget on the second KVM at the 
same time, then the total bandwidth was divided between them. But if I 
started another download from another KVM via another dynamips instance 
(br2/tap2/tap12) then it got its own independent 130KB/s without 
affecting the ones behind br1. CPU usage of dynamips was moderate (about 
half a core of this four-core box).

Transfers between two KVM instances both locally attached to br1 are 
very fast.

By comparing tcpdump on br-lan with tcpdump on tap11 (both taken at the 
same time on the same host), I could see that the host was sending a 
window-full of TCP segments in quick succession, but they were being 
drip-fed out of tap11.

Here's what I saw on br-lan:

# tcpdump -r br-lan.pcap -nn | head -20
reading from file br-lan.pcap, link-type EN10MB (Ethernet)
10:24:01.446755 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [S], seq 
1242930289, win 14600, options [mss 1460,sackOK,TS val 1325806 ecr 
0,nop,wscale 3], length 0
10:24:01.446805 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [S.], seq 
976209917, ack 1242930290, win 14480, options [mss 1460,sackOK,TS val 
2606216 ecr 1325806,nop,wscale 7], length 0
10:24:01.466969 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 1, 
win 1825, options [nop,nop,TS val 1325819 ecr 2606216], length 0
10:24:01.497357 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [P.], seq 
1:150, ack 1, win 1825, options [nop,nop,TS val 1325819 ecr 2606216], 
length 149
10:24:01.497395 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], ack 150, 
win 122, options [nop,nop,TS val 2606229 ecr 1325819], length 0
10:24:01.497705 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
1:1449, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 1325819], 
length 1448
10:24:01.497720 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
1449:2897, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.497726 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
2897:4345, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.497732 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
4345:5793, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.497737 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
5793:7241, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.497743 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
7241:8689, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.527775 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
1449, win 2187, options [nop,nop,TS val 1325834 ecr 2606229], length 0
10:24:01.527827 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
8689:10137, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1448
10:24:01.527839 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
10137:11585, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1448
10:24:01.527847 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [P.], seq 
11585:12596, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1011
10:24:01.527856 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
12596:14044, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1448
10:24:01.527862 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
14044:15492, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1448
10:24:01.629571 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
2897, win 2549, options [nop,nop,TS val 1325837 ecr 2606229], length 0
10:24:01.629593 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [P.], seq 
15492:16940, ack 150, win 122, options [nop,nop,TS val 2606262 ecr 
1325837], length 1448
10:24:01.639742 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
4345, win 2911, options [nop,nop,TS val 1325839 ecr 2606229], length 0

(notice burst of packets with timestamp 10:24:01.497xxx). 10.10.1.1 is 
the KVM VM, and 10.10.0.241 is the host's own address on br-lan.

And on tap1:

root@...1:~# tcpdump -r tap1.pcap -nn | head -20
reading from file tap1.pcap, link-type EN10MB (Ethernet)
10:24:01.406738 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [S], seq 
1242930289, win 14600, options [mss 1460,sackOK,TS val 1325806 ecr 
0,nop,wscale 3], length 0
10:24:01.456885 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [S.], seq 
976209917, ack 1242930290, win 14480, options [mss 1460,sackOK,TS val 
2606216 ecr 1325806,nop,wscale 7], length 0
10:24:01.457031 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 1, 
win 1825, options [nop,nop,TS val 1325819 ecr 2606216], length 0
10:24:01.457155 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [P.], seq 
1:150, ack 1, win 1825, options [nop,nop,TS val 1325819 ecr 2606216], 
length 149
10:24:01.507498 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], ack 150, 
win 122, options [nop,nop,TS val 2606229 ecr 1325819], length 0
10:24:01.517653 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
1:1449, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 1325819], 
length 1448
10:24:01.517794 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
1449, win 2187, options [nop,nop,TS val 1325834 ecr 2606229], length 0
10:24:01.527756 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
1449:2897, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.527866 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
2897, win 2549, options [nop,nop,TS val 1325837 ecr 2606229], length 0
10:24:01.537982 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
2897:4345, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.538089 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
4345, win 2911, options [nop,nop,TS val 1325839 ecr 2606229], length 0
10:24:01.548166 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
4345:5793, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.548363 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
5793, win 3273, options [nop,nop,TS val 1325842 ecr 2606229], length 0
10:24:01.558361 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
5793:7241, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.558455 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
7241, win 3635, options [nop,nop,TS val 1325844 ecr 2606229], length 0
10:24:01.568493 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
7241:8689, ack 150, win 122, options [nop,nop,TS val 2606229 ecr 
1325819], length 1448
10:24:01.568654 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
8689, win 3997, options [nop,nop,TS val 1325847 ecr 2606229], length 0
10:24:01.588911 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
8689:10137, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1448
10:24:01.589087 IP 10.10.1.1.38590 > 10.10.0.241.80: Flags [.], ack 
10137, win 4359, options [nop,nop,TS val 1325852 ecr 2606236], length 0
10:24:01.599135 IP 10.10.0.241.80 > 10.10.1.1.38590: Flags [.], seq 
10137:11585, ack 150, win 122, options [nop,nop,TS val 2606236 ecr 
1325834], length 1448

These appear to be at intervals of about 10.2 ms, or 98 packets per 
second. tcpdump on the KVM VM also saw the packets arrive at these 
intervals.

So with a segment size of 1448 bytes, and limited to 98 packets per 
second, this gives a theoretical throughput of 142KB/sec, close to what 
I measured.

I have one more machine I've tried this on: an older AMD Athlon X2 but 
running Ubuntu 14.04 with kernel 3.13.0-27-generic. Using the 
reproducing test case given at the start of this mail, I see:

# tcpdump -i tap99 -nn
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap99, link-type EN10MB (Ethernet), capture size 65535 bytes
13:28:00.381053 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 1, 
length 64
13:28:00.402796 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 2, 
length 64
13:28:00.414789 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 3, 
length 64
13:28:00.426785 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 4, 
length 64
13:28:00.438777 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 5, 
length 64
13:28:00.450781 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 6, 
length 64
13:28:00.462781 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 7, 
length 64
13:28:00.474783 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 8, 
length 64
13:28:00.486768 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 9, 
length 64
13:28:00.498768 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
10, length 64
13:28:00.510766 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
11, length 64
13:28:00.522767 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
12, length 64
13:28:00.534767 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
13, length 64
13:28:00.546766 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
14, length 64
13:28:00.558766 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
15, length 64
13:28:00.570766 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
16, length 64
13:28:00.582767 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
17, length 64
13:28:00.594767 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
18, length 64
13:28:00.606768 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
19, length 64
13:28:00.618766 IP 1.1.1.2 > 1.1.1.1: ICMP echo request, id 5141, seq 
20, length 64

These all appear to be at intervals of 12ms too.

I'm stumped as to what could be the cause of this - does anyone have any 
suggestions?

Thanks,

Brian.

----- 8< -----
/* http://backreference.org/2010/03/26/tuntap-interface-tutorial/ */

#include <net/if.h>
#include <linux/if_tun.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#include <stdio.h>
#include <unistd.h>

int tun_alloc(char *dev, int flags) {

   struct ifreq ifr;
   int fd, err;
   char *clonedev = "/dev/net/tun";

   /* Arguments taken by the function:
    *
    * char *dev: the name of an interface (or '\0'). MUST have enough
    *   space to hold the interface name if '\0' is passed
    * int flags: interface flags (eg, IFF_TUN etc.)
    */

    /* open the clone device */
    if( (fd = open(clonedev, O_RDWR)) < 0 ) {
      return fd;
    }

    /* preparation of the struct ifr, of type "struct ifreq" */
    memset(&ifr, 0, sizeof(ifr));

    ifr.ifr_flags = flags;   /* IFF_TUN or IFF_TAP, plus maybe IFF_NO_PI */

    if (*dev) {
      /* if a device name was specified, put it in the structure; otherwise,
       * the kernel will try to allocate the "next" device of the
       * specified type */
      strncpy(ifr.ifr_name, dev, IFNAMSIZ);
    }

    /* try to create the device */
    if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ) {
      close(fd);
      return err;
    }

   /* if the operation was successful, write back the name of the
    * interface to the variable "dev", so the caller can know
    * it. Note that the caller MUST reserve space in *dev (see calling
    * code below) */
   strcpy(dev, ifr.ifr_name);

   /* this is the special file descriptor that the caller will use to talk
    * with the virtual interface */
   return fd;
}

int main(void) {
   char buf[] = "tap99";
   int fd = tun_alloc(buf, IFF_TAP);
   if (fd < 0) { perror("tun_alloc"); return 1; }
   fprintf(stderr,"%s open\n", buf);
   while(1) { sleep(3600); }
   return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ