Date:	Mon, 25 May 2015 12:05:33 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	"John A. Sullivan III" <jsullivan@...nsourcedevel.com>
Cc:	netdev@...r.kernel.org
Subject: Re: TCP window auto-tuning sub-optimal in GRE tunnel

On Mon, 2015-05-25 at 14:49 -0400, John A. Sullivan III wrote:
> On Mon, 2015-05-25 at 09:58 -0700, Eric Dumazet wrote:
> > On Mon, 2015-05-25 at 11:42 -0400, John A. Sullivan III wrote:
> > > Hello, all.  I hope this is the correct list for this question.  We are
> > > having serious problems on high BDP networks using GRE tunnels.  Our
> > > traces show it to be a TCP Window problem.  When we test without GRE,
> > > throughput is wire speed and traces show the window size to be 16MB
> > > which is what we configured for r/wmem_max and tcp_r/wmem.  When we
> > > switch to GRE, we see over a 90% drop in throughput and the TCP window
> > > size seems to peak at around 500K.
> > > 
> > > What causes this and how can we get the GRE tunnels to use the max
> > > window size? Thanks - John
> > 
> > Hi John
> > 
> > Is it for a single flow or multiple ones? Which kernel versions are on the
> > sender and receiver? What is the nominal speed of non-GRE traffic?
> > 
> > What is the brand/model of the receiving NIC? Is GRO enabled?
> > 
> > It is possible the receiver window is impacted because GRE encapsulation
> > makes the skb->len/skb->truesize ratio a bit smaller, but not by 90%.
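> > 
> > A rough, illustrative sketch of that ratio effect (the ~2.3 KB per-skb
> > truesize here is an assumption; the real value is driver dependent):
> > 
> > # compare TCP payload / truesize with and without GRE (assumed numbers)
> > TRUESIZE = 2304                    # assumed buffer + skb overhead per frame
> > for name, payload in (("plain", 1460), ("gre", 1436)):
> >     print(name, payload / TRUESIZE)
> > # the ratio drops by under 2%, nowhere near enough to explain a 90% loss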
> > 
> > I suspect a more trivial issue, like the receiver being overwhelmed by the
> > extra load of GRE encapsulation.
> > 
> > 1) Non GRE session
> > 
> > lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H lpaa24 -Cc -t OMNI
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
> > tcpi_rtt 70 tcpi_rttvar 7 tcpi_snd_ssthresh 221 tpci_snd_cwnd 258
> > tcpi_reordering 3 tcpi_total_retrans 711
> > Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
> > Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
> > Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
> > Final       Final                                             %     Method %      Method                          
> > 1912320     6291456     16384  10.00   22386.89   10^6bits/s  1.20  S      2.60   S      0.211   0.456   usec/KB  
> > 
> > 2) GRE session
> > 
> > lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 7.7.7.24 -Cc -t OMNI
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.24 () port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 29200
> > tcpi_rtt 76 tcpi_rttvar 7 tcpi_snd_ssthresh 176 tpci_snd_cwnd 249
> > tcpi_reordering 3 tcpi_total_retrans 819
> > Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
> > Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
> > Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
> > Final       Final                                             %     Method %      Method                          
> > 1815552     6291456     16384  10.00   22420.88   10^6bits/s  1.01  S      3.44   S      0.177   0.603   usec/KB  
> > 
> > 
> 
> Thanks, Eric. It really looks like a windowing issue but here is the
> relevant information:
> We are measuring single flows.  One side is an Intel GbE NIC connected
> to a 1 Gbps CIR Internet connection. The other side is an Intel 10 GbE
> NIC connected to a 40 Gbps Internet connection.  RTT is ~80 ms.
> 
> The numbers I will post below are from a duplicated setup in our test
> lab where the systems are connected by GbE links with a netem router in
> the middle to introduce the latency.  We are not varying the latency to
> ensure we eliminate packet re-ordering from the mix.
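> 
> For reference, a back-of-envelope bandwidth-delay product for this path,
> using the numbers above (1 Gbps bottleneck, ~80 ms RTT):
> 
> link_bps = 1e9                   # GbE bottleneck link
> rtt_s = 0.080                    # ~80 ms added by netem
> print(link_bps * rtt_s / 8)      # ~10 MB must be in flight for line rate
> 
> so the 16 MB window we see without GRE is comfortably large enough.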
> 
> We are measuring a single flow.  Here are the non-GRE numbers:
> root@...q-1:~# nuttcp -T 60 -i 10 192.168.224.2
>   666.3125 MB /  10.00 sec =  558.9370 Mbps     0 retrans
>  1122.2500 MB /  10.00 sec =  941.4151 Mbps     0 retrans
>   720.8750 MB /  10.00 sec =  604.7129 Mbps     0 retrans
>  1122.3125 MB /  10.00 sec =  941.4622 Mbps     0 retrans
>  1122.2500 MB /  10.00 sec =  941.4101 Mbps     0 retrans
>  1122.3125 MB /  10.00 sec =  941.4668 Mbps     0 retrans
> 
>  5888.5000 MB /  60.19 sec =  820.6857 Mbps 4 %TX 13 %RX 0 retrans 80.28 msRTT
> 
> For some reason, nuttcp does not show retransmissions in our environment
> even when they do exist.
> 
> gro is active on the send side:
> root@...q-1:~# ethtool -k eth0
> Features for eth0:
> rx-checksumming: on
> tx-checksumming: on
>         tx-checksum-ipv4: on
>         tx-checksum-unneeded: off [fixed]
>         tx-checksum-ip-generic: off [fixed]
>         tx-checksum-ipv6: on
>         tx-checksum-fcoe-crc: off [fixed]
>         tx-checksum-sctp: on
> scatter-gather: on
>         tx-scatter-gather: on
>         tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
>         tx-tcp-segmentation: on
>         tx-tcp-ecn-segmentation: off [fixed]
>         tx-tcp6-segmentation: on
> udp-fragmentation-offload: off [fixed]
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off [fixed]
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off [fixed]
> receive-hashing: on
> highdma: on [fixed]
> rx-vlan-filter: on [fixed]
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: off [fixed]
> fcoe-mtu: off [fixed]
> tx-nocache-copy: on
> loopback: off [fixed]
> 
> and on the receive side:
> root@...tgwingest-1:~# ethtool -k eth5
> Offload parameters for eth5:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off
> rx-vlan-offload: on
> tx-vlan-offload: on
> ntuple-filters: off
> receive-hashing: on
> 
> The CPU is also lightly utilized.  These are fairly high powered
> gateways.  We have measured 16 Gbps throughput on them with no strain at
> all. Checking individual CPUs, we occasionally see one become about half
> occupied with software interrupts.  
> 
> gro is also active on the intermediate netem Linux router.
> lro is disabled.  I gather there is a bug in the ixgbe driver which can
> cause this kind of problem if both gro and lro are enabled.
> 
> Here are the GRE numbers:
> root@...q-1:~# nuttcp -T 60 -i 10 192.168.126.1
>    21.4375 MB /  10.00 sec =   17.9830 Mbps     0 retrans
>    23.2500 MB /  10.00 sec =   19.5035 Mbps     0 retrans
>    23.3125 MB /  10.00 sec =   19.5559 Mbps     0 retrans
>    23.3750 MB /  10.00 sec =   19.6084 Mbps     0 retrans
>    23.2500 MB /  10.00 sec =   19.5035 Mbps     0 retrans
>    23.3125 MB /  10.00 sec =   19.5560 Mbps     0 retrans
> 
>   138.0000 MB /  60.09 sec =   19.2650 Mbps 9 %TX 6 %RX 0 retrans 80.33 msRTT
> 
> 
> Here is top output during GRE testing on the receive side (which is much
> lower powered than the send side):
> 
> top - 14:37:29 up 200 days, 17:03,  1 user,  load average: 0.21, 0.22, 0.17
> Tasks: 186 total,   1 running, 185 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  2.4%sy,  0.0%ni, 93.6%id,  0.0%wa,  0.0%hi,  4.0%si,  0.0%st
> Cpu1  :  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu8  :  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu13 :  0.1%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  24681616k total,  1633712k used, 23047904k free,   175016k buffers
> Swap: 25154556k total,        0k used, 25154556k free,  1084648k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 27014 nobody    20   0  6496  912  708 S    6  0.0   0:02.26 nuttcp
>     4 root      20   0     0    0    0 S    0  0.0 101:53.42 kworker/0:0
>    10 root      20   0     0    0    0 S    0  0.0   1020:04 rcu_sched
>    99 root      20   0     0    0    0 S    0  0.0  11:00.02 kworker/1:1
>   102 root      20   0     0    0    0 S    0  0.0  26:01.67 kworker/4:1
>   113 root      20   0     0    0    0 S    0  0.0  24:46.28 kworker/15:1
> 18321 root      20   0  8564 4516  248 S    0  0.0  80:10.20 haveged
> 27016 root      20   0 17440 1396  984 R    0  0.0   0:00.03 top
>     1 root      20   0 24336 2320 1348 S    0  0.0   0:01.39 init
>     2 root      20   0     0    0    0 S    0  0.0   0:00.20 kthreadd
>     3 root      20   0     0    0    0 S    0  0.0 217:16.78 ksoftirqd/0
>     5 root       0 -20     0    0    0 S    0  0.0   0:00.00 kworker/0:0H
> 
> A second nuttcp test shows the same but this time we took a tcpdump of
> the traffic:
> root@...q-1:~# nuttcp -T 60 -i 10 192.168.126.1
>    21.2500 MB /  10.00 sec =   17.8258 Mbps     0 retrans
>    23.2500 MB /  10.00 sec =   19.5035 Mbps     0 retrans
>    23.3750 MB /  10.00 sec =   19.6084 Mbps     0 retrans
>    23.2500 MB /  10.00 sec =   19.5035 Mbps     0 retrans
>    23.3125 MB /  10.00 sec =   19.5560 Mbps     0 retrans
>    23.3750 MB /  10.00 sec =   19.6083 Mbps     0 retrans
> 
>   137.8125 MB /  60.07 sec =   19.2449 Mbps 8 %TX 6 %RX 0 retrans 80.31 msRTT
> 
> MSS is 1436
> Window Scale is 10
> Window size tops out at 545, i.e. 545 << 10 = 558080 bytes
> Hmm . . . I would think if I could send 558080 bytes every 0.080s, that
> would be about 56 Mbps and not 19.5.
> ip -s -s link ls shows no errors on either side.
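> 
> As a sanity check on those numbers (a simple window-limited model, nothing
> more):
> 
> rtt_s = 0.080
> window = 545 << 10                 # advertised window with scale 10 = 558080
> print(window * 8 / rtt_s / 1e6)    # ~55.8 Mbps ceiling if the window stayed full
> print(19.5e6 * rtt_s / 8)          # ~195 KB actually in flight at the measured rate
> 
> so even the small advertised window is apparently not being kept full.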
> 
> I rebooted the receiving side to reset netstat error counters and reran
> the test with the same results.  Nothing jumped out at me in netstat -s:
> 
> TcpExt:
>     1 invalid SYN cookies received
>     1 TCP sockets finished time wait in fast timer
>     187 delayed acks sent
>     2 delayed acks further delayed because of locked socket
>     47592 packets directly queued to recvmsg prequeue.
>     48473682 bytes directly in process context from backlog
>     90710698 bytes directly received in process context from prequeue
>     3085 packet headers predicted
>     88907 packets header predicted and directly queued to user
>     21 acknowledgments not containing data payload received
>     201 predicted acknowledgments
>     3 times receiver scheduled too late for direct processing
>     TCPRcvCoalesce: 677
> 
> Why is my window size so small?
> Here are the receive side settings:
> 
> # increase TCP max buffer size settable using setsockopt()
> net.core.rmem_default = 268800
> net.core.wmem_default = 262144
> net.core.rmem_max = 33564160
> net.core.wmem_max = 33554432
> net.ipv4.tcp_rmem = 8960 89600 33564160
> net.ipv4.tcp_wmem = 4096 65536 33554432
> net.ipv4.tcp_mtu_probing=1
> 
> and here are the transmit side settings:
> # increase TCP max buffer size settable using setsockopt()
>   net.core.rmem_default = 268800
>   net.core.wmem_default = 262144
>   net.core.rmem_max = 33564160
>   net.core.wmem_max = 33554432
>   net.ipv4.tcp_rmem = 8960 89600 33564160
>   net.ipv4.tcp_wmem = 4096 65536 33554432
>   net.ipv4.tcp_mtu_probing=1
>   net.core.netdev_max_backlog = 3000
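> 
> As a sketch of how these values map to the ~16 MB window we see without GRE
> (assuming tcp_adv_win_scale = 1; this mirrors the kernel's
> tcp_win_from_space() calculation, so adjust if your kernels default to 2):
> 
> rmem_max = 33564160
> adv_win_scale = 1                              # assumed sysctl default
> print(rmem_max - (rmem_max >> adv_win_scale))  # ~16 MB max advertised window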
> 
> 
> Oh, kernel versions:
> sender: root@...q-1:~# uname -a
> Linux gwhq-1 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u1 x86_64 GNU/Linux
> 
> receiver:
> root@...tgwingest-1:/etc# uname -a
> Linux testgwingest-1 3.8.0-38-generic #56~precise1-Ubuntu SMP Thu Mar 13 16:22:48 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> 
> Thanks - John

Nothing here seems to give a hint.

Could you post the netem setup, and maybe the full "tc -s qdisc" output for
this netem host?


Also, you could run nstat at the sender like this (the first nstat just sets
the baseline, so the second one shows only the counters that changed during
the nuttcp run), and we might get some clue:

nstat >/dev/null
nuttcp -T 60 -i 10 192.168.126.1
nstat



