Date:	Fri, 16 Oct 2015 10:14:47 +0200
From:	Cristian RUIZ <cristian.ruiz@...ia.fr>
To:	Neal Cardwell <ncardwell@...gle.com>,
	Eric Dumazet <edumazet@...gle.com>,
	"David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org
CC:	Lucas Nussbaum <lucas.nussbaum@...ia.fr>,
	Emmanuel Jeanvoine <emmanuel.jeanvoine@...ia.fr>
Subject: Performance regression with netns+linux-bridge+veth since Linux 4.0

Hello,

While evaluating the performance degradation of running HPC
applications inside network namespaces, I found that all Linux kernel
versions since 4.0 introduce a high overhead for some applications.
I compare the execution time of a benchmark (details below) in a
"native" execution (without netns) with its execution inside netns.
For Linux 3.16, the measured overhead is 42%. For Linux 3.19, it is 8%.
For Linux 4.0 and later, the measured overhead jumps to 300%.
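
To be explicit, the overhead figures are the relative increase in
execution time, e.g. (the values below are illustrative, not my
measurements):

# overhead in percent, given native and netns run times in seconds
T_NATIVE=10; T_NETNS=40
echo $(( (T_NETNS - T_NATIVE) * 100 / T_NATIVE ))   # -> 300
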
Using git bisect, I tracked this down to commit 9949afa4 ("tcp: fix 
tcp_cong_avoid_ai() credit accumulation bug with decreases in w").
Reverting commit 9949afa4 on top of Linux 4.2 restores the correct 
(Linux 3.19) behavior.
However, I don't understand why commit 9949afa4 causes such issues. Any 
ideas?
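
For reference, the revert test amounts to roughly the following
(a sketch, not my exact build script):

git checkout v4.2
git revert 9949afa4   # "tcp: fix tcp_cong_avoid_ai() credit accumulation bug ..."
# rebuild with the same .config as the attached ones, install, reboot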

The network setup I used is the following:

         M1                               M2
------------------------        -------------------------
|                       |       |                       |
|  ---------            |       |            ---------  |
|  | veth0 | <-> veth1  |       |  veth1 <-> | veth0 |  |   ...
|  ---------      |     |       |    |       ---------  |
|    netns       br0    |       |   br0        netns    |
|                 |     |       |    |                  |
-----------------eth0---        ----eth0-----------------
                   |                  |
                   |                  |
                   ====== switch ======

It was created using the following commands:
brctl addbr br0
ip addr add dev br0 192.168.60.101/24   # <- machine IP address
ip link set dev br0 up
brctl addif br0 eth0
ifconfig eth0 0.0.0.0 up
ip link add name veth1 type veth peer name veth0
ip link set veth1 up
brctl addif br0 veth1
ip netns add vnode
ip link set dev veth0 netns vnode
# I used a different network to communicate among the network namespaces.
ip netns exec vnode ip addr add 10.144.0.3/24 dev veth0
ip netns exec vnode ip link set dev veth0 up
# I assigned an IP address to the bridge inside the network attributed
# to the network namespaces.
ifconfig br0:1 10.144.0.252/24
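
As a quick sanity check of this setup, connectivity between the
namespaces on two machines can be verified like this (the peer address
10.144.0.4 is only an example following the scheme above):

ip netns exec vnode ip link show veth0     # should be UP
ip netns exec vnode ping -c 3 10.144.0.4   # namespace address on M2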

All machines used in my experiments have this configuration.
The performance issue appears when I run a parallel application inside
network namespaces across 8 physical machines set up as described above.
The parallel applications used during the tests belong to the NAS
benchmarks[1]; the most affected one is CG, class B.
The application runs 16 CPU-intensive processes per machine (1 process
per core) that exchange around 1 GByte of data per execution. (The
mapping of processes to physical machines is of course the same for
both the native and the netns executions.)
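
For illustration only, the netns runs are launched along these lines
(the host file and options are examples, not my exact command line;
the binary name follows the NPB convention <benchmark>.<class>.<nprocs>,
and this assumes the MPI processes can be started inside the
namespaces, i.e. sshd is reachable at the namespace addresses):

# hosts: one line per machine, using the namespace addresses
#   10.144.0.3 slots=16
#   10.144.0.4 slots=16
#   (8 machines in total)
# 8 machines x 16 processes = 128 MPI ranks, class B
mpirun -np 128 --hostfile hosts ./bin/cg.B.128
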
Below is the output of some network counters collected inside the
netns with the command 'netstat -s':

     98 resets received for embryonic SYN_RECV sockets
     2344 TCP sockets finished time wait in fast timer
     986 delayed acks sent
     2 delayed acks further delayed because of locked socket
     Quick ack mode was activated 34279 times
     75 packets directly queued to recvmsg prequeue.
     2313551 packet headers predicted
     781506 acknowledgments not containing data payload received
     1739746 predicted acknowledgments
     22228 times recovered from packet loss by selective acknowledgements
     Detected reordering 130 times using FACK
     Detected reordering 498 times using SACK
     Detected reordering 68 times using time stamp
     2157 congestion windows fully recovered without slow start
     1465 congestion windows partially recovered using Hoe heuristic
     2273 congestion windows recovered without slow start by DSACK
     140 congestion windows recovered without slow start after partial ack
     59007 fast retransmits
     53276 forward retransmits
     140 other TCP timeouts
     TCPLossProbes: 140

The things to note here are the high number of congestion window
recoveries and the number of TCP timeouts.
For an execution with Linux kernel 3.19, the number of congestion
window recoveries is 3 times lower and there are far fewer TCP
timeouts (around 3) than on 4.0.
All this is reflected in the maximum network throughput achieved by
the application: 239 MB/s on 4.0 against 552 MB/s on 3.19.
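
If someone wants to reproduce the comparison between kernels, the
counters can be snapshotted around a run like this (file names are
arbitrary):

ip netns exec vnode netstat -s > netstat.before
# run the benchmark here
ip netns exec vnode netstat -s > netstat.after
diff netstat.before netstat.after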

More details about the configuration used
=========================================

Hardware used
-------------
All machines are equipped with two CPUs with 8 cores each; the output
of '/proc/cpuinfo' is attached.

Memory: 128 GB
Network: 10 Gigabit Ethernet (eth0)
Driver: ixgbe
Storage:     2*600GB HDD / SAS
Driver: ahci

User-space software
--------------------
Debian jessie, OpenMPI version 1.8.5[2], NAS benchmark version 3.3

Linux kernel
------------
The .config files used are attached to this mail.

[1] https://www.nas.nasa.gov/publications/npb.html
[2] http://www.open-mpi.org/software/ompi/v1.8/

View attachment "config-4.0.0" of type "text/plain" (163657 bytes)

View attachment "config-3.19.0" of type "text/plain" (161886 bytes)

View attachment "cpuinfo" of type "text/plain" (16460 bytes)
