Date:	Sun, 22 Mar 2015 21:47:58 +0100
From:	Wolfgang Rosner <wrosner@...net.de>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org
Subject:	Re: One way TCP bottleneck over 6 x Gbit teql aggregated link

Hello, Eric,

Am Sonntag, 22. März 2015 17:54:54 schrieb Eric Dumazet:
>
> I see nothing wrong but a single retransmit.
> You might be bitten by tcp_metric cache.
> check /proc/sys/net/ipv4/tcp_no_metrics_save

OK, let's go testing.

reference figures before any changes:
gateway -> blade 	3.21 Gbits/sec
blade -> blade		5.49 Gbits/sec


both on gateway and on blade:

root@...ncher:/home/tmp/netdev# cat /proc/sys/net/ipv4/tcp_no_metrics_save
0
root@...de-002:~# cat /proc/sys/net/ipv4/tcp_no_metrics_save
0

> (set it to 0, and flush tcp metric cache :
> echo 0 >/proc/sys/net/ipv4/tcp_no_metrics_save

No need to do so, right?

> ip tcp_metrics flush

root@...ncher:/home/tmp/netdev# ip tcp_metrics flush
	Object "tcp_metrics" is unknown, try "ip help".


Looks like debian-stable's iproute tools are quite old, aren't they?
So I suppose the iproute2 package you pointed me to will do the job.
Is it OK this way, or should I install and configure it on my systems?
Or will the iproute2 from debian backports do?
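
(To pin down what I actually have installed versus what the kernel expects, I would simply compare versions, something along these lines:

	dpkg -l 'iproute*' | grep '^ii'    # packaged iproute2 version
	ip -V                              # version of the ip binary in PATH
	uname -r                           # running kernel

Nothing exotic, just to rule out a plain version mismatch.)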

btw: the teql aggregation is constructed by ip and tc commands which are part of the iproute2 package, right? Could the old toolbox, maybe in combination with a recent 3.19 vanilla kernel, cause the problem?
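
For context, the aggregation is built roughly like this on the gateway (a minimal sketch of my setup; slave names eth2..eth7 and the teql0 address are taken from the route table further below):

	modprobe sch_teql
	tc qdisc add dev eth2 root teql0
	tc qdisc add dev eth3 root teql0
	# ... one such line per slave link, up to eth7 ...
	ip link set dev teql0 up
	ip addr add 192.168.130.250/27 dev teql0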


Anyway, I flushed the cache with the freshly built binary:

/home/build/iproute2/shemminger/iproute2/ip/ip tcp_metrics flush
(both on blades and on gateway)



root@...de-002:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.130.226 port 5001 connected with 192.168.130.250 port 49522
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  3.78 GBytes  3.25 Gbits/sec


root@...ncher:/home/tmp/netdev# iperf -c 192.168.130.226
------------------------------------------------------------
Client connecting to 192.168.130.226, TCP port 5001
TCP window size:  488 KByte (default)
------------------------------------------------------------
[  3] local 192.168.130.250 port 49522 connected with 192.168.130.226 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  3.78 GBytes  3.25 Gbits/sec


Repeated tests: 3.25, 3.25, 3.12 Gbits/sec.
It was 3.12 before.
Sad to say: no effect.

>
>
> Also, not clear why you use jumbo frames with 1Gbit NIC (do not do that,
> really), and it seems TSO is disabled on your NIC(s) ????

Because it increased throughput by a factor of three.

Jumbo frames were THE ONE AND ONLY really helpful tuning parameter.

What is wrong with jumbo frames in your view?

That's what many TCP tuning tips out there on the internet recommend.
I could follow their argument that the smaller number of chunks to transfer reduces system load.

And my half-educated guess was: even if GBit NICs do a lot of work in hardware / firmware, there is still the teql layer, which I suppose is 100 % handled by the CPU, right?
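
(A crude way to test that guess, assuming sysstat is installed, is to watch per-CPU load while iperf runs:

	mpstat -P ALL 1              # one sample per second, per CPU
	iperf -c 192.168.130.226     # in another terminal

If a single core sits near 100 % in sys/softirq while the others idle, the CPU path is the bottleneck.)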



From my early setup test notes:

default MTU (1500)
	1622.65 MBit / s

jumbo frames (MTU = 9000)
(on both eth links and on teql)
	5143.78 MBit / s

This was on blade-to-blade links.
If I had this figure on the gateway->blade link too, I would be glad and would not bother you.
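
For completeness, the MTU change was applied roughly like this, on each slave link and on the teql device itself (interface names from my setup):

	ip link set dev eth2 mtu 9000
	# ... repeated for every slave interface ...
	ip link set dev teql0 mtu 9000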


> and it seems TSO is disabled on your NIC(s) ????
> for dev in eth0 eth1 eth2 eth3 ....
> do
>  ethtool -k $dev
> done

TSO, also known as tcp-segmentation-offload, right?
Let me grep a little to keep the mail text readable.
I'll put the unfiltered ethtool -k outputs as attachments for further 
scrutiny.


On the gateway:

root@...ncher:~# for i in `seq 0 9` ; do  ethtool -k  eth$i | grep 'Features\|segmentation-offload'  ; done
Features for eth0:
tcp-segmentation-offload: off
generic-segmentation-offload: off [requested on]
Features for eth1:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth2:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth3:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth4:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth5:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth6:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth7:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth8:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth9:
tcp-segmentation-offload: on
generic-segmentation-offload: on

Don't worry about the first two:
eth0 is the external uplink, mostly used for ssh.
eth1 is a 100 Mbit link for the blade center admin subnet.

This is what we need:
eth2 .... eth7 contribute the teql links towards the blades.


On the blades:

root@...de-002:~# for i in `seq 0 5` ; do  ethtool -k  eth$i | grep 'Features\|segmentation-offload'  ; done
Features for eth0:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth1:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth2:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth3:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth4:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth5:
tcp-segmentation-offload: off
generic-segmentation-offload: on


This may be due to different card types:

root@...de-002:~# lspci | grep -i ether
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
16:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
16:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
18:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
18:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)

lspci -k .... 
eth{0,1}
        Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
        Kernel driver in use: bnx2

eth{2,3,4,5}
        Subsystem: Hewlett-Packard Company NC325m PCIe Quad Port Adapter
        Kernel driver in use: tg3


I remember some discussion on the web that generic-segmentation-offload would be preferable over tcp-segmentation-offload anyway.
Nevertheless, let's give it a try, but the card does not want to cooperate:

root@...de-002:~# ethtool -K eth4 tso on
Could not change any device features

Do I interpret this correctly when I say the card/driver does not support TSO?
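
(If I read the ethtool output format right, features the driver cannot change are marked "[fixed]", so this should confirm it:

	ethtool -k eth4 | grep tcp-segmentation-offload
	# expected: tcp-segmentation-offload: off [fixed]

The full, unfiltered outputs are in the attachments.)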


Just out of curiosity:
I tried to set tso off for all cluster links eth? on the gateway, too
	.... 3.31 Gbits/sec -> no effect


But still the old picture: over 2 different connections, I can easily saturate the link:


# iperf -c 192.168.130.226 -P2
------------------------------------------------------------
Client connecting to 192.168.130.226, TCP port 5001
TCP window size:  780 KByte (default)
------------------------------------------------------------
[  4] local 192.168.130.250 port 49543 connected with 192.168.130.226 port 5001
[  3] local 192.168.130.250 port 49542 connected with 192.168.130.226 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  3.29 GBytes  2.82 Gbits/sec
[  3]  0.0-10.0 sec  3.30 GBytes  2.84 Gbits/sec
[SUM]  0.0-10.0 sec  6.59 GBytes  5.66 Gbits/sec


And doing it the other way round, I can saturate the link with only one connection:

# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  488 KByte (default)
------------------------------------------------------------
[  4] local 192.168.130.250 port 5001 connected with 192.168.130.226 port 60139
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  6.52 GBytes  5.60 Gbits/sec

root@...de-002:~# iperf -c 192.168.130.250
------------------------------------------------------------
Client connecting to 192.168.130.250, TCP port 5001
TCP window size:  325 KByte (default)
------------------------------------------------------------
[  3] local 192.168.130.226 port 60139 connected with 192.168.130.250 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  6.52 GBytes  5.60 Gbits/sec


****************************************************************

What puzzles me are the differences in TCP window size:
Do they relate to the tcp_metrics cache issue?
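
(With the freshly built ip binary, the cached per-destination values can at least be inspected; assuming I read the man page right, something like

	/home/build/iproute2/shemminger/iproute2/ip/ip tcp_metrics show | grep 192.168.130.226

should print any saved rtt/cwnd values for that blade.)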

What we also might consider is that one of the physical links is used as "boot net".


As you can see below, there are two addresses assigned to the first physical link (eth6 on the gateway, eth0 on the blade).
Subnet 192.168.130.0/27 is used for boot (PXE, tftp, nfsroot).
Subnet 192.168.130.32/27 is the first teql-assigned subnet.

While there is no traffic blade <-> blade and not much blade -> gateway, during boot all the nfsroot traffic is transferred gateway -> blade, right in the direction where we see the bad teql performance.

I don't think that there is much transfer during the iperf tests, so it is not an issue of congestion.
But could it be that the boot traffic "skews" the tcp_metrics cache?
Could even just a small number of transfers imbalance the teql flow?
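
One way to see whether the flow is spread evenly over the slave links would be to compare the TX byte counters on the gateway before and after an iperf run; a rough sketch, with the slave names from above:

	for i in 2 3 4 5 6 7 ; do ip -s link show dev eth$i | grep -A1 'TX:' ; done

If one link carries much more than the others, the teql round-robin is being skewed.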


root@...ncher:~# ip route
default via 192.168.0.18 dev eth0
192.168.0.0/24 dev eth0  proto kernel  scope link  src 192.168.0.53
192.168.129.0/24 dev eth1  proto kernel  scope link  src 192.168.129.250
192.168.130.0/27 dev eth6  proto kernel  scope link  src 192.168.130.30
192.168.130.32/27 dev eth6  proto kernel  scope link  src 192.168.130.62
192.168.130.64/27 dev eth7  proto kernel  scope link  src 192.168.130.94
192.168.130.96/27 dev eth4  proto kernel  scope link  src 192.168.130.126
192.168.130.128/27 dev eth5  proto kernel  scope link  src 192.168.130.158
192.168.130.160/27 dev eth2  proto kernel  scope link  src 192.168.130.190
192.168.130.192/27 dev eth3  proto kernel  scope link  src 192.168.130.222
192.168.130.224/27 dev teql0  proto kernel  scope link  src 192.168.130.250


root@...de-002:~# ip ro
default via 192.168.130.30 dev eth0
192.168.130.0/27 dev eth0  proto kernel  scope link  src 192.168.130.2
192.168.130.32/27 dev eth0  proto kernel  scope link  src 192.168.130.34
192.168.130.64/27 dev eth1  proto kernel  scope link  src 192.168.130.66
192.168.130.96/27 dev eth2  proto kernel  scope link  src 192.168.130.98
192.168.130.128/27 dev eth3  proto kernel  scope link  src 192.168.130.130
192.168.130.160/27 dev eth4  proto kernel  scope link  src 192.168.130.162
192.168.130.192/27 dev eth5  proto kernel  scope link  src 192.168.130.194
192.168.130.224/27 dev teql0  proto kernel  scope link  src 192.168.130.226



Sorry if things are weird that way.
But I wouldn't have bothered this venerable list if there had been a simple solution....



Wolfgang Rosner

View attachment "eth_offload_blade-002.log" of type "text/x-log" (7694 bytes)

View attachment "eth_offload_gateway.log" of type "text/x-log" (12179 bytes)
