Message-Id: <201503222148.00332.wrosner@tirnet.de>
Date: Sun, 22 Mar 2015 21:47:58 +0100
From: Wolfgang Rosner <wrosner@...net.de>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev@...r.kernel.org
Subject: Re: One way TCP bottleneck over 6 x Gbit teql aggregated link
Hello, Eric,
On Sunday, 22 March 2015 17:54:54, Eric Dumazet wrote:
>
> I see nothing wrong but a single retransmit.
> You might be bitten by tcp_metric cache.
> check /proc/sys/net/ipv4/tcp_no_metrics_save
OK, let's go testing.
reference figures before any changes:
gateway -> blade 3.21 Gbits/sec
blade -> blade 5.49 Gbits/sec
both on gateway and on blade:
root@...ncher:/home/tmp/netdev# cat /proc/sys/net/ipv4/tcp_no_metrics_save
0
root@...de-002:~# cat /proc/sys/net/ipv4/tcp_no_metrics_save
0
> (set it to 0, and flush tcp metric cache :
> echo 0 >/proc/sys/net/ipv4/tcp_no_metrics_save
no need to do so then, right?
> ip tcp_metrics flush
root@...ncher:/home/tmp/netdev# ip tcp_metrics flush
Object "tcp_metrics" is unknown, try "ip help".
Looks like debian-stable's iproute tools are quite old, aren't they?
So I suppose the iproute2 package you pointed me to will do the job.
Is it OK this way, or should I install and configure it on my systems?
Or will the iproute2 from debian backport do?
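(For the test below, I simply built the current iproute2 from Stephen
Hemminger's git tree and ran the ip binary straight out of the build
directory. Roughly, from memory, and assuming the usual build deps like
flex and bison are already installed:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
cd iproute2
make
./ip/ip tcp_metrics show     # the freshly built binary knows the tcp_metrics object

No installation needed for a quick test.)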
btw: the teql aggregation is constructed by ip and tc commands which are part
of the iproute2 package, right? Could the old toolbox, maybe in connection with
a recent 3.19 vanilla kernel, cause the problem?
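For reference, this is roughly how the aggregation is built on the gateway
side (a sketch from memory, with placeholder addressing; the real per-link
/27 subnets are in the routing tables further down):

modprobe sch_teql                            # teql scheduler module, creates teql0
for dev in eth2 eth3 eth4 eth5 eth6 eth7 ; do
        tc qdisc add dev $dev root teql0     # enslave the physical link to teql0
        ip link set dev $dev mtu 9000        # jumbo frames on the slave link
done
ip link set dev teql0 mtu 9000 up            # jumbo frames on the aggregate as well
ip addr add 192.168.130.250/27 dev teql0     # address in the teql subnet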
Anyway, back to the metric cache; I flushed it with the freshly built ip
(both on the blades and on the gateway):
/home/build/iproute2/shemminger/iproute2/ip/ip tcp_metrics flush
root@...de-002:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.130.226 port 5001 connected with 192.168.130.250 port 49522
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 3.78 GBytes 3.25 Gbits/sec
root@...ncher:/home/tmp/netdev# iperf -c 192.168.130.226
------------------------------------------------------------
Client connecting to 192.168.130.226, TCP port 5001
TCP window size: 488 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.130.250 port 49522 connected with 192.168.130.226 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 3.78 GBytes 3.25 Gbits/sec
Repeated tests: 3.25, 3.25, 3.12 Gbits/sec (it was 3.12 before).
Sad to say: no effect.
>
>
> Also, not clear why you use jumbo frames with 1Gbit NIC (do not do that,
> really), and it seems TSO is disabled on your NIC(s) ????
Because it increased throughput by a factor of three.
Jumbo frames were THE ONE AND ONLY really helpful tuning parameter.
What is wrong with jumbo frames in your view?
That's what many TCP tuning tips out there on the internet recommend.
I could follow their argument that the smaller number of chunks to transfer
reduces system load.
And my half-educated guess was: even if GBit NICs do a lot of work in
hardware / firmware, there is still the teql layer, which I suppose is handled
100 % in the CPU, right?
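(A check I could still do: watch the per-core load while iperf runs, to see
whether a single CPU gets pegged by the teql / softirq work, e.g.

mpstat -P ALL 1     # per-CPU utilization, 1 second intervals (sysstat package)

during a test run. I suppose that would show it.)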
From my early setup test notes:
default MTU (1500):                                        1622.65 MBit/s
jumbo frames (MTU = 9000, on both eth links and on teql):  5143.78 MBit/s
This was on blade-blade links.
If I had this figure on gateway->blade link too, I would be glad and not
bother you.
> and it seems TSO is disabled on your NIC(s) ????
> for dev in eth0 eth1 eth2 eth3 ....
> do
> ethtool -k $dev
> done
TSO also known as tcp-segmentation-offload, right?
let me grep a little to keep the mail text readable.
I'll put the unfiltered ethtool -k outputs as attachments for further
scrutiny.
On the gateway:
root@...ncher:~# for i in `seq 0 9` ; do ethtool -k eth$i | grep 'Features\|segmentation-offload' ; done
Features for eth0:
tcp-segmentation-offload: off
generic-segmentation-offload: off [requested on]
Features for eth1:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth2:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth3:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth4:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth5:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth6:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth7:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth8:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth9:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Don't care for the first two:
eth0 is the external uplink and mostly used for ssh
eth1 is a 100Mbit link for the blade center admin subnet
This is what we need:
eth2 .... eth7 contribute the teql links towards the blades.
On the blades:
root@...de-002:~# for i in `seq 0 5` ; do ethtool -k eth$i | grep 'Features\|segmentation-offload' ; done
Features for eth0:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth1:
tcp-segmentation-offload: on
generic-segmentation-offload: on
Features for eth2:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth3:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth4:
tcp-segmentation-offload: off
generic-segmentation-offload: on
Features for eth5:
tcp-segmentation-offload: off
generic-segmentation-offload: on
This may be due to different card types:
root@...de-002:~# lspci | grep -i ether
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
16:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
16:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
18:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
18:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet (rev a3)
lspci -k ....
eth{0,1}
Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
Kernel driver in use: bnx2
eth{2,3,4,5}
Subsystem: Hewlett-Packard Company NC325m PCIe Quad Port Adapter
Kernel driver in use: tg3
I remember some discussion on the web that generic-segmentation-offload would
be preferable over tcp-segmentation-offload anyway.
Nevertheless, let's give it a try; but the card does not want to:
root@...de-002:~# ethtool -K eth4 tso on
Could not change any device features
Do I interpret this right that the card / the driver does not support TSO?
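(If I read the ethtool documentation right, a feature the driver cannot
change at all should be reported as "off [fixed]", so something like

ethtool -k eth4 | grep -i fixed

should tell whether tso is hard-disabled by the tg3 driver or merely switched
off. Correct me if that interpretation is wrong.)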
Just for curiosity:
I also tried to set tso off for all cluster links eth? on the gateway
.... 3.31 Gbits/sec -> no effect
But it is still the old picture: over 2 different connections, I can easily
saturate the link:
# iperf -c 192.168.130.226 -P2
------------------------------------------------------------
Client connecting to 192.168.130.226, TCP port 5001
TCP window size: 780 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.130.250 port 49543 connected with 192.168.130.226 port 5001
[ 3] local 192.168.130.250 port 49542 connected with 192.168.130.226 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 3.29 GBytes 2.82 Gbits/sec
[ 3] 0.0-10.0 sec 3.30 GBytes 2.84 Gbits/sec
[SUM] 0.0-10.0 sec 6.59 GBytes 5.66 Gbits/sec
And doing it the other way round, I can saturate the link with only one
connection:
# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 488 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.130.250 port 5001 connected with 192.168.130.226 port 60139
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 6.52 GBytes 5.60 Gbits/sec
root@...de-002:~# iperf -c 192.168.130.250
------------------------------------------------------------
Client connecting to 192.168.130.250, TCP port 5001
TCP window size: 325 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.130.226 port 60139 connected with 192.168.130.250 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 6.52 GBytes 5.60 Gbits/sec
****************************************************************
What puzzles me are the differences in TCP window size:
Does it relate to the tcp_metric_cache issue?
What we also might consider is that one of the physical links is used as "boot net".
As you can see, there are two addresses assigned to the first physical link
(eth6 on gateway, eth0 on blade)
Subnet 192.168.130.0/27 is used for boot (PXE, tftp, nfsroot)
Subnet 192.168.130.32/27 is the first teql assigned subnet.
While there is no traffic blade <-> blade and not much blade -> gateway, during
boot all the nfsroot traffic is transferred gateway -> blade, which is exactly
the direction in which we see the bad teql performance.
I don't think there is much other transfer going on during the iperf tests,
so it is not an issue of congestion.
But could it be that the boot traffic "skews" the tcp_metric cache?
Could even just a small number of transfers unbalance the teql flow?
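(Maybe I should also dump what the cache actually stores for that blade with
the freshly built ip, e.g.

/home/build/iproute2/shemminger/iproute2/ip/ip tcp_metrics show 192.168.130.226

once right after a boot and once after an iperf run, and compare. Just a
guess on my side.)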
root@...ncher:~# ip route
default via 192.168.0.18 dev eth0
192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.53
192.168.129.0/24 dev eth1 proto kernel scope link src 192.168.129.250
192.168.130.0/27 dev eth6 proto kernel scope link src 192.168.130.30
192.168.130.32/27 dev eth6 proto kernel scope link src 192.168.130.62
192.168.130.64/27 dev eth7 proto kernel scope link src 192.168.130.94
192.168.130.96/27 dev eth4 proto kernel scope link src 192.168.130.126
192.168.130.128/27 dev eth5 proto kernel scope link src 192.168.130.158
192.168.130.160/27 dev eth2 proto kernel scope link src 192.168.130.190
192.168.130.192/27 dev eth3 proto kernel scope link src 192.168.130.222
192.168.130.224/27 dev teql0 proto kernel scope link src 192.168.130.250
root@...de-002:~# ip ro
default via 192.168.130.30 dev eth0
192.168.130.0/27 dev eth0 proto kernel scope link src 192.168.130.2
192.168.130.32/27 dev eth0 proto kernel scope link src 192.168.130.34
192.168.130.64/27 dev eth1 proto kernel scope link src 192.168.130.66
192.168.130.96/27 dev eth2 proto kernel scope link src 192.168.130.98
192.168.130.128/27 dev eth3 proto kernel scope link src 192.168.130.130
192.168.130.160/27 dev eth4 proto kernel scope link src 192.168.130.162
192.168.130.192/27 dev eth5 proto kernel scope link src 192.168.130.194
192.168.130.224/27 dev teql0 proto kernel scope link src 192.168.130.226
Sorry if things are weird this way.
But I wouldn't have bothered this venerable list if there had been a simple
solution....
Wolfgang Rosner
Attachment: eth_offload_blade-002.log (text/x-log, 7694 bytes)
Attachment: eth_offload_gateway.log (text/x-log, 12179 bytes)