Message-ID: <alpine.LFD.2.03.1312031310140.23893@intel.com>
Date: Tue, 3 Dec 2013 13:11:29 -0800 (PST)
From: Joseph Gasparakis <joseph.gasparakis@...el.com>
To: Or Gerlitz <or.gerlitz@...il.com>
cc: Eric Dumazet <eric.dumazet@...il.com>,
Jerry Chu <hkchu@...gle.com>,
Or Gerlitz <ogerlitz@...lanox.com>,
Eric Dumazet <edumazet@...gle.com>,
Alexei Starovoitov <ast@...mgrid.com>,
Pravin B Shelar <pshelar@...ira.com>,
David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>
Subject: Re: vxlan/veth performance issues on net.git + latest kernels
On Tue, 3 Dec 2013, Or Gerlitz wrote:
> On Tue, Dec 3, 2013 at 5:30 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> > On Tue, 2013-12-03 at 17:05 +0200, Or Gerlitz wrote:
> >> I've been lately chasing a performance issue which comes into play when
> >> combining veth and vxlan over a fast Ethernet NIC.
> >>
> >> I came across it while working to enable TCP stateless offloads for
> >> vxlan encapsulated traffic in the mlx4 driver, but I can clearly see the
> >> issue without any HW offloads involved, so it would be easier to discuss
> >> it like that (no offloads involved).
> >>
> >> The setup involves a stacked {veth --> bridge --> vxlan --> IP stack -->
> >> NIC} or {veth --> ovs+vxlan --> IP stack --> NIC} chain.
> >>
> >> Basically, in my testbed which uses iperf over 40Gb/s Mellanox NICs,
> >> vxlan traffic goes up to 5-7Gb/s for a single session and up to 14Gb/s
> >> for multiple sessions, as long as veth isn't involved. Once veth is used
> >> I can't get to > 7-8Gb/s, no matter how many sessions are used. For the
> >> time being, I manually took into account the tunneling overhead and
> >> reduced the veth pair MTU by 50 bytes.
> >>
> >> Looking at the kernel TCP counters in a {veth --> bridge --> vxlan -->
> >> NIC} configuration, on the client side I see lots of hits for the
> >> following TCP counters (the numbers are just a single sample; I look at
> >> the output of iterative sampling every second, e.g. using "watch -d -n 1
> >> netstat -st"):
> >>
> >> 67092 segments retransmited
> >>
> >> 31461 times recovered from packet loss due to SACK data
> >> Detected reordering 1045142 times using SACK
> >> 436215 fast retransmits
> >> 59966 forward retransmits
> >>
> >> Also, on the passive side I see hits for the "Quick ack mode was
> >> activated N times" counter; see below a full snapshot of the counters
> >> from both sides.
> >>
> >> Without using veth, e.g. when running in a {vxlan --> NIC} or {bridge -->
> >> vxlan --> NIC} configuration, I see hits only for the "recovered from
> >> packet loss due to SACK data" and fast retransmits counters, but not for
> >> the forward retransmits or "Detected reordering N times using SACK"
> >> counters. Also, the quick ack mode counter isn't active on the passive side.
> >>
> >> I tested net.git (3.13-rc2+), 3.12.2 and 3.11.9, and I see the same
> >> problems on all. At this point I don't really see a known-good past point
> >> from which to bisect. So I hope this counter report can help shed some
> >> light on the nature of the problem and a possible solution; ideas welcome!!
> >>
> >> Without vxlan, these are the Gb/s results for 1/2/4 streams over 3.12.2;
> >> the results for net.git are pretty much the same.
> >>
> >> 18/32/38 NIC
> >> 17/30/35 bridge --> NIC
> >> 14/23/35 veth --> bridge --> NIC
> >>
> >> With vxlan, these are the Gb/s results for 1/2/4 streams
> >>
> >> 6/12/14 vxlan --> IP --> NIC
> >> 5/10/14 bridge --> vxlan --> IP --> NIC
> >> 6/7/7 veth --> bridge --> vxlan --> IP --> NIC
> >>
> >> Also, the 3.12.2 numbers don't get any better when adding a ported
> >> version of 82d8189826d5 "veth: extend features to support tunneling" on
> >> top of 3.12.2.
> >>
> >> See at the end the sequence of commands I use to set up the environment.
> >>
> >> Or.
> >>
> >>
> >> --> TCP counters from active side
> >>
> >> # netstat -ts
> >> IcmpMsg:
> >> InType0: 2
> >> InType8: 1
> >> OutType0: 1
> >> OutType3: 4
> >> OutType8: 2
> >> Tcp:
> >> 189 active connections openings
> >> 4 passive connection openings
> >> 0 failed connection attempts
> >> 0 connection resets received
> >> 4 connections established
> >> 22403193 segments received
> >> 541234150 segments send out
> >> 14248 segments retransmited
> >> 0 bad segments received.
> >> 5 resets sent
> >> UdpLite:
> >> TcpExt:
> >> 2 invalid SYN cookies received
> >> 178 TCP sockets finished time wait in fast timer
> >> 10 delayed acks sent
> >> Quick ack mode was activated 1 times
> >> 4 packets directly queued to recvmsg prequeue.
> >> 3728 packets directly received from backlog
> >> 2 packets directly received from prequeue
> >> 2524 packets header predicted
> >> 4 packets header predicted and directly queued to user
> >> 19793310 acknowledgments not containing data received
> >> 1216966 predicted acknowledgments
> >> 2130 times recovered from packet loss due to SACK data
> >> Detected reordering 73 times using FACK
> >> Detected reordering 11424 times using SACK
> >> 55 congestion windows partially recovered using Hoe heuristic
> >> TCPDSACKUndo: 457
> >> 2 congestion windows recovered after partial ack
> >> 11498 fast retransmits
> >> 2748 forward retransmits
> >> 2 other TCP timeouts
> >> TCPLossProbes: 4
> >> 3 DSACKs sent for old packets
> >> TCPSackShifted: 1037782
> >> TCPSackMerged: 332827
> >> TCPSackShiftFallback: 598055
> >> TCPRcvCoalesce: 380
> >> TCPOFOQueue: 463
> >> TCPSpuriousRtxHostQueues: 192
> >> IpExt:
> >> InNoRoutes: 1
> >> InMcastPkts: 191
> >> OutMcastPkts: 28
> >> InBcastPkts: 25
> >> InOctets: 1789360097
> >> OutOctets: 893757758988
> >> InMcastOctets: 8152
> >> OutMcastOctets: 3044
> >> InBcastOctets: 4259
> >> InNoECTPkts: 30117553
> >>
> >>
> >>
> >> --> TCP counters from passive side
> >>
> >> netstat -ts
> >> IcmpMsg:
> >> InType0: 1
> >> InType8: 2
> >> OutType0: 2
> >> OutType3: 5
> >> OutType8: 1
> >> Tcp:
> >> 75 active connections openings
> >> 140 passive connection openings
> >> 0 failed connection attempts
> >> 0 connection resets received
> >> 4 connections established
> >> 146888643 segments received
> >> 27430160 segments send out
> >> 0 segments retransmited
> >> 0 bad segments received.
> >> 6 resets sent
> >> UdpLite:
> >> TcpExt:
> >> 3 invalid SYN cookies received
> >> 72 TCP sockets finished time wait in fast timer
> >> 10 delayed acks sent
> >> 3 delayed acks further delayed because of locked socket
> >> Quick ack mode was activated 13548 times
> >> 4 packets directly queued to recvmsg prequeue.
> >> 2 packets directly received from prequeue
> >> 139384763 packets header predicted
> >> 2 packets header predicted and directly queued to user
> >> 671 acknowledgments not containing data received
> >> 938 predicted acknowledgments
> >> TCPLossProbes: 2
> >> TCPLossProbeRecovery: 1
> >> 14 DSACKs sent for old packets
> >> TCPBacklogDrop: 848
> >
> > That's bad: dropping packets on the receiver.
> >
> > Check also "ifconfig -a" to see if the RX drop counter is increasing as well.
> >
> >> TCPRcvCoalesce: 118368414
> >
> > Lack of GRO: the receiver seems to not be able to receive as fast as you want.
> >
> >> TCPOFOQueue: 3167879
> >
> > So, many packets are received out of order (because of losses)
>
> I see that there's no GRO for the non-veth tests which involve vxlan
> either, and there the receiving side is able to consume the packets.
> Do you have a rough explanation why adding veth to the chain is such a
> game changer that makes things start to fall apart?
>
I have seen this before. Here are my findings:
The gso_type is different depending on whether the skb comes from veth or
not. From veth, you will see SKB_GSO_DODGY set. This breaks things: when
the skb with DODGY set moves from vxlan to the driver through
dev_hard_start_xmit(), the stack drops it silently. I never got the time
to find the root cause for this, but I know it causes retransmissions and
a big performance degradation.
I went as far as quickly hacking a one-liner that unsets the DODGY bit in
vxlan.c, and that bypassed the issue and recovered the performance, but
obviously this is not a real fix.
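For illustration, a minimal sketch of the kind of one-liner I mean (the
helper name and where exactly it would be called from vxlan's xmit path
are illustrative assumptions here, not the actual hack):

#include <linux/skbuff.h>

/* Illustrative sketch only: clear the DODGY hint on GSO skbs before they
 * are handed further down the stack. Calling something like this from
 * vxlan's xmit path is an assumption for the example, not a real fix.
 */
static inline void vxlan_clear_gso_dodgy(struct sk_buff *skb)
{
	if (skb_is_gso(skb))
		skb_shinfo(skb)->gso_type &= ~SKB_GSO_DODGY;
}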
>
>
> >
> >> IpExt:
> >> InNoRoutes: 1
> >> InMcastPkts: 184
> >> OutMcastPkts: 26
> >> InBcastPkts: 26
> >> InOctets: 1007364296775
> >> OutOctets: 2433872888
> >> InMcastOctets: 6202
> >> OutMcastOctets: 2888
> >> InBcastOctets: 4597
> >> InNoECTPkts: 702313233
> >>
> >>
> >> client side (node 144)
> >> ----------------------
> >>
> >> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> >> ifconfig vxlan42 192.168.42.144/24 up
> >>
> >> brctl addbr br-vx
> >> ip link set br-vx up
> >>
> >> ifconfig br-vx 192.168.52.144/24 up
> >> brctl addif br-vx vxlan42
> >>
> >> ip link add type veth
> >> brctl addif br-vx veth1
> >> ifconfig veth0 192.168.62.144/24 up
> >> ip link set veth1 up
> >>
> >> ifconfig veth0 mtu 1450
> >> ifconfig veth1 mtu 1450
> >>
> >>
> >> server side (node 147)
> >> ----------------------
> >>
> >> ip link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dev ethN
> >> ifconfig vxlan42 192.168.42.147/24 up
> >>
> >> brctl addbr br-vx
> >> ip link set br-vx up
> >>
> >> ifconfig br-vx 192.168.52.147/24 up
> >> brctl addif br-vx vxlan42
> >>
> >>
> >> ip link add type veth
> >> brctl addif br-vx veth1
> >> ifconfig veth0 192.168.62.147/24 up
> >> ip link set veth1 up
> >>
> >> ifconfig veth0 mtu 1450
> >> ifconfig veth1 mtu 1450
> >>
> >>
> >
> >
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html