[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180420231311.xissamn6go5fznua@kafai-mbp>
Date: Fri, 20 Apr 2018 16:13:13 -0700
From: Martin KaFai Lau <kafai@...com>
To: Julian Anastasov <ja@....bg>
CC: <netdev@...r.kernel.org>, Tom Herbert <tom@...bertland.com>,
Eric Dumazet <eric.dumazet@...il.com>,
Nikita Shirokov <tehnerd@...com>, <kernel-team@...com>
Subject: Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
>
> Hello,
>
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
>
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> >
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> >
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing. In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> >
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU. In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> >
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
>
> Before I go to deeply analyze the GSO code, here
> are some preliminary observations:
>
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
>
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such call can help. At least, I don't see what
> different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()
One thing may worth to highlight is that this problem sort of exist
even before commit 815d22e55b0e, it just happened that the ipv6_gso_segment()
error out (-EPROTONOSUPPORT) which then stops sending it down to the driver.
>
> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error out, it avoid passing it to the driver. The TCP
> > stack then timeout and the TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> >
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() which segment the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> >
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412] ? skb_release_data+0xfc/0x120
> > [ 1129.029587] ? kfree_skbmem+0x64/0x70
> > [ 1129.036905] dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262] sch_direct_xmit+0x10f/0x280
> > [ 1129.053111] __qdisc_run+0x223/0x5a0
> > [ 1129.060251] __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268] dev_queue_xmit+0x10/0x20
> > [ 1129.075573] ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218] ip6_finish_output2+0x2db/0x490
> > [ 1129.091573] ip6_finish_output+0x125/0x190
> > [ 1129.099754] ip6_output+0x5f/0x100
> > [ 1129.106548] ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212] ip6_local_out+0x35/0x40
> > [ 1129.121356] ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443] ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837] ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578] ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493] ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273] ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109] ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849] nf_hook_slow+0x3e/0xc0
> > [ 1129.195800] ip6_xmit+0x30b/0x540
> > [ 1129.202421] ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909] inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207] ? lock_timer_base+0x76/0xa0
> > [ 1129.225043] tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051] tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885] __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106] tcp_push+0xde/0x100
> > [ 1129.256554] tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910] tcp_sendmsg+0x2c/0x50
> > [ 1129.271703] inet_sendmsg+0x31/0xb0
> > [ 1129.278672] sock_write_iter+0xf8/0x110
> > [ 1129.286335] new_sync_write+0xd9/0x120
> > [ 1129.293823] vfs_write+0x18d/0x1e0
> > [ 1129.300614] SyS_write+0x48/0xa0
> > [ 1129.307045] do_syscall_64+0x69/0x1e0
> > [ 1129.314361] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> >
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> >
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> >
> > [host-a] > ip -6 a
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
> > inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> > valid_lft forever preferred_lft forever
> > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
> > inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> > valid_lft forever preferred_lft forever
> >
> > [host-a] > cat /proc/net/ip_vs
> > TCP [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> > -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB Tunnel 6772 9 6
> > -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB Tunnel 6772 8 6
> > -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB Tunnel 6772 19 7
> >
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> >
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also side step the issue. Like this:
>
> Yes, reducing the MTU was a solution recommended from
> long time ago in the IPVS HOWTO.
>
> >
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> >
> > Signed-off-by: Martin KaFai Lau <kafai@...com>
> > ---
> > net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> > 1 file changed, 33 insertions(+), 16 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> > ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> > }
> >
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> > int rt_mode,
> > struct ip_vs_iphdr *ipvsh,
> > struct sk_buff *skb, int mtu)
> > {
> > + struct netns_ipvs *ipvs = cp->ipvs;
> > +
> > #ifdef CONFIG_IP_VS_IPV6
> > - if (skb_af == AF_INET6) {
> > + if (cp->af == AF_INET6) {
> > struct net *net = ipvs->net;
> >
> > if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > }
> > }
> >
> > + if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > + const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > + unsigned short hdr_len = (th->doff << 2) +
> > + skb_network_header_len(skb);
> > +
> > + if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > + skb_decrease_gso_size(skb_shinfo(skb),
> > + skb_shinfo(skb)->gso_size -
> > + (mtu - hdr_len));
>
> So, software segmentation happens and we want the
> tunnel header to be accounted immediately and not after PMTU
> probing period?
I think accounting tunnel header correctly early on is the expected
behavior. Otherwise, if we choose to drop the big skb instead (either
gso_segment error out, driver error out or NIC drops it from tx queue),
the first few packets of a TCP connection will guarantee to
experience timeout.
> Is this a problem only for IPVS tunnels?
> Do we see such delays with other tunnels?
Based on what I understand, a tun dev (setup by iproute2) should
not experience this issue because the MTU of that tun dev has
already accounted for the extra header.
Our current work around using a smaller MTU in the local route has
similar effect (in the sense that introducing a smaller MTU betewen
the phyical eth0 and the TCP stack).
Thanks for looking into it!
Martin
> May be this should
> be solved for all protocols (not just TCP) and for all tunnels?
> Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> local sender, so we should really alter the gso_size for proper
> segmentation?
>
> Regards
>
> --
> Julian Anastasov <ja@....bg>
Powered by blists - more mailing lists