netdev - Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB


Message-ID: <20160705170541.3f210675@pixies>
Date:	Tue, 5 Jul 2016 17:05:41 +0300
From:	Shmulik Ladkani <shmulik.ladkani@...ellosystems.com>
To:	Florian Westphal <fw@...len.de>
Cc:	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Hannes Frederic Sowa <hannes@...essinduktion.org>,
	shmulik.ladkani@...il.com, netdev@...r.kernel.org
Subject: Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen
 exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

On Tue, 5 Jul 2016 15:03:27 +0200, fw@...len.de wrote:
> > The expected behavior in such a setup would be segmenting the skb first,
> > and then fragmenting each segment according to dst mtu, and finally
> > passing the resulting fragments to ip_finish_output2.
> > 
> > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > but it is only considered if IPSKB_FORWARDED is set.
> > 
> > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > behavior is not considered.
> 
> I placed this test there under the assumption that L2 bridges have
> the same MTU on all bridge ports, so we'd only need to consider routing
> case.

In our setups we have no control of VM mtu (which affects gso_size of
packets arriving from tap), and no control of vxlan's underlay mtu.

> How does work if e.g. 1460-sized udp packet arrives on tap0?
> Do we fragment (possibly ignoring DF?)

A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
vxlan mtu is sufficient), and the original DF of the inner packet is
preserved.

The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
(unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
fragmented normally at the ip_finish_output --> ip_fragment code path.

So on the wire we have 2 frags of the vxlan datagram; they are reassembled
by the recipient's ip stack at the vxlan termination. The inner packet is
preserved. Not ideal, but works.
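
To illustrate the "2 frags" figure, here is a minimal sketch of the fragment
arithmetic. It is not the kernel's ip_fragment code; the helper name
ipv4_frag_count and the example sizes are assumptions for illustration
(IPv4 header without options, a 1460-byte inner IP packet, and 50 bytes of
vxlan overhead: outer IP 20 + UDP 8 + VXLAN 8 + inner Ethernet 14).

```c
enum { IPV4_HLEN = 20 };	/* assumed: IPv4 header without options */

/*
 * Hypothetical helper: number of IPv4 fragments needed to carry a packet of
 * pkt_len bytes (IP header included) over a link with the given mtu.
 * Assumes pkt_len >= IPV4_HLEN and mtu >= IPV4_HLEN + 8.
 */
static unsigned int ipv4_frag_count(unsigned int pkt_len, unsigned int mtu)
{
	unsigned int payload = pkt_len - IPV4_HLEN;
	/* every fragment but the last carries a multiple of 8 payload bytes */
	unsigned int per_frag = (mtu - IPV4_HLEN) & ~7u;

	if (pkt_len <= mtu)
		return 1;	/* fits as is, no fragmentation */

	return (payload + per_frag - 1) / per_frag;	/* round up */
}
```

With those assumed sizes the encapsulated packet is 1460 + 50 = 1510 bytes,
and ipv4_frag_count(1510, 1500) gives 2.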

The issue is with GSO skbs arriving from tap, which eventually generate
segments larger than the mtu, which are not transmitted on eth0:

  tap0 rx:  super packet, gso_size from user's virtio_net_hdr
    ...
    vxlan0 tx:  encaps the super packet
      ...
      ip_finish_output
        ip_finish_output_gso
          *NO* skb_gso_validate_mtu()     <--- problem here
            ip_finish_output2:  tx the encapsulated super packet on eth0
              ...
              validate_xmit_skb
                netif_needs_gso
                  skb_gso_segment: segments inner payload according to
                                   original gso_size,
                                   leads to vxlan datagrams larger than mtu
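
As a rough illustration of why the resulting segments exceed the mtu, the
sketch below adds up the per-segment length seen at the outer network header,
which is the quantity skb_gso_network_seglen is meant to measure. The helper
names and header sizes are assumptions for illustration only (IPv4 and TCP
without options, vxlan over IPv4); this is not kernel code.

```c
#include <stdbool.h>

/* Assumed header sizes, no options anywhere */
enum {
	OUTER_IPV4_HLEN	= 20,
	OUTER_UDP_HLEN	= 8,
	VXLAN_HLEN	= 8,
	INNER_ETH_HLEN	= 14,
	INNER_IPV4_HLEN	= 20,
	INNER_TCP_HLEN	= 20,
};

/*
 * Hypothetical helper: length of one vxlan-encapsulated segment, counted
 * from the outer IP header, when the inner payload is cut at gso_size.
 */
static unsigned int encap_network_seglen(unsigned int gso_size)
{
	return OUTER_IPV4_HLEN + OUTER_UDP_HLEN + VXLAN_HLEN +
	       INNER_ETH_HLEN + INNER_IPV4_HLEN + INNER_TCP_HLEN + gso_size;
}

/* Hypothetical helper: would each segment fit the egress device mtu? */
static bool seg_fits_mtu(unsigned int gso_size, unsigned int mtu)
{
	return encap_network_seglen(gso_size) <= mtu;
}
```

For example, a gso_size of 1460 taken from the VM gives segments of
90 + 1460 = 1550 bytes, which do not fit an eth0 mtu of 1500; with these
assumed headers, only a gso_size of 1410 or less would fit.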

> How does it work for non-ip protocols?

The specific problem is with vxlan (or any other udp based tunnel)
encapsulated GSOed packets.

> (Or did I misunderstand this setup...?)

tap0 bridged with vxlan0.
route to vxlan0's remote peer is via eth0, configured with small mtu.

Regards,
Shmulik