netdev - Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150519123415.GC5063@breakpoint.cc>
Date:	Tue, 19 May 2015 14:34:15 +0200
From:	Florian Westphal <fw@...len.de>
To:	David Miller <davem@...emloft.net>
Cc:	fw@...len.de, netdev@...r.kernel.org, hannes@...essinduktion.org,
	edumazet@...gle.com, herbert@...dor.apana.org.au
Subject: Re: [PATCH -next] net: preserve geometry of fragment sizes when
 forwarding

David Miller <davem@...emloft.net> wrote:
> From: Florian Westphal <fw@...len.de>
> Date: Mon, 18 May 2015 23:33:29 +0200
> 
> > Thanks for spending time on this.
> 
> Ok, I've heard what you have to say.
> 
> Of the fixes you've proposed already, I prefer the device MTU one
> because it doesn't penalize the ip_fragment.c optimizations just
> because a netfilter module is loaded.

True, OTOH it only allows removing the br_nf hacks, it doesn't fix
a few corner cases (see below).

> Which of your already proposed patches do you prefer, and why?

The IPCB one, main change from that set:
http://patchwork.ozlabs.org/patch/460252/

> I'm happy to apply one or the other as an interim step to make
> forward progress.  However, I really want to seriously consider
> our long term handling of fragments in netfilter meanwhile.

Here is a summary of the two.

Both allow removal of bridge netfilter mtu kludge that we have in
ip_fragment, but they have different pros and cons.

A. Add mtu argument to ip_fragment.
 pro: actual change is small (most is refactoring)
 pro: obvious code flow for bridge netfilter, we can
  pass device_mtu - reserved_mtu (encap overhead for ppp, vlan, etc)
  as 'new' device mtu, effectively moving this hack from ip_fragment to
  br_netfilter.

 cons: doesn't resolve hypothetical(?) router edge cases:
   #1. if we'd receive e.g. fragment size 1000 and fragment size
   200, where both fragments have DF bit set, we will currently
   send a single 1200 byte DF packet, if outdevice has mtu >= 1200.

   This is because ip_finish_output() only fragments if skb->len > mtu.

   #2 iff we do refragment (e.g. out mtu is 1000), we re-create fragments
   of size 1000 and 200, but DF bit gets lost in the process.

Neither seems to be a big problem, #2 doesn't break end-to-end connectivity,
and #1 doesn't seem to happen in practice; else we should have seen
bug reports by now.  However, I do feel uneasy about it and think we should
fix it.

last A series diffstat:
 include/linux/netfilter_bridge.h |    7 -------
 include/net/ip.h                 |    2 +-
 net/bridge/br_netfilter.c        |   34 +++++++++++++++++++++++++++++++---
 net/ipv4/ip_output.c             |   29 ++++++++++++++++++-----------
 4 files changed, 50 insertions(+), 22 deletions(-)

B. store largest fragment size in IPCB and use that.
  pro: we already do this to reject refragmented skbs that had DF set if
  the largest original fragment is larger than the outdevice mtu, so this
  change is not as big as one might think.

  The main aspects of this change is that ip_fragment will do

  mtu = IPCB(skb)->largest_frag_size;
  if (!mtu)
	mtu = ip_skb_dst_mtu(skb);

  instead of just calling ip_skb_dst_mtu(), and that we will call
  ip_fragment on all skbs that have new IPSKB_FRAG_PMTU flag set, i.e.

  if ((IPCB(skb)->flags & IPSKB_FRAG_PMTU) || skb->len > mtu)
	ip_fragment()

  This will prevent us from sending packets that are larger than
  the biggest original fragment we originally got.

  bridge netfilter can make use of this; since we store largest
  size seen we can just remove the mtu encap overhead calculations.

  For ipv6, bridge netfilter will have to be fixed
  to preserve IP6CB largest size, just like we do for IPCB.
  This will be a bridge netfilter specific change only.

last B series had this diffstat:
 include/net/inet_frag.h |    2 +-
 include/net/ip.h        |    1 +
 net/ipv4/ip_forward.c   |   18 +++++++++++-------
 net/ipv4/ip_fragment.c  |   31 ++++++++++++++++++++++++++-----
 net/ipv4/ip_output.c    |   37 ++++++++++++++++++++++++++++---------
 net/ipv6/ip6_output.c   |   29 ++++++++++++++++++-----------
 6 files changed, 85 insertions(+), 33 deletions(-)

My suggestion would be to go with #B.  Its a more intrusive change compared to A,
but it allows to get rid of br netfilter hack while also resolving silly and/or
rare corner cases, such as e.g. reassembling two ipv6 fragments and then failing
to undo it on forward.

Neither approach prevents us from making skb_has_frag_list() fast-path
in ip_fragment work again at a later point.

I can rebase and resubmit both proposals if you prefer so you can review
them in more detail.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html