[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO1wt+b4OkJ4DmJ7_hUj_oKj4p_rRMuKUVpiBY7MSFMa=0WKQg@mail.gmail.com>
Date: Tue, 15 Nov 2016 17:49:43 +0100
From: Vicente Jiménez <googuy@...il.com>
To: David Miller <davem@...emloft.net>
Cc: Alexey Kuznetsov <kuznet@....inr.ac.ru>,
James Morris <jmorris@...ei.org>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
Patrick McHardy <kaber@...sh.net>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] icmp: Restore resistence to abnormal messages
On Mon, Nov 14, 2016 at 7:36 PM, David Miller <davem@...emloft.net> wrote:
> From: Vicente Jimenez Aguilar <googuy@...il.com>
> Date: Fri, 11 Nov 2016 21:20:18 +0100
>
>> @@ -819,6 +820,12 @@ static bool icmp_unreach(struct sk_buff *skb)
>> /* fall through */
>> case 0:
>> info = ntohs(icmph->un.frag.mtu);
>> + /* Handle weird case where next hop MTU is
>> + * equal to or exceeding dropped packet size
>> + */
>> + old_mtu = ntohs(iph->tot_len);
>> + if (info >= old_mtu)
>> + info = old_mtu - 2;
>
> This isn't something the old code did.
>
> The old code behaved much differently.
>
I don't wanted to restore old behavior just fix a strange case that
was handle by this code where the next hop MTU reported by the router
is equal or greater than the actual path MTU. Because router
information is wrong, we need a way to guess a good packet size
ignoring router data. The simplest strategy that avoid odd numbers is
reducing dropped packet size by 2.
> In the case where the new mtu was smaller than 68 or larger than
> the iph->tot_len value, it would do several things:
Larger OR EQUAL (our case) than the iph->tot_len.
>
> 1) First it would check for a BSD 4.2 anomaly and subtract old_mtu
> by the IP header length.
This only happens when next hop MTU is zero. I don't care about this
case. It is handled at protocol level as the commit that removed that
function says.
>
> 2) Second, it would try to guess the intended MTU using the
> mtu_plateau table.
>
Old kernel work because they reduced the current MTU by a small amount
using the mtu_plateau.
Not because we have some trouble with headers and we need the common 1492 value.
> I don't see any code where a subtraction by a fixed constant of 2
> occurred.
>
This is the simplest solution I came about. Because packets of size
old_mtu get dropped and suggested next hop MTU is equal or larger, we
just ignore wrong router information and try to get the correct MTU by
reducing current MTU by just a bit. I discarded subtracting by 1 to
avoid odd numbers.
For our case:
- Old kernel sets MTU to 1492 from previous 1500 bytes and worked.
- Newer kernel enters infinite loop sending dropped size packets.
- Patched kernel sets MTU to 1498 from 1500 and go trough this abnormal router.
- Microsoft Windows also recovers from those strange ICMP messages.
By the way, setting MTU to 1499 also failed to communicate for us.
> Nor can I figure out what that might accomplish. If you really
> want to do this, you have to docuement what this 2 means, what
> it is accomplishing, and why you have choosen to accomplish it
> this way.
>
> Thanks.
I just wanted to restore the "new_mtu >= old_mtu" check and instead of
reducing the current packet size using a plateau function, decrement
current size by 2 (to avoid odd numbers). This solution solved our
problem setting MTU to 1498 and I think this must converge to any good
MTU value because it reduce packet size until we don't receive any
more ICMP fragmentation needed messages.
We can criticize that this solution could be slow to reach small MTU values.
But I suspect that those abnormal ICMP messages was caused by a router
bug and must be very close to a size that get through them. If
suggested MTU is smaller than current packet size, we set the MTU to
next hop MTU. If those packets that are equal in size to the suggested
one still get dropped, then this code try to overcome that. It had the
strength of being really simple and it must be ideally never be
executed (we are still trying to fix the abnormal router, but we don't
manage it). I suspect that for those weird case, a working size must
be very close to the suggested and it is far better than current
kernel that just take the value in next hop MTU without any check and
enter in a infinite loop sending dropped size packets. TCP connections
stall and finally time out.
I suspect that a router bug get size comparison wrong and the real
path MTU is 1500 (as it is in the other direction). We need to get
around this and currently we just lowered interface MTU at startup for
those devices.
Where do I need to put more explanation?
- Code comment
- Commit message
Attached capture file that show the problem: Abnormal ICMP messages
that drops and suggest 1500 bytes.
This is the long explanation that I emailed earlier:
"
In a large corporate network, we spotted this weird ICMP message after
a long troubleshooting. See attached capture file. Those ICMP "network
unreachable - fragmentation needed and don't fragment bit set"
messages are sent by a router that drop 1500 bytes IP packets and fill
the next hop MTU ICMP field with 1500.
Those messages cause the TCP connection to stall but only on newer
kernels. Older kernels set path MTU to 1492 and communicates
successfully.
After checking code and commit history, I spotted how commit
46517008e116 ("ipv4: Kill ip_rt_frag_needed().") from June 2012
changed ICMP messages handling by removing ip_rt_frag_needed function.
The relevant part of the ip_rt_frag_needed function that was removed is:
if (new_mtu < 68 || new_mtu >= old_mtu) {
/* BSD 4.2 derived systems incorrectly adjust
* tot_len by the IP header length, and report
* a zero MTU in the ICMP message.
*/
if (mtu == 0 &&
old_mtu >= 68 + (iph->ihl << 2))
old_mtu -= iph->ihl << 2;
mtu = guess_mtu(old_mtu);
}
This condition handled the cases when next hop MTU where zero (less
than 68). Now this is handled by the protocol and fixed by commit
68b7107b6298 "ipv4: icmp: Fix pMTU handling for rare case".
But the rarest case when (next hop MTU) new_mtu >= old_mtu (dropped
packet length) was also removed. This commit restores this check.
Instead of using a table lookup like function guess_mtu uses, it just
try to set the path MTU decrementing by 2 bytes the dropped packet
size.
In our case, setting the path MTU to just 1498 (one iteration) worked.
This solution should converge in any case to a good value by small
steps. I don't think there's a need to a more complex solution.
The patched kernel worked perfectly setting the path MTU to 1498 from
the initial default interface value of 1500. This time I don't have a
capture file from inside the affected center, but all received packed
had a maximum size of 1498.
"
--
thanks
vicente
Download attachment "ICMP dropping and suggesting 1500.pcapng" of type "application/x-pcapng" (89664 bytes)
Powered by blists - more mailing lists