netdev - Re: Bug? GRE tunnel periodically won't transmit some packets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20111110051649.505C8362D2@apps0.cs.toronto.edu>
Date:	Thu, 10 Nov 2011 00:16:49 -0500
From:	Chris Siebenmann <cks@...toronto.edu>
To:	Eric Dumazet <eric.dumazet@...il.com>
cc:	Chris Siebenmann <cks@...toronto.edu>, netdev@...r.kernel.org
Subject: Re: Bug? GRE tunnel periodically won't transmit some packets

| So it appears the drop is in gre xmit because frame is bigger than
| mtu...
| 
| Maybe you receive some strange ICMP (ICMP_FRAG_NEEDED) from a buggy
| host ?
| 
| You could catch it with "tcpdump -s 1000 -i any icmp" maybe...

 The problem went away for several days and then came roaring back
just now. I couldn't see any outside ICMP messages like this while
the problem was happening. I did see ICMP messages, but they were
locally generated:

	IP 128.100.3.52 > 128.100.3.52: ICMP 128.100.3.51 unreachable - need to frag (mtu 478), length 556

(I verified that these were for the things that were stalling with
'tcpdump -vv -ee'.)

At the time that this was happening, I could see a lot of 'ip route show
table cache' entries like this:

	128.100.3.58 from 66.96.18.208 dev ppp0  src 66.96.18.208
	    cache  expires 286sec ipid 0xdeda mtu 552 rtt 44ms rttvar 30ms ssthresh 7 cwnd 9 iif lo

There were a bunch of other 'mtu 552' routes. Flushing the routing cache
(with 'ip route flush cache; ip route show table cache' to verify that
it had flushed) did not change the situation; the problem continued and
the 'mtu 552' routes came back as fast as I could check (it appeared to
be the moment that the routing cache was repopulated, there they were).

 In general the route cache appears to have strangely low MTUs listed
for the remote end of the tunnel (at least after the problem's
happened), eg:

	128.100.3.58 from 66.96.18.208 dev ppp0 
	    cache  expires 203sec ipid 0xdeda mtu 934 rtt 44ms rttvar 30ms ssthresh 7 cwnd 9
	128.100.3.58 from 66.96.18.208 dev ppp0  src 66.96.18.208 
	    cache  expires 203sec ipid 0xdeda mtu 934 rtt 44ms rttvar 30ms ssthresh 7 cwnd 9 iif lo

I would normally expect this to be 1492, the PPP link MTU.

 This does not seem to happen on the Fedora 14 kernel (the one where
the problem doesn't happen). There the route is listed as:

	128.100.3.58 from 66.96.18.208 dev ppp0 
	    cache  mtu 1492 advmss 1452 hoplimit 64
	128.100.3.58 from 66.96.18.208 dev ppp0 
	    cache  mtu 1492 advmss 1452 hoplimit 64

(to quote what I *think* are the relevant entries.)

 Since things may be pointing towards routing cache oddities, I should
mention that I have a somewhat peculiar policy based routing setup
involving this GRE tunnel. While it's been working fine for literally
years, it's possible that recent changes in the area changed something
so that peculiar things now happen; if you think it might be relevant, I
can provide a dump of the routing rules and explain the setup.

(Part of why this occurs to me now is that I know it's possible to have
a connection to 128.100.3.58 (the remote end of the tunnel) that runs
through the tunnel itself, and I've seen such routes appear in the 'ip
route show table cache' output.)

		- cks
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html