Message-ID: <225a9a92c82a4654b10cb8db68abfc3a@AcuMS.aculab.com>
Date: Wed, 19 Jul 2023 12:30:24 +0000
From: David Laight <David.Laight@...LAB.COM>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski
<kuba@...nel.org>
Subject: Unexpected ICMP errors for UDP packets to 127.0.0.1
We are seeing an application (running on AWS) failing because it
is receiving an ICMP error indication for a UDP packet to 127.0.0.1.
It is very rare - 3 instances since the end of January 2023.
No errors were reported in the previous 6 months with much the
same workload (or for several years without ipsec).
So we suspect a kernel change between (about) 5.10.147 and 5.10.162
(the last fail was with 5.10.184).
The loopback UDP sockets are created at startup and are never closed.
The sender is an IPv6 socket bound to ::, the receiver a connected
IPv4 socket.
(All the traffic is actually IPv4.)
The sender does a recvmsg(... MSG_ERRQUEUE) and gets a
SO_EE_ORIGIN_ICMP indication (we don't know which type!).
AFAICT this is only generated for a received ICMP message
(i.e. nothing in the transmit path can generate it).
The receiving socket is still there; it later reports ECONNREFUSED
as a consequence of the sender closing its socket.
We think the trigger is changes to the ipsec config (changes the
xfrm tables) for some tunnels on eth0.
Somewhere this must be causing a transient error in the routing.
There are 10-20 ipsec connections with a lifetime of hours.
There are 100s of other UDP sockets and lots of 'host unreachable'
indications being sent.
So either the kernel decides it cannot deliver a packet to 127.0.0.1
received from lo0 or the udp socket lookup fails.
Could the latter happen (somehow) if the 'dst' address on the
socket is different to the one in the skb?
Perhaps due to the effects of rcu updates?
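One way to tell these two cases apart without a test kernel might be to
sample the Udp NoPorts counter in /proc/net/snmp around the time of a failure:
as far as I can tell the IPv4 UDP receive path bumps it when the socket lookup
fails, just before sending the port unreachable. The sketch below is only an
illustration; snmp_counter() and udp_noports() are hypothetical helper names,
and on a box with this much UDP traffic the counter may be too noisy to be
conclusive on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* /proc/net/snmp has pairs of lines per protocol: a header line with the
 * counter names and a value line with the numbers in the same column order.
 * Find `name` in the header for `proto` and return the matching value.
 * Returns 0 on success, -1 if the protocol or counter is not present. */
int snmp_counter(const char *text, const char *proto, const char *name,
                 unsigned long long *val)
{
    assert(text && proto && name && val);
    size_t plen = strlen(proto);
    size_t nlen = strlen(name);
    const char *hdr = text;

    while (*hdr) {
        const char *hdr_end = strchr(hdr, '\n');
        if (!hdr_end)
            break;
        const char *vals = hdr_end + 1;

        /* Both lines of the pair must start with "<proto>:". */
        if (strncmp(hdr, proto, plen) == 0 && hdr[plen] == ':' &&
            strncmp(vals, proto, plen) == 0 && vals[plen] == ':') {
            const char *h = hdr + plen + 1;
            const char *v = vals + plen + 1;
            for (;;) {
                while (*h == ' ') h++;
                while (*v == ' ') v++;
                if (*h == '\n' || *h == '\0' || *v == '\n' || *v == '\0')
                    break;
                size_t hl = strcspn(h, " \n");
                if (hl == nlen && strncmp(h, name, hl) == 0) {
                    *val = strtoull(v, NULL, 10);
                    return 0;
                }
                h += hl;                   /* next header token */
                v += strcspn(v, " \n");    /* next value token */
            }
            return -1;
        }
        hdr = hdr_end + 1;
    }
    return -1;
}

/* Read the current Udp NoPorts value. Returns 0 on success, -1 on failure. */
int udp_noports(unsigned long long *val)
{
    char buf[16384];
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return snmp_counter(buf, "Udp", "NoPorts", val);
}
```

Sampling this before and after an ipsec config change, alongside the ICMP
counters in the same file, could show whether a lookup failure lines up with
the xfrm update.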
Early dmux is enabled, and there were some associated changes
prior to 5.10.162 - but they don't look problematic.
There are also some additional checks in the fib lookup code
and in the xfrm code; I don't know that code at all.
But AFAICT the xfrm code causes silent discards - not icmp.
Any ideas as to what to look for?
This is a live system on AWS - we can't use a test kernel.
There is also a lot of UDP traffic (it is processing RTP audio).
So options are rather limited.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)