[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADvbK_eZp5ikahxH4wvPm5_PuK1khvVKpGnY5LUd9nwHgS96Cw@mail.gmail.com>
Date: Mon, 17 Feb 2025 17:31:16 -0500
From: Xin Long <lucien.xin@...il.com>
To: Ido Schimmel <idosch@...sch.org>
Cc: Stephen Hemminger <stephen@...workplumber.org>, edumazet@...gle.com, netdev@...r.kernel.org,
fmei@....com, Wei Wang <weiwan@...gle.com>
Subject: Re: Fw: [Bug 219766] New: Garbage Ethernet Frames
On Sat, Feb 15, 2025 at 3:47 PM Ido Schimmel <idosch@...sch.org> wrote:
>
> On Mon, Feb 10, 2025 at 08:49:31AM -0800, Stephen Hemminger wrote:
> > Not really enough information to do any deep analysis but forwarding to netdev
> > anyway as it is not junk.
> >
> > Begin forwarded message:
> >
> > Date: Sun, 09 Feb 2025 12:24:32 +0000
> > From: bugzilla-daemon@...nel.org
> > To: stephen@...workplumber.org
> > Subject: [Bug 219766] New: Garbage Ethernet Frames
> >
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=219766
> >
> > Bug ID: 219766
> > Summary: Garbage Ethernet Frames
> > Product: Networking
> > Version: 2.5
> > Hardware: All
> > OS: Linux
> > Status: NEW
> > Severity: normal
> > Priority: P3
> > Component: Other
> > Assignee: stephen@...workplumber.org
> > Reporter: fmei@....com
> > Regression: No
> >
> > I am currently troubleshooting a very strange problem which appears when
> > upgrading Kernel 6.6.58 to 6.6.60. The kernel version change is part of a
> > change of talos linux (www.talos.dev) from 1.8.2 to 1.8.3.
> >
> > We are running this machines at hetzner - a company which is providing server
> > hosting. they complain that we are using mac addresses which are not allowed
> > (are not the mac addresses of the physical nic)
> >
> > In the investigation of the problem I did tcpdumps on the physical adapters and
> > captured this suspicious ethernet frames. The frames do neither have a known
> > ethertype, nor do they have a mac address of a known vendor or a known virtual
> > mac address range. They seem garbage to me. Below an example. More can be found
> > in the github issue. This frames are not emitted very often and the systems are
> > operating normally. If I would not be informed by the hosting provider I would
> > not have noticed it at all.
> >
> > I also tried to track it down to a specific hardware (r8169), but we have the
> > same problem with e1000e.
> >
> > I checked the changelogs of the two kernel versions (6.6.59 & 6.6.60) and
> > noticed there were some changes which could be the problem, but I simply do not
> > have the experience for it.
> >
> > Can anybody check the changelog of the 2 versions and see if there is a change
> > which might cause the problem? Can anybody give me a hint how to track it down
> > further?
> >
> > tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
> > listening on enp9s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > 22:07:02.329668 20:00:40:11:18:fb > 45:00:00:44:f4:94, ethertype Unknown
> > (0x58c6), length 68:
> > 0x0000: 8dda 74ca f1ae ca6c ca6c 0098 969c 0400 ..t....l.l......
> > 0x0010: 0000 4730 3f18 6800 0000 0000 0000 9971 ..G0?.h........q
> > 0x0020: c4c9 9055 a157 0a70 9ead bf83 38ca ab38 ...U.W.p....8..8
> > 0x0030: 8add ab96 e052 .....R
> >
> >
> > Issue with more information: https://github.com/siderolabs/talos/issues/9837
>
> Adding Xin and Eric as I believe this is caused by commit 22600596b675
> ("ipv4: give an IPv4 dev to blackhole_netdev"). It's present in v6.6.59,
> but not in v6.6.58 which fits the report.
>
> The Ethernet headers of all the captured packets start with 0x45, so
> it's quite apparent that these are IPv4 packets that are transmitted
> without an Ethernet header. Specifically, these seem to be UDP packets
> (10th byte is always 0x11).
>
> After 22600596b675 neighbours can be constructed on the blackhole device
> and they are constructed with an output function (neigh_direct_output())
> that simply calls dev_queue_xmit(). The latter will transmit packets via
> skb->dev which might not be the blackhole device if dst_dev_put()
> switched dst->dev to the blackhole device when another CPU was using
> this dst entry in ip_output().
>
> I verified this using these hackish scripts [1][2]. They create a
> nexthop exception where a dst entry is cached. Two UDP applications use
> this dst entry, but a background process continuously bumps the IPv4
> generation ID so as to force these applications to perform a route
> lookup instead of using the dst entry that they previously resolved.
>
> After creating a new dst entry, one application will try to cache it in
> the nexthop exception by calling rt_bind_exception() which will call
> dst_dev_put() on the previously cached dst entry that might still be
> used by the other application on a different CPU.
>
> There's a counter at the end of the test that counts the number of
> packets that were transmitted without an Ethernet header. The problem
> does not reproduce with 22600596b675 reverted.
>
> One possible solution is to add the dst entry to the uncached list by
> converting dst_dev_put() to rt_add_uncached_list() [3]. The dst entry
> will be eventually removed from this list when the dst entry is
> destroyed or when dst->dev is unregistered. However, I am not sure it
> will solve this report as rt_bind_exception() is not the only caller of
> dst_dev_put().
With dst_dev_put(), the next dst_ops->check() call will fail. Sending a
packet on a socket typically performs routing only if sk_dst_check() fails.
I'm concerned that using rt_add_uncached_list() might delay other sockets
from acquiring the new dst as quickly as before.
(Adding Wei Wang, who introduced dst_dev_put())
>
> Another possible solution is to have the blackhole device consume the
> packets instead of letting them go out without an Ethernet header [4].
> Doesn't seem great as the packets disappear without telling anyone
> (before 22600596b675 an error was returned).
This looks fine to me. The fix in commit 22600596b675 was specifically
intended to prevent an error from being returned in these cases, as it
would break userspace UDP applications.
If you prefer to avoid silent drops, you could add a warning like:
net_warn_ratelimited("%s(): Dropping skb.\n", __func__);
similar to how blackhole_netdev_xmit() handles it.
Thanks.
>
> Let me know what you think. Especially if you have a better fix.
>
> Thanks
>
> [1]
> #!/bin/bash
> # blackhole.sh
>
> for ns in client router server; do
> ip netns add $ns
> ip -n $ns link set dev lo up
> done
>
> ip -n client link add name veth0 type veth peer name veth1 netns router
> ip -n router link add name veth2 type veth peer name veth3 netns server
>
> # Client
> ip -n client link set dev veth0 up
> ip -n client address add 192.0.2.1/28 dev veth0
> ip -n client route add default via 192.0.2.2
> tc -n client qdisc add dev veth0 clsact
> tc -n client filter add dev veth0 egress pref 1 proto all \
> flower dst_mac 45:00:00:00:00:00/ff:ff:00:00:00:00 action pass
>
> # Router
> ip netns exec router sysctl -wq net.ipv4.conf.all.forwarding=1
> ip -n router link set dev veth1 up
> ip -n router address add 192.0.2.2/28 dev veth1
> ip -n router link set dev veth2 up mtu 1300
> ip -n router address add 192.0.2.17/28 dev veth2
>
> # Server
> ip -n server link set dev veth3 up mtu 1300
> ip -n server address add 192.0.2.18/28 dev veth3
> ip -n server route add default via 192.0.2.17
>
> sleep 1
>
> # Create a nexthop exception where the dst entry will be cached.
> ip netns exec client ping 192.0.2.18 -c 1 -M want -s 1400 -q &> /dev/null
>
> # Continuously invalidate cached dst entries by bumping the IPv4 generation ID.
> # When replacing the cached dst entry in the nexthop exception,
> # rt_bind_exception() will call dst_dev_put(), thereby setting the dst entry's
> # device to the blackhole device.
> ./bump_genid.sh &
>
> ip netns exec server iperf3 -s -p 6000 -D
> ip netns exec server iperf3 -s -p 7000 -D
> ip netns exec client iperf3 -c 192.0.2.18 -p 6000 -u -b 0 -t 60 &> /dev/null &
> ip netns exec client iperf3 -c 192.0.2.18 -p 7000 -u -b 0 -t 60 &> /dev/null
>
> tc -n client -s filter show dev veth0 egress
>
> ip netns pids server | xargs kill
> pkill bump_genid.sh
> ip -all netns delete
>
> [2]
> #!/bin/bash
> # bump_genid.sh
>
> while true; do
> ip -n client route add 198.51.100.1/32 dev lo
> ip -n client route del 198.51.100.1/32 dev lo
> done
>
> [3]
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 753704f75b2c..f40f860c0bec 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1461,7 +1461,7 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
> dst_hold(&rt->dst);
> rcu_assign_pointer(*porig, rt);
> if (orig) {
> - dst_dev_put(&orig->dst);
> + rt_add_uncached_list(orig);
> dst_release(&orig->dst);
> }
> ret = true;
>
> [4]
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index c8840c3b9a1b..448f352c3c92 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -244,8 +244,22 @@ static netdev_tx_t blackhole_netdev_xmit(struct sk_buff *skb,
> return NETDEV_TX_OK;
> }
>
> +static int blackhole_neigh_output(struct neighbour *n, struct sk_buff * skb)
> +{
> + kfree_skb(skb);
> + return 0;
> +}
> +
> +static int blackhole_neigh_construct(struct net_device *dev,
> + struct neighbour *n)
> +{
> + n->output = blackhole_neigh_output;
> + return 0;
> +}
> +
> static const struct net_device_ops blackhole_netdev_ops = {
> .ndo_start_xmit = blackhole_netdev_xmit,
> + .ndo_neigh_construct = blackhole_neigh_construct,
> };
>
> /* This is a dst-dummy device used specifically for invalidated
Powered by blists - more mailing lists