[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20250211152353.5szhbv7kdlf6bca3@skbuf>
Date: Tue, 11 Feb 2025 17:23:53 +0200
From: Vladimir Oltean <olteanv@...il.com>
To: Ido Schimmel <idosch@...dia.com>
Cc: Eric Woudstra <ericwouds@...il.com>, Petr Machata <petrm@...dia.com>,
Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Nikolay Aleksandrov <razor@...ckwall.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, amcohen@...dia.com
Subject: Re: [RFC PATCH v1 net-next] net: mlxsw_sp: Use
switchdev_handle_port_obj_add_foreign() for vxlan
On Tue, Feb 11, 2025 at 04:31:43PM +0200, Ido Schimmel wrote:
> On Mon, Feb 10, 2025 at 05:22:46PM +0200, Vladimir Oltean wrote:
> > On Mon, Feb 10, 2025 at 08:48:32AM +0200, Ido Schimmel wrote:
> > > On Sat, Feb 08, 2025 at 03:15:18PM +0100, Eric Woudstra wrote:
> > > > Sending as RFC as I do not own this hardware. This code is not tested.
> > > >
> > > > Vladimir found this part of the spectrum switchdev, while looking at
> > > > another issue here:
> > > >
> > > > https://lore.kernel.org/all/20250207220408.zipucrmm2yafj4wu@skbuf/
> > > >
> > > > As vxlan seems a foreign port, wouldn't it be better to use
> > > > switchdev_handle_port_obj_add_foreign() ?
> > >
> > > Thanks for the patch, but the VXLAN port is not foreign to the other
> > > switch ports. That is, forwarding between these ports and VXLAN happens
> > > in hardware. And yes, switchdev_bridge_port_offload() does need to be
> > > called for the VXLAN port so that it's assigned the same hardware domain
> > > as the other ports.
> >
> > Thanks, this is useful. I'm not providing a patch yet because there are
> > still things I don't understand.
> >
> > Have you seen any of the typical problems associated with the software
> > bridge thinking vxlan isn't part of the same hwdom as the ingress
> > physical port, and, say, flooding packets twice to vxlan, when the
> > switch had already forwarded a copy of the packet? In almost 4 years
> > since commit 2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform
> > which bridge ports are offloaded"), I would have expected such issues
> > would have been noticed?
>
> I'm aware of one report from QA that is on my list. They configured a
> VXLAN tunnel that floods packets to two remote VTEPs:
>
> 00:00:00:00:00:00 dev vx100 dst 1.1.1.2
> 00:00:00:00:00:00 dev vx100 dst 1.1.1.3
>
> The underlay routes used to forward the VXLAN encapsulated traffic are:
>
> 1.1.1.2 via 10.0.0.2 dev swp13
> 1.1.1.3 via 10.0.0.6 dev swp15
>
> But they made sure not to configure 10.0.0.6 at the other end. What
> happens is that traffic for 1.1.1.2 is correctly forwarded in hardware,
> but traffic for 1.1.1.3 encounters a neighbour miss and trapped to
> the CPU which then forwards it again to 1.1.1.2.
>
> Putting the VXLAN device in the same hardware domain as the other switch
> ports should solve this "double forwarding" scenario.
Let me see if I understand what you are saying based on the code (please
bear with me, VXLAN tunnels are way outside of my area).
So in this case, the packet hits a neighbor miss in the VXLAN underlay
and reaches the CPU through the
MLXSW_SP_TRAP_EXCEPTION(UNRESOLVED_NEIGH, L3_EXCEPTIONS) trap. That is
defined to call mlxsw_sp_rx_mark_listener(), which sets
skb->offload_fwd_mark = 1 (aka packet was forwarded in L2) but not
skb->offload_l3_fwd_mark (aka packet was not forwarded in L3).
This corresponds to expected reality: neighbor miss packets should not
re-enter the bridge forwarding path in the overlay towards the VXLAN.
Yet they still do, because although skb->offload_fwd_mark is correctly
set, it only skips software forwarding towards other physical (and/or
LAG) mlxsw bridge ports.
If my understanding is correct, then yes, I agree that making the hwdomain
of the vxlan tunnel coincide with that of the other bridge ports should
suppress this packet.
> > Do we require a Fixes: tag for this?
>
> It's not strictly a regression (never worked) and it's not that critical
> IMO, so I prefer targeting net-next.
Yes, I agree. Before that commit, the bridge would look by itself at the
bridge port, recursively searching for lower interfaces until something
returned something positively in dev_get_port_parent_id(). Which was a
limiting model. If anything, solving the "double forwarding of vxlan exception
packets" issue would have required an alternative switchdev bridge port
offloading model anyway, like the explicit one that exists now, which is
more flexible.
> > And then, switchdev_bridge_port_offload() has a brport_dev argument,
> > which would pretty clearly be passed as vxlan_dev by
> > mlxsw_sp_bridge_8021d_vxlan_join() and
> > mlxsw_sp_bridge_vlan_aware_vxlan_join(), but it also has one other
> > "struct net_device *dev" argument, on which br_switchdev_port_offload()
> > wants to call dev_get_port_parent_id(), to see what hwdom (what other
> > bridge ports) to associate it to.
>
> Right.
>
> > Usually we use the mlxsw_sp_port->dev as the second argument, but which
> > port to use here? Any random port that's under the bridge, or is there a
> > specific one for the vxlan that should be used?
>
> Any random port is fine as they all share the same parent ID.
>
> BTW, I asked Amit (Cced) to look into this as there might be some issues
> with ARP suppression that will make it a bit more complicated to patch.
> Are you OK with that or do you want the authorship? We can put any tag
> you choose. I am asking so that it won't appear like I am trying to
> discredit you.
It is completely fine for Amit to take a look at this, I could really do
without yet another thing that is completely over my head :)
Just one indication from my side on how I would approach things:
Because spectrum bridge ports come and go, and the vxlan tunnel bridge
port may stay, its hwdom may change over time (depending on how many
cards there are plugged in the system). Also, it probably won't work to
combine spectrum bridge ports spanning multiple cards in the same
bridge. For those reasons, br_switchdev_port_offload() supports multiple
calls made to the same vxlan_dev, and behind the scenes, it will just
bump the net_bridge_port->offload_count.
Thinking a bit more, I think you want exactly that: a vxlan bridge port
needs to have an offload_count equal to the number of other spectrum
ports in the same bridge. This way, it only stops being offloaded when
there are no spectrum ports left (and the offload_count drops to 0).
But this needs handling from both directions: when a vxlan joins a
bridge with spectrum ports, and when a spectrum port joins a bridge with
vxlan tunnels. Plus every leave operation having to call
br_switchdev_port_unoffload(), for things to stay balanced.
Probably there are tons of other restrictions to be aware of, but I just
wanted to say this, because in the previous email we talked about "just
any random port" and it seems like it actually needs to be "all ports".
Powered by blists - more mailing lists