[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210719092348.wzpnhcrib24m7zpw@skbuf>
Date: Mon, 19 Jul 2021 09:23:49 +0000
From: Ioana Ciornei <ioana.ciornei@....com>
To: Vladimir Oltean <vladimir.oltean@....com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Andrew Lunn <andrew@...n.ch>,
Florian Fainelli <f.fainelli@...il.com>,
Vivien Didelot <vivien.didelot@...il.com>,
Jiri Pirko <jiri@...nulli.us>,
Ido Schimmel <idosch@...sch.org>,
Tobias Waldekranz <tobias@...dekranz.com>,
Roopa Prabhu <roopa@...dia.com>,
Nikolay Aleksandrov <nikolay@...dia.com>,
Stephen Hemminger <stephen@...workplumber.org>,
"bridge@...ts.linux-foundation.org"
<bridge@...ts.linux-foundation.org>,
Grygorii Strashko <grygorii.strashko@...com>,
Marek Behun <kabel@...ckhole.sk>,
DENG Qingfang <dqfext@...il.com>,
Vadym Kochan <vkochan@...vell.com>,
Taras Chornyi <tchornyi@...vell.com>,
Lars Povlsen <lars.povlsen@...rochip.com>,
Steen Hegelund <Steen.Hegelund@...rochip.com>,
"UNGLinuxDriver@...rochip.com" <UNGLinuxDriver@...rochip.com>,
Claudiu Manoil <claudiu.manoil@....com>,
Alexandre Belloni <alexandre.belloni@...tlin.com>
Subject: Re: [PATCH v4 net-next 09/15] net: bridge: switchdev: let drivers
inform which bridge ports are offloaded
On Mon, Jul 19, 2021 at 12:44:28AM +0300, Vladimir Oltean wrote:
> On reception of an skb, the bridge checks if it was marked as 'already
> forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
> is, it assigns the source hardware domain of that skb based on the
> hardware domain of the ingress port. Then during forwarding, it enforces
> that the egress port must have a different hardware domain than the
> ingress one (this is done in nbp_switchdev_allowed_egress).
>
> Non-switchdev drivers don't report any physical switch id (neither
> through devlink nor .ndo_get_port_parent_id), therefore the bridge
> assigns them a hardware domain of 0, and packets coming from them will
> always have skb->offload_fwd_mark = 0. So there aren't any restrictions.
>
> Problems appear due to the fact that DSA would like to perform software
> fallback for bonding and team interfaces that the physical switch cannot
> offload.
>
> +-- br0 ---+
> / / | \
> / / | \
> / | | bond0
> / | | / \
> swp0 swp1 swp2 swp3 swp4
>
> There, it is desirable that the presence of swp3 and swp4 under a
> non-offloaded LAG does not preclude us from doing hardware bridging
> beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high
> enough that software bridging between {swp0,swp1,swp2} and bond0 is not
> impractical.
>
> But this creates an impossible paradox given the current way in which
> port hardware domains are assigned. When the driver receives a packet
> from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to
> something.
>
> - If we set it to 0, then the bridge will forward it towards swp1, swp2
> and bond0. But the switch has already forwarded it towards swp1 and
> swp2 (not to bond0, remember, that isn't offloaded, so as far as the
> switch is concerned, ports swp3 and swp4 are not looking up the FDB,
> and the entire bond0 is a destination that is strictly behind the
> CPU). But we don't want duplicated traffic towards swp1 and swp2, so
> it's not ok to set skb->offload_fwd_mark = 0.
>
> - If we set it to 1, then the bridge will not forward the skb towards
> the ports with the same switchdev mark, i.e. not to swp1, swp2 and
> bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should
> have forwarded the skb there.
>
> So the real issue is that bond0 will be assigned the same hardware
> domain as {swp0,swp1,swp2}, because the function that assigns hardware
> domains to bridge ports, nbp_switchdev_add(), recurses through bond0's
> lower interfaces until it finds something that implements devlink (calls
> dev_get_port_parent_id with bool recurse = true). This is a problem
> because the fact that bond0 can be offloaded by swp3 and swp4 in our
> example is merely an assumption.
>
> A solution is to give the bridge explicit hints as to what hardware
> domain it should use for each port.
>
> Currently, the bridging offload is very 'silent': a driver registers a
> netdevice notifier, which is put on the netns's notifier chain, and
> which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
> bridge, and the lower is an interface it knows about (one registered by
> this driver, normally). Then, from within that notifier, it does a bunch
> of stuff behind the bridge's back, without the bridge necessarily
> knowing that there's somebody offloading that port. It looks like this:
>
> ip link set swp0 master br0
> |
> v
> br_add_if() calls netdev_master_upper_dev_link()
> |
> v
> call_netdevice_notifiers
> |
> v
> dsa_slave_netdevice_event
> |
> v
> oh, hey! it's for me!
> |
> v
> .port_bridge_join
>
> What we do to solve the conundrum is to be less silent, and change the
> switchdev drivers to present themselves to the bridge. Something like this:
>
> ip link set swp0 master br0
> |
> v
> br_add_if() calls netdev_master_upper_dev_link()
> |
> v bridge: Aye! I'll use this
> call_netdevice_notifiers ^ ppid as the
> | | hardware domain for
> v | this port, and zero
> dsa_slave_netdevice_event | if I got nothing.
> | |
> v |
> oh, hey! it's for me! |
> | |
> v |
> .port_bridge_join |
> | |
> +------------------------+
> switchdev_bridge_port_offload(swp0, swp0)
>
> Then stacked interfaces (like bond0 on top of swp3/swp4) would be
> treated differently in DSA, depending on whether we can or cannot
> offload them.
>
> The offload case:
>
> ip link set bond0 master br0
> |
> v
> br_add_if() calls netdev_master_upper_dev_link()
> |
> v bridge: Aye! I'll use this
> call_netdevice_notifiers ^ ppid as the
> | | switchdev mark for
> v | bond0.
> dsa_slave_netdevice_event | Coincidentally (or not),
> | | bond0 and swp0, swp1, swp2
> v | all have the same switchdev
> hmm, it's not quite for me, | mark now, since the ASIC
> but my driver has already | is able to forward towards
> called .port_lag_join | all these ports in hw.
> for it, because I have |
> a port with dp->lag_dev == bond0. |
> | |
> v |
> .port_bridge_join |
> for swp3 and swp4 |
> | |
> +------------------------+
> switchdev_bridge_port_offload(bond0, swp3)
> switchdev_bridge_port_offload(bond0, swp4)
>
> And the non-offload case:
>
> ip link set bond0 master br0
> |
> v
> br_add_if() calls netdev_master_upper_dev_link()
> |
> v bridge waiting:
> call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload
> | | wasn't called, okay, I'll use a
> v | hwdom of zero for this one.
> dsa_slave_netdevice_event : Then packets received on swp0 will
> | : not be software-forwarded towards
> v : swp1, but they will towards bond0.
> it's not for me, but
> bond0 is an upper of swp3
> and swp4, but their dp->lag_dev
> is NULL because they couldn't
> offload it.
>
> Basically we can draw the conclusion that the lowers of a bridge port
> can come and go, so depending on the configuration of lowers for a
> bridge port, it can dynamically toggle between offloaded and unoffloaded.
> Therefore, we need an equivalent switchdev_bridge_port_unoffload too.
>
> This patch changes the way any switchdev driver interacts with the
> bridge. From now on, everybody needs to call switchdev_bridge_port_offload
> and switchdev_bridge_port_unoffload, otherwise the bridge will treat the
> port as non-offloaded and allow software flooding to other ports from
> the same ASIC.
>
> Note that these functions lay the ground for a more complex handshake
> between switchdev drivers and the bridge in the future. During the
> info->linking == false path, switchdev_bridge_port_unoffload() is
> strategically put in the NETDEV_PRECHANGEUPPER notifier as opposed to
> NETDEV_CHANGEUPPER. The reason for this has to do with a future
> migration of the switchdev object replay helpers (br_*_replay) from a
> pull mode (completely initiated by the driver) to a semi-push mode (the
> bridge initiates the replay when the switchdev driver declares that it
> offloads a port). On deletion, the switchdev object replay helpers need
> the netdev adjacency lists to be valid, and that is only true in
> NETDEV_PRECHANGEUPPER. So we need to add trivial glue code to all
> drivers to handle a "pre bridge leave" event, and that is where we hook
> the switchdev_bridge_port_unoffload() call.
>
> Cc: Vadym Kochan <vkochan@...vell.com>
> Cc: Taras Chornyi <tchornyi@...vell.com>
> Cc: Ioana Ciornei <ioana.ciornei@....com>
> Cc: Lars Povlsen <lars.povlsen@...rochip.com>
> Cc: Steen Hegelund <Steen.Hegelund@...rochip.com>
> Cc: UNGLinuxDriver@...rochip.com
> Cc: Claudiu Manoil <claudiu.manoil@....com>
> Cc: Alexandre Belloni <alexandre.belloni@...tlin.com>
> Cc: Grygorii Strashko <grygorii.strashko@...com>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@....com>
Tested-by: Ioana Ciornei <ioana.ciornei@....com> # dpaa2-switch: regression
Acked-by: Ioana Ciornei <ioana.ciornei@....com> # dpaa2-switch
Powered by blists - more mailing lists