netdev - Re: [PATCH net-next 4/4] net: dsa: don't set skb->offload_fwd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210215171947.lvn44gb2rkeozlv7@skbuf>
Date:   Mon, 15 Feb 2021 19:19:47 +0200
From:   Vladimir Oltean <olteanv@...il.com>
To:     George McCollister <george.mccollister@...il.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        Florian Fainelli <f.fainelli@...il.com>,
        Andrew Lunn <andrew@...n.ch>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Woojung Huh <woojung.huh@...rochip.com>,
        UNGLinuxDriver@...rochip.com, Sean Wang <sean.wang@...iatek.com>,
        Landen Chao <Landen.Chao@...iatek.com>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Alexandre Belloni <alexandre.belloni@...tlin.com>,
        Tobias Waldekranz <tobias@...dekranz.com>,
        DENG Qingfang <dqfext@...il.com>,
        Linus Walleij <linus.walleij@...aro.org>,
        Hauke Mehrtens <hauke@...ke-m.de>,
        Rasmus Villemoes <rasmus.villemoes@...vas.dk>,
        Oleksij Rempel <linux@...pel-privat.de>
Subject: Re: [PATCH net-next 4/4] net: dsa: don't set skb->offload_fwd_mark
 when not offloading the bridge

Hi George,

On Mon, Feb 15, 2021 at 09:48:38AM -0600, George McCollister wrote:
> On Sun, Feb 14, 2021 at 9:54 AM Vladimir Oltean <olteanv@...il.com> wrote:
> [snip]
> > diff --git a/net/dsa/tag_xrs700x.c b/net/dsa/tag_xrs700x.c
> > index 858cdf9d2913..215ecceea89e 100644
> > --- a/net/dsa/tag_xrs700x.c
> > +++ b/net/dsa/tag_xrs700x.c
> > @@ -45,8 +45,7 @@ static struct sk_buff *xrs700x_rcv(struct sk_buff *skb, struct net_device *dev,
> >         if (pskb_trim_rcsum(skb, skb->len - 1))
> >                 return NULL;
> >
> > -       /* Frame is forwarded by hardware, don't forward in software. */
> > -       skb->offload_fwd_mark = 1;
> > +       dsa_default_offload_fwd_mark(skb);
>
> Does it make sense that the following would have worked prior to this
> change? Is this only an issue for bridging between DSA ports when
> offloading is supported? lan0 is a port an an xrs700x switch:
>
> ip link set eth0 up
> ip link del veth0
> ip link add veth0 type veth peer name veth1
>
> for eth in veth0 veth1 lan1; do
>     ip link set ${eth} up
> done
> ip link add br0 type bridge
> ip link set veth1 master br0
> ip link set lan1 master br0
> ip link set br0 up
>
> ip addr add 192.168.2.1/24 dev veth0
>
> # ping host connected to physical LAN that lan0 is on
> ping 192.168.2.249 (works!)
>
> I was trying to come up with a way to test this change and expected
> this would fail (and your patch) would fix it based on what you're
> described.

No, the configuration you've shown should be supported and functional
already (as you've noticed, in fact). I call it 'bridging with a
foreign interface', where a foreign interface is a bridge port that has
a different switchdev mark compared to the DSA switch. A switchdev mark
is a number assigned to every bridge port by nbp_switchdev_mark_set,
based on the "physical switch id"*.

There is a simple rule with switchdev: on reception of an skb, the bridge
checks if it was marked as 'already forwarded in hardware' (checks if
skb->offload_fwd_mark == 1), and if it is, it puts a mark of its own on
that skb, with the switchdev mark of the ingress port. Then during
forwarding, it enforces that the egress port must have a different switchdev
mark than the ingress one (this is done in nbp_switchdev_allowed_egress).

The veth driver does not implement any sort of method for retrieving a
physical switch id (neither devlink nor .ndo_get_port_parent_id),
therefore the bridge assigns it a switchdev mark of 0, and packets
coming from it will always have skb->offload_fwd_mark = 0. So there
aren't any restrictions.

Problems appear as soon as software bridging is attempted between two
interfaces that have the same switchdev mark. If skb->offload_fwd_mark=1,
the bridge will say that forwarding was already performed in hw, so it
will deny it in sw.
The issue is that a bond0 (or hsr0) upper of lan0 will be assigned the
same switchdev mark as lan0 itself, because the function that assigns
switchdev marks to bridge ports, nbp_switchdev_mark_set, recurses
through that port's lower interfaces until it finds something that
implements devlink.

What I tested is actually pretty laughable and a far cry from a
real-life scenario: I commented out the .port_bridge_join and
.port_bridge_leave methods of a driver and made sure that forwarding
between ports still works regardless of what uppers they have
(even that used not to). But this bypasses the switchdev mark checks in
nbp_switchdev_allowed_egress because the skb->offload_fwd_mark=0 now.
This is an important prerequisite for seamless operation, true, but it
isn't quite what we want. For one thing, we may have a topology like
this:

         +-- br0 -+
        /   / |    \
       /   /  |     \
      /   /   |      \
     /   /    |       \
    /   /     |        \
   /    |     |       bond0
  /     |     |      /    \
 swp0  swp1  swp2  swp3  swp4

where it is desirable that the presence of swp3 and swp4 under a
non-offloaded LAG does not preclude us from doing hardware bridging
beteen swp0, swp1 and swp2.

But this creates an impossible paradox if we continue in the way that I
started in this patch.

When the CPU receives a packet from swp0 (say, due to flooding), the
tagger must set skb->offload_fwd_mark to something.

If we set it to 0, then the bridge will forward it towards swp1, swp2
and bond0. But the switch has already forwarded it towards swp1 and
swp2 (not to bond0, remember, that isn't offloaded, so as far as the
switch is concerned, ports swp3 and swp4 are not looking up the FDB, and
the entire bond0 is a destination that is strictly behind the CPU). But
we don't want duplicated traffic towards swp1 and swp2, so it's not ok
to set skb->offload_fwd_mark = 0.

If we set it to 1, then the bridge will not forward the skb towards the
ports with the same switchdev mark, i.e. not to swp1, swp2 and bond0.
Towards swp1 and swp2 that's ok, but towards bond0? It should have
forwarded the skb there.

An actual solution to this problem, which has nothing to do with my
series, is to give the bridge more hints as to what switchdev mark it
should use for each port.

Currently, the bridging offload is very 'silent': a driver registers a
netdevice notifier, which is put on the netns's notifier chain, and
which sniffs around for NETDEV_CHANGEUPPER events where the upper is a
bridge, and the lower is an interface it knows about (one registered by
this driver, normally). Then, from within that notifier, it does a bunch
of stuff behind the bridge's back, without the bridge necessarily
knowing that there's somebody offloading that port. It looks like this:

     ip link set lan0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v
        call_netdevice_notifiers
                  |
                  v
       dsa_slave_netdevice_event
                  |
                  v
        oh, hey! it's for me!
                  |
                  v
           .port_bridge_join

What we should probably do to solve the conundrum is to be less silent,
and emit a notification back. Something like this:

     ip link set lan0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v                    bridge: Aye! I'll use this
        call_netdevice_notifiers           ^  switch_id as the
                  |                        |  switchdev mark for
                  v                        |  this port, and zero
       dsa_slave_netdevice_event           |  if I got nothing.
                  |                        |
                  v                        |
        oh, hey! it's for me!              |
                  |                        |
                  v                        |
           .port_bridge_join               |
                  |                        |
                  +------------------------+
      call_switchdev_notifiers(lan0, SWITCHDEV_BRPORT_OFFLOADED, switch_id)

Then stacked interfaces (like bond0 on top of swp3/swp4) would be
treated differently in DSA, depending on whether we can or cannot
offload them.

The offload case:

    ip link set bond0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v                    bridge: Aye! I'll use this
        call_netdevice_notifiers           ^  switch_id as the
                  |                        |  switchdev mark for
                  v                        |        bond0.
       dsa_slave_netdevice_event           | Coincidentally (or not),
                  |                        | bond0 and swp0, swp1, swp2
                  v                        | all have the same switchdev
        hmm, it's not quite for me,        | mark now, since the ASIC
         but my driver has already         | is able to forward towards
           called .port_lag_join           | all these ports in hw.
          for it, because I have           |
      a port with dp->lag_dev == bond0.    |
                  |                        |
                  v                        |
           .port_bridge_join               |
           for swp3 and swp4               |
                  |                        |
                  +------------------------+
      call_switchdev_notifiers(bond0, SWITCHDEV_BRPORT_OFFLOADED, switch_id)

And the non-offload case:

    ip link set bond0 master br0
                  |
                  v
   bridge calls netdev_master_upper_dev_link
                  |
                  v                    bridge waiting:
        call_netdevice_notifiers           ^  huh, no SWITCHDEV_BRPORT_OFFLOADED
                  |                        |  event, okay, I'll use a switchdev
                  v                        |  mark of zero for this one.
       dsa_slave_netdevice_event           :  Then packets received on swp0 will
                  |                        :  not be forwarded towards swp1, but
                  v                        :  they will towards bond0.
         it's not for me, but
       bond0 is an upper of swp3
      and swp4, but their dp->lag_dev
       is NULL because they couldn't
            offload it.

This is what I should have really done. For some reason though, I was so
trigger-happy that I got the data path working (without the surrounding
control logic to manage the switchdev marks automatically) that I just
got carried away and sent this small patch set.

I need some time to take my mind off of this for a while, and then I'll
come with a serious proposal eventually. Sorry again for the confusion.


*This is retrieved, in DSA's case, through the "switch_id" attribute
that we populate in dsa_port_devlink_setup. DSA says that the entire
DSA switch tree dst has the same switch_id, because it assumes that any
driver capable of cross-chip bridging (aka Marvell) is able to do
hardware forwarding towards any other switch in the same "switching
fabric". So it's not really a "switch_id", but a "port parent" somehow.