netdev - Multi-PHYs and multiple-ports bonding support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221017105100.0cb33490@pc-8.home>
Date:   Mon, 17 Oct 2022 10:51:00 +0200
From:   Maxime Chevallier <maxime.chevallier@...tlin.com>
To:     netdev@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        Thomas Petazzoni <thomas.petazzoni@...tlin.com>,
        Antoine Tenart <atenart@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Heiner Kallweit <hkallweit1@...il.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Andrew Lunn <andrew@...n.ch>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Tobias Waldekranz <tobias@...dekranz.com>,
        Oleksij Rempel <o.rempel@...gutronix.de>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Multi-PHYs and multiple-ports bonding support

Hello everyone,

I'm reaching out to discuss a PHY topic that we would like to see
upstreamed, to support multiple ports attached to a MAC.

The end-goal is to achieve some redundancy in case of a physical link
interruption, in a transparent manner, but using only one network
interface (1 MAC).

We've been made aware that some products in the wild propose this
feature-set, using 2 PHYs connected to the same MAC, using some custom
logic to switch back and forth between the 2 PHYs, and that's the main
use-case we'd like to see supported :

                            +-------+
                     /----- |  PHY  | --- BaseT port
 +-------+           |      +-------+
 |  MAC  |-- RGMII --|
 +-------+           |      +-------+
                     \----- |  PHY  | --- BaseT port
                            +-------+

This configuration comes with quite a lot of challenges since we bend
the existing standards in numerous ways :

 - We have 2 PHYs on the same xMII bus, and they can't be active on that
   bus at the same time. To solve that, we have 2 strategies: 
 
   - Put the PHY in isolate mode when not in use, they can perform link
     detection and reporting, but wont communicate on the MII bus.
     This can have side effects if both links are connected to the same
     network, which can be addressed through the use of gratuitous ARPs
     to make sure the right link gets known by the spanning-tree.
     
   - Put PHY down entirely when not is use, select an active PHY, and
   when the link goes down on that PHY, switch to the other. This was
   used on products that had PHYs were the isolate mode is broken.
     
Upstream, we have one device that does something a bit similar, which is
the macchiatobin, using the 88x3310 PHY. This PHY exports both an SFP
interface as long as a copper BaseT interface. These 2 interfaces are
connected to the same MAC and are mutually exclusive.

It looks like this :

 +-------+              +---------+   |---- Copper BaseT
 |  MAC  | -- xxxMII -- |   PHY   |---|
 +-------+              +---------+   |---- SFP
 
We don't have any way to control which port gets used, the first that
has the link gets the link.

Ideally we would like to be able to configure every aspects of these
2 cases, like :
 - Which link do we use
 - Do we switch automatically from one to the other
 - What are the links available

I see 4 different aspects of this that would need to be added for this
whole mechanism to work :

1) DT representation

To support that, we would need a way to give knowledge to the kernel
about the numer of physical ports that are connected to a given MAC.
In the dual-phy mode, it's pretty straightforward, since we would
"just" need to pass multiple phy handles to the mac node. In the MCBin
case, it's a bit more complex, since we don't have a clear view on the
number of ports connected to a given phy.

The assumption is that we have only one port per phy, and it's nature is
derived from the presence of an sfp=<> phandle in the DT, plus the
driver itself specifying the phydev->port field (which to my knowledge
isn't used that much ?)

The subject of describing the ports a PHY exposes in a sensible way that
doesn't require changing all DTs out-there has been discussed in the
past here :
https://lore.kernel.org/netdev/20201119152246.085514e1@bootlin.com/

If we only focus on the dual-phy use-case - and not the single-phy
dual-port - we might not have to deal with extensive DT changes at all.

2) Changes in Phylink

This might be the tricky part, as we need to track several ports,
possibly connected to different PHYs, to get their state. For now, I
haven't prototyped any of this yet.

The goal would be to allow either automatic switching, as is already
done by the 3310 driver, but at a higher level. Phylink might not be the
right place to do that, so maybe we just want to expose an API to get
the possible ports on a given interface, their repective state, and a
way to select one

My idea would be to introduce a notion of a struct phy_port, that would
describe a physical port. They would be controlled by a PHY (or a MAC,
if the mac outputs 1000BaseX for example), one phy can
possibly control multiple ports.

The whole link redundancy would then be done manipulating ports, giving
a layer of abstraction on the hardware topology itself.

We would therefore abstract the logic by having :
                         +--------+
                     /---|  Port  |
 +-------------+     |   +--------+
 |  netdevice  | ----|
 +-------------+     |
                     |   +---------+
                     \---|  Port   |
                         +---------+

This is the representation the userspace would know about, without
necessarily having to worry about the phys inbetween.

I don't see that as a breaking change, since as of today, most systems
only have one port per netdevice. We would need to add a way to deal
with multiple ports per netdevice.

3) Adding a L2 bonding driver

If the link switching logic is deported outside of phylink, we might
want a generic way of bonding ports on an interface, configuring the
policy to use for the switching (automatic, manual selection, maybe
more like trying to elect the link with the highest speed ?). This is
where we would handle sending the gratuitous ARPs upon link switching
too.

3) UAPI

From userspace, we would need ways to list the ports, their state, and
possibly to configure the bonding parameters. for now in ethtool, we
don't have the notion of port at all, we just have 1 netdevice == 1
port. Should we therefore create one netdevice per port ? or stick to
that one interface and refer to its ports with some ethtool parameters ?

All of these are open questions, as this topic spans quite a lot of
aspects in the stack. Any input, idea, comment, are very very welcome.

Thanks,

Maxime