[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240918111008.uzvzkcjg7wfj5foa@skbuf>
Date: Wed, 18 Sep 2024 14:10:08 +0300
From: Vladimir Oltean <vladimir.oltean@....com>
To: Russell King <linux@...linux.org.uk>, Andrew Lunn <andrew@...n.ch>,
Florian Fainelli <f.fainelli@...il.com>
Cc: Saravana Kannan <saravanak@...gle.com>, netdev@...r.kernel.org
Subject: Component API not right for DSA?
Hi,
DSA probing in a multi-switch tree is a bit of a hack job.
It relies on the device tree providing complete "routing" information
(the "link" property between one cascade port and the cascade ports of
all other switches reachable through it) to determine whether the tree
has completely probed or not.
A switch which probes but finds that it has a dst->rtable entry (struct
dsa_link) towards another switch which hasn't probed will just report
a successful early probe here:
static int dsa_tree_setup(struct dsa_switch_tree *dst)
{
...
complete = dsa_tree_setup_routing_table(dst);
if (!complete)
return 0;
...
}
and will mostly do nothing, except for listing its ports (dsa_port_touch())
in dst->ports.
The dst->ports puzzle becomes complete piece by piece, and when the
final piece of the puzzle probes (last switch of the tree), it sees that
all its link_dp ports are present in dst->ports, so complete=true, and
the code flow actually proceeds further in dsa_tree_setup().
The dsa_tree_setup() runs in the context of the last switch to probe,
and calls ds->ops->setup() etc for all other switches in this tree.
I don't like this for the following reasons:
1. I would like the device tree binding to make the "link" property optional
(full routing table). Only direct adjacency information is sufficient
to compute the rest, plus it is much easier on the device tree writer
(https://lore.kernel.org/netdev/20240913131507.2760966-3-vladimir.oltean@nxp.com/).
But I would have to keep this workaround functional - otherwise a
switch will not wait for the entire tree to probe, just for its
directly connected neighbors. Thus, I would need to run BFS to
construct a routing table for probe time (_before_ struct dsa_switch
and struct dsa_port are allocated for the other switches in the tree),
and then I also need the regular variant of the rtable, that
dsa_routing_port() uses. Could be done, but not great.
2. I honestly don't think that the workaround to wait until the routing
table is complete is in the best interest of DSA. The larger context
here is that one can imagine DSA trees operating in a "degraded state"
where not all switches are present. For example, if there is a chain
of 3 switches and the last switch is missing, nothing prevents the
first 2 from doing their normal job. There is actually a customer who
wants to take down a switch for regular maintainance, while keeping
the rest of the system operational. This is currently not possible
with DSA, because the tree wants all switches to be present. As soon
as one single switch unbinds, the entire tree is torn down. They are
pretty serious about this request, not just "would be nice if".
And of course, not any degraded state makes functional sense. For
example, removing the top-most switch must result in all switches
downstream of it also getting removed, because traffic from the CPU
can no longer reach them. That's fine, and I plan to handle that
using device links from one switch to its upstream.
For reason #1, I started prototyping an integration between DSA and the
component API - something which was also suggested by Saravana Kannan in
the past. The component master is the tree, and the components are the
switches. For each struct dsa_switch_tree, I instantiated a struct
platform_device and a struct component_master_ops. The tree is created
by the first switch that calls dsa_register_switch(). It has a function
that explores the device tree using BFS, starting from that first switch
that created it, "link" phandle by phandle, calling component_match_add_release()
for the OF node of each other reachable switch. Every other switch in
the tree no longer creates it, but finds it and bumps its refcount.
The tree is bound when all switches have called component_add() with
their struct component_ops.
This is all great, but then I realized that, for addressing issue #2,
it is no better than what we currently have. Namely, by default the tree
looks like this:
$ cat /sys/kernel/debug/device_component/dsa_tree.0.auto
aggregate_device name status
-------------------------------------------------------------
dsa_tree.0.auto bound
device name status
-------------------------------------------------------------
d0032004.mdio-mii:11 bound
d0032004.mdio-mii:10 bound
d0032004.mdio-mii:12 bound
but after this operation:
$ echo d0032004.mdio-mii:11 > /sys/bus/mdio_bus/devices/d0032004.mdio-mii\:11/driver/unbind
$ cat /sys/kernel/debug/device_component/dsa_tree.0.auto
aggregate_device name status
-------------------------------------------------------------
dsa_tree.0.auto not bound
device name status
-------------------------------------------------------------
(unknown) not registered
d0032004.mdio-mii:10 not bound
d0032004.mdio-mii:12 not bound
the tree (component master) is unbound, its unbind() method calls
component_unbind_all(), and this also unbinds the other switches.
The idea that a "degraded state" doesn't make sense for the component
master seems pretty fundamental to the entire component API. But I
figured I'd ask, before I put the idea of using this API to rest.
[ although I do like the debugfs device_component folder ]
Otherwise, my plan is to go ahead and introduce new dsa_switch_ops API
to let them know of dynamic updates to the routing table. This way, we
remove the baked assumption that all of it is available at ds->ops->setup()
time and never changes afterwards. There is nothing, functionally speaking,
that the component API can help me with, for this desired behavior.
Just thought I'd let everybody know of my current intention of making
changes to DSA, and gather some opinions/confirmation that I am on the
right track/alternative suggestions.
Thanks,
Vladimir
Powered by blists - more mailing lists