Message-ID: <87360m7acf.fsf@waldekranz.com>
Date: Fri, 04 Dec 2020 00:12:32 +0100
From: Tobias Waldekranz <tobias@...dekranz.com>
To: Vladimir Oltean <olteanv@...il.com>
Cc: davem@...emloft.net, kuba@...nel.org, andrew@...n.ch,
vivien.didelot@...il.com, f.fainelli@...il.com,
j.vosburgh@...il.com, vfalico@...il.com, andy@...yhouse.net,
netdev@...r.kernel.org
Subject: Re: [PATCH v3 net-next 2/4] net: dsa: Link aggregation support
On Thu, Dec 03, 2020 at 23:57, Vladimir Oltean <olteanv@...il.com> wrote:
> On Thu, Dec 03, 2020 at 09:53:08PM +0100, Tobias Waldekranz wrote:
>> I am not sure which way is the correct one, but I do not think that it
>> is necessarily _always_ correct to silently fall back to a non-offloaded
>> mode.
>
> Of course, neither is fully correct. There is always more to improve on
> the communication side of things. But DSA, which stands for "Not Just A
> Switch", promises you, above all, a port multiplexer with the ability to
> make full use of the network stack. Maybe I'm still under the influence
> of a recent customer meeting I had, but there is a large base of use
> cases there, where people just don't seem to have enough ports, and they
> can just add a $3 switch to their system, and voila, problem solved.
> Everything works absolutely the same as before. The newly added switch
> ports are completely integrated with the kernel's IP routing database.
> They aren't even switch ports, they're just ports.
>
> And even there, on the software fallback front, we don't do enough,
> currently. I've had other customers who said that they even _prefer_
> to do bridging in software, in order to get a chance to run their
> netfilter/ebtables based firewall on their traffic. Of course, DSA
> currently has no idea when netfilter rules are installed, so I just told
> them to hack the driver and remove the bridging offload, which they
> happily did.
Totally agree with your customer: we should be able to disable all
offloading for an individual port and run it in a plain "NIC mode". In
fact, that might be what opens up a third option for how LAG offloading
should be handled.
> I suspect you're looking at this from the wrong angle. I did, too, for
> the longest time, because I was focused on squeezing the most out of the
> hardware I had. And "I feel cheated because the operating system sits
> between me and the performance I want from my hardware" is an argument
> I've heard all too often. But not everybody needs even gigabits of
> traffic, or absurdly low latency. Making a product that works acceptably
> and at a low cost is better than hand-picking only hardware that can
> accelerate everything and spending $$$$ on it, for a performance
> improvement that no user will notice. Doing link aggregation in software
> is fine. Doing bridging in software is fine. Sure, the CPU can't compete
> past a number of KPPS, but maybe it doesn't even have to.
I really get the need for being able to apply some CPU-duct-tape to
solve that final corner case in a network. Or to use it as a cheap port
expander.
> Again, this doesn't mean that we should not report, somehow, what is
> offloaded and what isn't, and more importantly, the reason why
> offloading the set of required actions is not supported. And I do
> realize that I wrote a long rant about something that has nothing to do
> with the topic at hand, which is: should we deny bonding interfaces that
> use balance-rr, active-backup, broadcast, balance-tlb, balance-alb on
> top of a DSA interface? Hell no, you wouldn't expect an intel e1000 card
> to error out when you would set up a bonding interface that isn't xor or
> 802.3ad, would you? But you wouldn't go ahead and set up bridging
> offload either, therefore running into the problem I raised above with
> netfilter rules. You would just set out to do what the user asked for,
> in the way that you can, and let them decide if the performance is what
> they need or not.
You make a lot of good points. I think it might be better to force the
user to be explicit about their choice though. Imagine something like
this:
- We add NETIF_F_SWITCHDEV_OFFLOAD, which is set on switchdev ports by
default. This flag is only allowed to be toggled when the port has no
uppers - we do not want to deal with a port in a LAG in a bridge all
of a sudden changing mode.
- If it is set, we only allow uppers/tc-rules/etc that we can
offload. If the user tries to configure something outside of that, we
can suggest disabling offloading in the error we emit.
- If it is not set, we just sit back and let the kernel do its thing.
I think this would work well both for exotic LAG modes and for advanced
netfilter(ebtables)/tc setups. Example session:
$ ip link add dev bond0 type bond mode balance-rr
$ ip link set dev swp0 master bond0
Error: swp0: balance-rr not supported when using switchdev offloading
$ ethtool -K swp0 switchdev off
$ ip link set dev swp0 master bond0
$ echo $?
0
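
Very roughly, the check on the DSA side could then look something like
this (completely untested; NETIF_F_SWITCHDEV_OFFLOAD and
dsa_lag_can_offload() are made-up names just for the sake of the
example, and this assumes we already know the upper is a bond/team and
that we are linking):

/* Sketch only: honor the proposed per-port offload flag when a LAG
 * upper is about to be attached.
 */
static int dsa_slave_lag_prechangeupper(struct net_device *dev,
                                        struct netdev_notifier_changeupper_info *info)
{
        struct netlink_ext_ack *extack = netdev_notifier_info_to_extack(&info->info);
        struct netdev_lag_upper_info *lag_info = info->upper_info;

        if (!(dev->features & NETIF_F_SWITCHDEV_OFFLOAD))
                /* Plain "NIC mode": let the stack run the LAG in software */
                return NOTIFY_DONE;

        if (!dsa_lag_can_offload(dev, lag_info)) {
                NL_SET_ERR_MSG_MOD(extack,
                                   "LAG mode not offloadable, try ethtool -K <port> switchdev off");
                return notifier_from_errno(-EOPNOTSUPP);
        }

        return NOTIFY_DONE;
}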
>> > Should we add an array of supported TX types that the switch port can
>> > offload, and that should be checked by DSA in dsa_lag_offloading?
>>
>> That would work. We could also create a new DSA op that we would call
>> for each chip from PRECHANGEUPPER to verify that it is supported. One
>> advantage with this approach is that we can just pass along the `struct
>> netdev_lag_upper_info` so drivers always have access to all information,
>> in the event that new fields are added to it for example.
>
> Hmm, not super pleased to have yet another function pointer, but I don't
> have any other idea, so I guess that works just fine.
Yeah, I do not like it either; maybe it is just the least bad option.
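
Just to illustrate what I mean (the op name and exact placement are
made up, the point is only that the driver gets to see the whole struct
netdev_lag_upper_info and can reject what it cannot offload):

/* Sketch: hypothetical per-chip validation hook called from DSA's
 * PRECHANGEUPPER handling. Any field added to netdev_lag_upper_info
 * later (e.g. a hash policy) becomes visible here without changing
 * the DSA API.
 */
static int mv88e6xxx_port_lag_prechangeupper(struct dsa_switch *ds, int port,
                                             struct netdev_lag_upper_info *info)
{
        /* Today only hash-based distribution maps onto the trunk hardware */
        if (info->tx_type != NETDEV_LAG_TX_TYPE_HASH)
                return -EOPNOTSUPP;

        return 0;
}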
> I would even go out on a limb and say hardcode the TX_TYPE_HASH in DSA
> for now. I would be completely surprised to see hardware that can
> offload anything else in the near future.
If you tilt your head a little, I think active-backup is really just a
trivial case of a hashed LAG in which only a single member is ever
active, i.e. all buckets are always allocated to one port (effectively
negating the hashing). The active member is controlled by software, so I
think we should be able to support that.
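
Something like this, roughly (assuming the eight hash buckets per trunk
that mv88e6xxx gives you; for active-backup num_active is 1, so every
bucket lands on the currently active port):

/* Sketch: spread the per-LAG hash buckets over the active members.
 * balance-xor/802.3ad get a share of the buckets per member;
 * active-backup degenerates to all buckets on one port.
 */
static void lag_map_buckets(const int *active_ports, int num_active,
                            int bucket_to_port[8])
{
        int i;

        for (i = 0; i < 8; i++)
                bucket_to_port[i] = active_ports[i % num_active];
}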
mv88e6xxx could also theoretically be made to support broadcast. You can
enable any given bucket on multiple ports, but that seems silly.