lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 9 Mar 2022 16:06:08 +0000
From:   "Russell King (Oracle)" <linux@...linux.org.uk>
To:     Vladimir Oltean <olteanv@...il.com>
Cc:     Andrew Lunn <andrew@...n.ch>,
        Heiner Kallweit <hkallweit1@...il.com>,
        "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        Vivien Didelot <vivien.didelot@...il.com>,
        Florian Fainelli <f.fainelli@...il.com>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH net] net: dsa: silence fdb errors when unsupported

On Wed, Mar 09, 2022 at 05:51:47PM +0200, Vladimir Oltean wrote:
> On Wed, Mar 09, 2022 at 02:18:48PM +0000, Russell King (Oracle) wrote:
> > On Wed, Mar 09, 2022 at 12:41:43PM +0200, Vladimir Oltean wrote:
> > > Hello Russell,
> > > 
> > > On Wed, Mar 09, 2022 at 10:35:32AM +0000, Russell King (Oracle) wrote:
> > > > When booting with a Marvell 88e6xxx switch, the kernel spits out a
> > > > load of:
> > > > 
> > > > [    7.820996] mv88e6085 f1072004.mdio-mii:04: port 3 failed to add aa:bb:cc:dd:ee:ff vid XYZ1 to fdb: -95
> > > > [    7.835717] mv88e6085 f1072004.mdio-mii:04: port 2 failed to add aa:bb:cc:dd:ee:ff vid XYZ1 to fdb: -95
> > > > [    7.851090] mv88e6085 f1072004.mdio-mii:04: port 1 failed to add aa:bb:cc:dd:ee:ff vid XYZ1 to fdb: -95
> > > > [    7.968594] mv88e6085 f1072004.mdio-mii:04: port 0 failed to add aa:bb:cc:dd:ee:ff vid XYZ1 to fdb: -95
> > > > [    8.035408] mv88e6085 f1072004.mdio-mii:04: port 3 failed to add aa:bb:cc:dd:ee:ff vid XYZ3 to fdb: -95
> > > > 
> > > > while the switch is being setup. Comments in the Marvell DSA driver
> > > > indicate that "switchdev expects -EOPNOTSUPP to honor software VLANs"
> > > > in mv88e6xxx_port_db_load_purge() so this error code should not be
> > > > treated as an error.
> > > > 
> > > > Fixes: 3dc80afc5098 ("net: dsa: introduce a separate cross-chip notifier type for host FDBs")
> > > > Signed-off-by: Russell King (Oracle) <rmk+kernel@...linux.org.uk>
> > > > ---
> > > > Hi,
> > > > 
> > > > I noticed these errors booting 5.16 on my Clearfog platforms with a
> > > > Marvell DSA switch. It appears that the switch continues to work
> > > > even though these errors are logged in the kernel log, so this patch
> > > > merely silences the errors, but I'm unsure this is the right thing
> > > > to do.
> > > 
> > > Can you please confirm that these errors have disappeared on net-next?
> > 
> > net-next: no warnings
> > v5.17-rc7: warnings
> > v5.16: warnings
> 
> Thanks. This means it was solved by this patch set, which had the exact
> same symptoms:
> https://patchwork.kernel.org/project/netdevbpf/cover/20220215170218.2032432-1-vladimir.oltean@nxp.com/
> The cover letter and commit messages provide a full description of what
> was wrong, has been wrong/a limitation for years, and give a sense of
> why that work cannot be backported to v5.16, since it implies a
> non-trivial amount of changes to the bridge driver, to switchdev and to
> DSA.
> 
> > So, it looks like we need a patch for 5.17-rc7 and 5.16-stable to fix
> > this. Do you have a better suggestion than my patch?
> 
> So to answer the question first, then to explain.
> 
> My better suggestion, taking into consideration more factors, is to
> drop this patch and leave the code as it is.
> 
> The comment in the Marvell DSA driver, which you cited:
> 
> mv88e6xxx_port_db_load_purge:
> 	/* switchdev expects -EOPNOTSUPP to honor software VLANs */
> 	if (!vlan.valid)
> 		return -EOPNOTSUPP;
> 
> is a logical fallacy.
> 
> Fact: the bridge does _not_ check errors from br_switchdev_fdb_notify().
> Not only that, but there isn't even any mechanism in place today such
> that this would be possible. Otherwise, myself, Nikolay and Ido wouldn't
> have been unsuccessfully scrambling for several months to address that
> limitation (actually mostly me, but they were the direct victims of
> trying to review my attempts):
> https://patchwork.kernel.org/project/netdevbpf/cover/20211025222415.983883-1-vladimir.oltean@nxp.com/
> 
> So if switchdev is not aware of mv88e6xxx returning -EOPNOTSUPP from a
> work item that is scheduled from the SWITCHDEV_FDB_ADD_TO_DEVICE atomic
> handler, then logically, switchdev cannot anything from -EOPNOTSUPP.
> This error code has no special meaning.
> 
> Note that there was a second call path to drivers' ds->ops->port_fdb_add(),
> from the DSA slave's ndo_fdb_add, and that did propagate errors. But
> that is "bridge bypass", not switchdev, and for doing more harm than
> good, support for it was removed in commit b117e1e8a86d ("net: dsa:
> delete dsa_legacy_fdb_add and dsa_legacy_fdb_del"). Not only is this
> call path no longer present today, but -EOPNOTSUPP isn't a special error
> code there, either.
> 
> Having said this, the initiator of the call path which fails with
> -EOPNOTSUPP in mv88e6xxx_port_db_load_purge() is the bridge driver,
> via switchdev.
> 
> The mv88e6xxx driver, via that check, essentially expects that the VID
> from the FDB entry has been previously mapped in the VTU.
> 
> The bridge driver already ensures that the VLAN structure has been
> created, before any FDB entries can be added in that VLAN. This is
> proven by the attempt below to add an FDB entry in VLAN 5, in bridge
> port swp0:
> 
> root@...ian:~# bridge fdb add dev swp0 static master 00:01:02:03:04:05 vlan 5
> [109407.613593] bridge: RTM_NEWNEIGH with unconfigured vlan 5 on swp0
> RTNETLINK answers: Invalid argument
> root@...ian:~# bridge vlan
> port    vlan ids
> swp0     1 PVID Egress Untagged
> 
> br0      1 PVID Egress Untagged
> 
> root@...ian:~# bridge link
> 9: swp0@...2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state disabled priority 32 cost 4
> 
> So the sanity check captures what is essentially an invalid condition.
> If the bridge notifies an FDB entry in a VLAN that isn't present in the
> VTU, this is a bug.
> 
> Now on to my point. There is a trade-off to be made between things
> running apparently smoothly and running correctly.
> 
> Since the DSA switchdev work item is the final entity that sees error
> codes propagated from the driver, I find it valuable that these error
> codes are not hidden in any way. I do not have access to all hardware,
> I cannot foresee all code paths, but yet, users who notice errors in the
> kernel log report these problems, we look into them and we fix them.
> In fact, this is exactly the way in which Rafael pointed out to me the
> mv88e6xxx issue, and this led to the aforementioned bridge/switchdev/DSA
> rework. It is a process that works, and I would like it to not be
> disrupted.
> 
> This isn't to say that there is no argument to be made for hiding these
> error messages. There is a possibility that I might agree with silencing
> some errors in some cases, but I have not heard a compelling argument
> for that. Also, with mv88e6xxx and systemd-networkd, it isn't even the
> first warning/error of its kind that gets printed by this driver out of
> the blue. The driver also warns when systemd-networkd reinitializes
> bridge VLANs on ports, stating that "port X already a member of VLAN y".
> So I believe that mv88e6xxx users already have some tolerance for
> non-fatal warnings. These latter warnings, by the way, have also
> disappeared as a result of the patch set I linked in the beginning.
> 
> Therefore, I think it is in the best interest of the code base for users
> to report the problems they see, instead of masking them. This means
> walking through potentially known and already fixed problems/limitations,
> debugging, explaining, testing, writing long emails etc, but I prefer
> this to the alternative of cutting myself from a potentially valid
> source of information.

So, to summarise your very verbose email.

5.13 didn't spit out these errors.

Changes happened between 5.13 and 5.16 that exposed these errors and
the errors continue to be exposed in 5.17, leading users to panic
believing that something is wrong.

You don't want to hide the errors.

It will be fixed in 5.18.

Thanks, but I think I'll keep my patch in my tree so I don't have to
look at a meaningless error that doesn't appear to affect operation.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ