netdev - Re: mv88e6xxx broken on 6176 with "Disentangle STU from VTU"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220318230229.urddx3t7x4hk356t@skbuf>
Date:   Sat, 19 Mar 2022 01:02:29 +0200
From:   Vladimir Oltean <olteanv@...il.com>
To:     Tobias Waldekranz <tobias@...dekranz.com>
Cc:     Marek Behún <kabel@...nel.org>,
        Andrew Lunn <andrew@...n.ch>, netdev@...r.kernel.org
Subject: Re: mv88e6xxx broken on 6176 with "Disentangle STU from VTU"

On Fri, Mar 18, 2022 at 10:16:55PM +0100, Tobias Waldekranz wrote:
> On Fri, Mar 18, 2022 at 22:18, Vladimir Oltean <olteanv@...il.com> wrote:
> > Hello Tobias,
> >
> > On Fri, Mar 18, 2022 at 08:20:33PM +0100, Tobias Waldekranz wrote:
> >> On Fri, Mar 18, 2022 at 18:28, Marek Behún <kabel@...nel.org> wrote:
> >> > Hello Tobias,
> >> >
> >> > mv88e6xxx fails to probe in net-next on Turris Omnia, bisect leads to
> >> > commit
> >> >   49c98c1dc7d9 ("net: dsa: mv88e6xxx: Disentangle STU from VTU")
> >> 
> >> Oh wow, really sorry about that! I have it reproduced, and I understand
> >> the issue.
> >> 
> >> > Trace:
> >> >   mv88e6xxx_setup
> >> >     mv88e6xxx_setup_port
> >> >       mv88e6xxx_port_vlan_join(MV88E6XXX_VID_STANDALONE) OK
> >> >       mv88e6xxx_port_vlan_join(MV88E6XXX_VID_BRIDGED) -EOPNOTSUPP
> >> >
> >> 
> >> Thanks, that make it easy to find. There is a mismatch between what the
> >> family-info struct says and what the chip-specific ops struct supports.
> >> 
> >> I'll try to send a fix ASAP.
> >
> > I've seen your patches, but I don't understand the problem they fix.
> > For switches like 6190 indeed this is a problem. It has max_stu = 63 but
> > mv88e6190_ops has no stu_getnext or stu_loadpurge. That I understand.
> >
> > But Marek reported the problem on 6176. There, max_sid is 0, so
> > mv88e6xxx_has_stu() should already return false. Where is the
> > -EOPNOTSUPP returned from?
> 
> Somewhat surprisingly, it is from mv88e6xxx_broadcast_setup.

Sorry for the delay, I didn't notice the email because I was busy
gathering my jaw from the floor after relistening some of Marc Martel's
Queen covers.

This one looks a lot more plausible, let me see if I get it right below.

> Ok, I'll go out on a limb and say that _now_ I know what the problem
> is. If I uncomment .max_sid and .stu_{loadpurge,getnext} from my 6352
> (which, like the 6176, is also of the Agate-family) I can reproduce the
> same issue.
> 
> It seems like this family does not like to load VTU entries who's SID
> points to an invalid STU entry. Since .max_sid == 0, we never run
> stu_setup, which takes care of loading a valid STU entry for SID 0;
> therefore when we read back MV88E6XXX_VID_BRIDGED in
> mv88e6xxx_port_db_load_purge it is reported as invalid.
> 
> This still doesn't explain why we're able to load
> MV88E6XXX_VID_STANDALONE though...

Why doesn't it explain it? MV88E6XXX_VID_STANDALONE is 0, we have this
code so it falls in the branch that doesn't call mv88e6xxx_vtu_get():

	if (vid == 0) {
		fid = MV88E6XXX_FID_BRIDGED;
	} else {
		err = mv88e6xxx_vtu_get(chip, vid, &vlan);
		if (err)
			return err;

		/* switchdev expects -EOPNOTSUPP to honor software VLANs */
		if (!vlan.valid)
			return -EOPNOTSUPP;

		fid = vlan.fid;
	}

> Vladimir, any advise on how to proceed here? I took a very conservative
> approach to filling in the STU ops, only enabling it on HW that I could
> test. I could study some datasheets and make an educated guess about the
> full range of chips that we could enable this on, and which version of
> the ops to use. Does that sound reasonable?

Before, MV88E6XXX_G1_VTU_OP_STU_LOAD_PURGE was done from 2 places:

mv88e6352_g1_vtu_loadpurge()
mv88e6085_ops, mv88e6097_ops, mv88e6123_ops, mv88e6141_ops, mv88e6161_ops,
mv88e6165_ops, mv88e6171_ops, mv88e6172_ops, mv88e6175_ops, mv88e6176_ops,
mv88e6240_ops, mv88e6341_ops, mv88e6350_ops, mv88e6351_ops, mv88e6352_ops

mv88e6390_g1_vtu_loadpurge()
mv88e6190_ops, mv88e6190x_ops, mv88e6191_ops, mv88e6290_ops, mv88e6390_ops,
mv88e6390x_ops, mv88e6393x_ops

After the change, MV88E6XXX_G1_VTU_OP_STU_LOAD_PURGE is done only from the ops
that have stu_loadpurge:

mv88e6352_g1_stu_loadpurge()
mv88e6097_ops, mv88e6352_ops

mv88e6390_g1_stu_loadpurge()
mv88e6390_ops, mv88e6390x_ops, mv88e6393x_ops

So if I understand correctly, we have this regression for all families that are
in the first group but not in the second group. I.e. a lot of families.

There's nothing wrong with being conservative, as long as you're a
correct conservative. In this case, I believe that the switch families
where you couldn't test MSTP should at least have a max_sid of 1, to
allow SID 0 to be loaded. So you don't have to claim untested MSTP
support. But then you may need to refine the guarding that allows for
MSTP support, to check for > 1 instead of > 0.