netdev - Re: DSA: some questions regarding TX forwarding offload

Message-ID: <f6974437-4e5c-802f-a84c-52b1e9506660@bang-olufsen.dk>
Date:   Tue, 5 Oct 2021 12:06:38 +0000
From:   Alvin Šipraga <ALSI@...g-olufsen.dk>
To:     Vladimir Oltean <vladimir.oltean@....com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Florian Fainelli <f.fainelli@...il.com>,
        Andrew Lunn <andrew@...n.ch>
Subject: Re: DSA: some questions regarding TX forwarding offload

On 10/5/21 12:12 PM, Vladimir Oltean wrote:
> On Tue, Oct 05, 2021 at 08:54:34AM +0000, Alvin Šipraga wrote:
>> Hi,
>>
>> I am trying to implement TX forwarding offload for my in-progress
>> rtl8365mb DSA driver. I have some questions which I could use some
>> clarification on. They might be specific to my hardware, which is also
>> OK, but then some advice on how to proceed would be helpful.
>>
>> Q1. Can the tagging driver somehow retrieve a port mask from the DSA
>> switch driver in order to assemble the CPU->switch tag on xmit? Is there
>> some infrastructure in place to share such data between the two drivers?
> 
> Nope. DSA does not maintain a cache of FDB entries retrieved from
> hardware. So it cannot deduce the destination port mask from the {MAC DA, VLAN ID}
> of the skb. The software FDB maintained by the bridge driver is all there is.
> Based on the software FDB, which the bridge still looks up, an skb->dev
> is selected. All that the TX forwarding offload feature is is a way to
> remove software packet replication (skb_clone) for the case where the
> packet should have been flooded, or multicast, by the software bridge
> towards multiple skb->dev entities belonging to the same hardware domain.
> 
> To achieve the desired replication in hardware with DSA, the idea is to
> look up the FDB once more, but this time let the switch do it in hardware.

Right, yes, this part was more or less clear to me. I did not mean to 
imply that the port mask should be the exact forwarding port mask, but 
rather just an "allowance" port mask. But I think you understood the 
context given what you wrote below.

> 
> I see it similar to the quote "life is like a box of chocolates, you
> never know what you're going to get". Meaning that ok, you don't know
> exactly from software on which egress ports your packet is going to
> land, but the result shouldn't be too far off from the expectation in
> any case:
> 
> (a) hardware FDB and software FDB are in sync for the given {MAC DA, VLAN ID}:
>      packet will be forwarded in hardware towards the same port as it
>      would have without the TX forwarding offload feature
> 
> (b) FDB entry exists in software, but not in hardware: packet will be
>      sent once by the bridge, and will be flooded by the hardware towards
>      all bridge ports belonging to the switch's hardware domain
> 
> (c) FDB entry exists in hardware, but not in software: packet will be
>      "flooded" by the software bridge, but the switch will deliver it
>      precisely. Flooding is therefore avoided.
> 
> (d) FDB entry does not exist in hardware or in software: see case (a)
> 
>> Q2. Is it expected by DSA that two isolated ports (e.g. two ports
>> belonging to two separate bridges) can be members of the same VLAN
>> without issue?
> 
> It depends.
> 
> If you mean to ask: "given the way in which the DSA core is structured,
> what do you expect to happen?", the answer is that it won't work without leaks.
> 
> If you mean to ask: "what is the intention going forward?", the answer
> is that it should be made to work, and you should employ hardware specific
> mechanisms to avoid those leaks between VLAN N of br0 and VLAN N of br1,
> or deny the simultaneous existence of a VLAN-aware br0 and a VLAN-aware br1.
> 
> For example, right now you should at least impose the latter restriction,
> see for example sja1105_prechangeupper().

Thanks for the reference.

> 
> In the long term, you should get acquainted with your hardware's FDB
> isolation mechanism, because there will exist an API through which DSA
> will tell you "this switchdev object (FDB, MDB, VLAN) came from this
> bridge, which I've associated for you with a unique integer, just so you
> know when you program it to hardware, I might come back with an
> identical switchdev object later but on a different port, and belonging
> to a different bridge":
> https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/

The FDB isolation mechanism in my switch seems to be pretty good. As 
long as I can pass along *some* information from the switch driver to 
the tagging driver - namely the "allowance port mask" for a given bridge 
- I think I should be able to achieve full isolation between up to 7 
VLAN-aware bridges and with no restrictions on the number of VLANs per 
bridge, nor on the sharing of VLANs per bridge.

Here is a quick summary of the relevant behaviour of the switch:

VLANs programmed on the switch can be set to either SVL or IVL, on a 
per-VLAN basis. This affects how learned MAC addresses are searched 
for/saved in the hardware FDB:

   - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
   - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.

EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.

Unlike the FID - which is programmed per-VLAN - the EFID is programmed 
per-port. When a port has learning enabled and it receives an ingress 
frame with a given VID and MAC SA, it will search in the hardware FDB 
with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if 
the entry is not found, it will create a new one. This allows the switch 
to learn the same {VID, MAC SA} pair on two separate ports, provided 
those ports have different EFIDs.
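
A minimal sketch of the IVL keying described above, assuming a small linear
table rather than the switch's real lookup structure. All names are
hypothetical, and refreshing the port on a repeated key is an assumption, not
something confirmed for the hardware:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 16 /* arbitrary table size for this illustration */

struct fdb_entry {
	bool     valid;
	uint16_t vid;
	uint8_t  mac[6];
	uint8_t  efid; /* 3-bit EFID of the port that learned the entry */
	int      port; /* port the address was learned on */
};

static struct fdb_entry fdb[FDB_SIZE];

/* Look up the IVL key {VID, MAC, EFID}; returns NULL when not found. */
static struct fdb_entry *fdb_find(uint16_t vid, const uint8_t *mac,
				  uint8_t efid)
{
	for (int i = 0; i < FDB_SIZE; i++) {
		if (fdb[i].valid && fdb[i].vid == vid &&
		    fdb[i].efid == efid && !memcmp(fdb[i].mac, mac, 6))
			return &fdb[i];
	}
	return NULL;
}

/* Learn {VID, MAC SA} on an ingress port, keyed with that port's EFID.
 * Returns 0 on success, -1 when the table is full.
 */
static int fdb_learn(uint16_t vid, const uint8_t *mac_sa, uint8_t port_efid,
		     int port)
{
	struct fdb_entry *e = fdb_find(vid, mac_sa, port_efid);

	if (e) {
		e->port = port; /* assumption: refresh port on a repeated key */
		return 0;
	}

	for (int i = 0; i < FDB_SIZE; i++) {
		if (!fdb[i].valid) {
			fdb[i] = (struct fdb_entry){
				.valid = true,
				.vid = vid,
				.efid = port_efid,
				.port = port,
			};
			memcpy(fdb[i].mac, mac_sa, 6);
			return 0;
		}
	}
	return -1;
}
```

Learning the same {VID, MAC SA} with two different EFIDs yields two
independent entries, which is the property that keeps the bridges apart.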

With that in mind, I intend to enable IVL by default for all VLANs 
programmed to the hardware, and to reserve a given EFID - say EFID=0 - 
for all standalone ports which by definition will never learn anything. 
Together with the port isolation mechanism which is analogous to Linus' 
recent RTL8366RB changes, this should ensure that all ingress frames on 
a standalone port are trivially flooded to the CPU port only. This 
leaves 7 more EFIDs to use, which can each be mapped to a given 
bridge_num, such that we can support 7 hardware bridges with TX 
forwarding offload and IVL.
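
The EFID allocation could be sketched as follows. This assumes a 1-based
bridge number where 0 means "no offloaded bridge"; that numbering convention
and all names are assumptions made for illustration:

```c
#define EFID_STANDALONE 0 /* reserved for standalone ports */
#define EFID_MAX        7 /* EFID is 3 bits wide: values 0..7 */

/* Map a bridge number to an EFID.
 * Returns the EFID (1..7), or -1 when the bridge cannot be offloaded
 * because no EFID is left for it.
 */
static int bridge_num_to_efid(unsigned int bridge_num)
{
	if (bridge_num < 1 || bridge_num > EFID_MAX)
		return -1;

	return (int)bridge_num;
}
```

An eighth bridge gets -1 here, in which case a driver would presumably fall
back to software forwarding for that bridge.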

Finally, I want to implement TX forwarding offload, and to selectively 
enable learning on the CPU port for frames with skb->tx_fwd_offload == 
true. More on that later in this mail.

> The most flexible FDB isolation mechanism I've seen so far is in
> mv88e6xxx, you can freely associate a VID with a FID (of which there are
> 4K entries) and FDB lookup is performed by {FID, MAC DA}. This patch has
> the details of where mv88e6xxx is right now and what can be done further:
> https://patchwork.kernel.org/project/netdevbpf/patch/20211005001414.1234318-5-vladimir.oltean@nxp.com/
>
> So with that hardware, you can have 2 VLAN-aware bridges, and both
> bridges can use the full 4K VID space numerically, but in total you
> cannot have more than 4K FIDs in the system, so 1000 VLANs on one bridge
> and 3000 on the other, or distributions like that. Numerically, the VIDs
> of one bridge can be identical to the VIDs of another as long as FIDs
> are unique.
> 
>> Background: The RTL8365MB's CPU tag includes an ALLOW field followed by
>> a "port mask" field. If ALLOW=1 then - based on the VLAN tag in the
>> frame and the port mask - the switch will automatically replicate the
>> frame and egress it on all suitable ports, but only ports which are in
>> the port mask.
>>
>> If ALLOW=1, and if the port mask is all zeroes or all ones, then the
>> switch will make its forwarding decision based only on the VLAN tag in
>> the frame (if any). Now consider a configuration as follows:
> 
> When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
> is not taken into consideration? Are packets flooded towards the entire
> set of ports in the allowance port mask that are members of VLAN N?

My choice of words was imprecise. I do not mean to say that the MAC DA 
is ignored. If there exists a suitable entry in the hardware FDB for 
that MAC DA and VLAN n, the switch will _not_ flood the entire set of 
ports in the allowance port mask that are members of VLAN n. Instead, it 
_will_ respect the information contained in the hardware FDB and forward 
the packet only on the given port(s) in the FDB. Note that the switch 
will still respect the allowance port mask, so this is something like: 
forwarding_portmask = (fdb_portmask & allowance_portmask).
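
As a rough model of that decision, with hypothetical names: on an FDB miss
the frame is assumed to flood to the VLAN's member ports, and an all-zero
allowance mask is treated as "catch all", matching the behaviour described
earlier in the thread. This is a reading of the observed behaviour, not a
documented algorithm:

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute the egress port mask for a CPU-injected frame with ALLOW=1.
 *
 * fdb_hit:        true if the FDB has an entry for this {MAC DA, VLAN}
 * fdb_portmask:   port(s) stored in that FDB entry (ignored on a miss)
 * vlan_members:   member ports of the frame's VLAN (flood set on a miss)
 * allowance_mask: port mask taken from the CPU tag
 * all_ports:      mask of every port on the switch
 */
static uint16_t forwarding_portmask(bool fdb_hit, uint16_t fdb_portmask,
				    uint16_t vlan_members,
				    uint16_t allowance_mask,
				    uint16_t all_ports)
{
	uint16_t candidates = fdb_hit ? fdb_portmask : vlan_members;

	/* An all-zero mask places no restriction on the decision. */
	if (allowance_mask == 0)
		allowance_mask = all_ports;

	return candidates & allowance_mask;
}
```

With ports 0-1 in one bridge and 2-3 in another, an allowance mask of 0x3
keeps a flooded frame on ports 0 and 1 even when all four ports are members
of the same VLAN.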

> Do you have address learning properly set up, and can you confirm with
> an FDB dump that the FDB is not in fact empty in the FID you are
> injecting in (see below)?

I have it set up properly - or as much as I can, some things are still 
to be ironed out - and I can dump the FDB and see that learning is 
taking place according to my expectations which I described above. 
Note I have only tested this for unicast so far, although I think the 
rules for multicast are not dissimilar.

> 
>>           br0            br1
>>            +              +
>>            |              |
>>        +---+---+      +---+---+
>>        |       |      |       |
>>       swp0    swp1   swp2    swp3
>>
>> ... with both bridges containing switch port(s) belonging to the same
>> VLAN n. How should I prevent - with TX forwarding offload - a packet
>> with VID=n from being egressed on a port on the opposite bridge which
>> belongs to the same VLAN n?
>>
>> In the above scenario, either I must refine the CPU tag "port mask"
>> (hence Q1), or I must restrict the hardware configuration in some way
>> (hence Q2), or I must conclude that TX forwarding offload is not
>> possible with these constraints, or there is some alternative solution
>> or nuance that I have not thought of.
> 
> I don't want to answer any of these questions until I understand how
> your hardware intends the FID and FID_EN bits from the DSA header to
> be used. The FID only has 2 bits, so it is clear to me that it doesn't
> have the same understanding of the term as mv88e6xxx, if the Realtek
> switch has up to 4 FIDs while Marvell up to 4K.

I came to the same conclusion until I started to play around with this, 
only to discover that the Realtek documentation is wrong.

First, so that we are on the same page, here again is the relevant part 
of the CPU tag we are talking about:

0                                  7|8                                 15
|-----------------------------------+-----------------------------------|
|                               (16-bit)                                |
|                       Realtek EtherType [0x8899]                      |
|-----------------------------------+-----------------------------------|
|              (8-bit)              |              (8-bit)              |
|          Protocol [0x04]          |              REASON               |
|-----------------------------------+-----------------------------------|
|   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
| FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
|-----------------------------------+-----------------------------------|
  ^^^^^^^^^^^^^^^^^^^^
      look here

What actually appears to be the case - at least in the IVL case I 
described above - is that the fields FID_EN and FID should rather be 
named EFID_EN and EFID. Moreover, the reserved bit X between the two 
fields is actually an extension of the newly-named EFID field. So things 
look more like this:

|-----------------------------------+-----------------------------------|
|   (1)   |    (3)   |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
| EFID_EN |   EFID   | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
|-----------------------------------+-----------------------------------|
  ^^^^^^^^^^^^^^^^^^^^
      look here

If EFID_EN=1 and LEARN_DIS=0 and learning is enabled on the CPU port, 
then the switch will learn the MAC SA of the frame and enter it into the 
FDB with the corresponding VID (according to the 802.1Q tag) and the 
corresponding EFID (according to the CPU tag and the EFID field). This 
is super useful because it enables the strategy I outlined above, and 
also avoids having to rely on the assisted_learning_on_cpu_port flag.
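
A sketch of how a tagger might assemble that 16-bit word. The bit positions
are read off the diagram above on the assumption that bit 0 in the diagram is
the most significant bit of the word; the macro and function names are
hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions, assuming the leftmost diagram bit is bit 15 (MSB). */
#define TAG_EFID_EN    (1u << 15)
#define TAG_EFID_SHIFT 12
#define TAG_EFID_MASK  0x7u        /* EFID is 3 bits wide */
#define TAG_LEARN_DIS  (1u << 5)

/* Build the tag word for a frame injected on behalf of a bridge.
 * efid:  EFID mapped to the bridge the frame belongs to
 * learn: true to let the switch learn the MAC SA from this frame
 * The PRI_EN, PRI and KEEP fields are left at zero in this sketch.
 */
static uint16_t tag_efid_word(uint8_t efid, bool learn)
{
	uint32_t word = TAG_EFID_EN |
			((efid & TAG_EFID_MASK) << TAG_EFID_SHIFT);

	if (!learn)
		word |= TAG_LEARN_DIS;

	return (uint16_t)word;
}
```

A real tagger would still have to convert the word to network byte order
before writing it into the frame.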

What this doesn't do is help me with the actual forwarding decision of 
the frame. I hoped that by setting EFID_EN=1, EFID=k, ALLOW=1, and a 
"catch all" allowance port mask of 0 or ~0 (I tested both), the switch 
would only consider forwarding the frame to ports with EFID == k. This 
is not the case however, and it seems that the EFID_EN/EFID fields of 
the CPU tag only affect how the switch learns from this frame. Hence my 
need to "tune" the allowance port mask.

In case you are suspicious when I say the documentation is wrong: I 
tested this behaviour quite heavily in order to come to this conclusion. 
The EFID field is indeed 3 bits - and this matches with the definition 
of EFID in the datasheet of the chip - and by setting it to some 3-bit 
value like 7, I see this reflected in the hardware FDB after dumping it.

> 
> You should ask yourself not only how to prevent leakage, but also the
> flip side, how should I pass the packet to the switch in such a way that
> it will learn its MAC SA in the right FID, assuming that you go with FDB
> isolation first and figure that out. Once that question is answered, you
> can in principle specify an allowance port mask which is larger than
> needed (the entire mask of user ports) and the switch should only
> forward it towards the ports belonging to the same FID, which are
> roughly equivalent with the ports under a specific bridge. You can
> create a mapping between a FID and dp->bridge_num. Makes sense or am I
> completely off?

Right! This was exactly my plan (save for s/FID/EFID/), but I wanted to 
discuss the situation with you because I know that you have some planned 
changes for net-next and I was not sure if this method is considered 
acceptable in DSA land. I hope the explanation above clarifies the 
situation a bit. I will go ahead now and try to implement this mapping 
between bridge_num and EFID, so that the tagging driver can look up the 
correct allowance port mask on xmit of a bridge frame.
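
One possible shape for that lookup, assuming the switch driver keeps a
per-EFID port mask that it updates when ports join or leave a bridge and
that the tagger reads on xmit. How the two drivers would actually share this
state is the open question of this thread, so the names and the sharing
mechanism here are purely illustrative:

```c
#include <stdint.h>

#define NUM_EFIDS 8  /* EFID is 3 bits wide */
#define NUM_PORTS 16 /* arbitrary upper bound for this sketch */

/* Allowance port mask per EFID, i.e. per offloaded bridge. */
static uint16_t efid_portmask[NUM_EFIDS];

/* Called by the switch driver when a port joins a bridge. */
static void efid_port_join(unsigned int efid, unsigned int port)
{
	if (efid < NUM_EFIDS && port < NUM_PORTS)
		efid_portmask[efid] |= (uint16_t)(1u << port);
}

/* Called by the switch driver when a port leaves a bridge. */
static void efid_port_leave(unsigned int efid, unsigned int port)
{
	if (efid < NUM_EFIDS && port < NUM_PORTS)
		efid_portmask[efid] &= (uint16_t)~(1u << port);
}

/* Called by the tagger on xmit of a bridge frame. */
static uint16_t efid_allowance_mask(unsigned int efid)
{
	return efid < NUM_EFIDS ? efid_portmask[efid] : 0;
}
```

For the two-bridge diagram above, swp0 and swp1 would land in one EFID's mask
and swp2 and swp3 in another, so a frame sent on behalf of br0 could never be
allowed out of a br1 port.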

Thanks for your detailed response.

	Alvin