netdev - Re: DSA: some questions regarding TX forwarding offload

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3603d4eb-61d0-e64d-3e9d-53803d8c46a4@bang-olufsen.dk>
Date:   Tue, 5 Oct 2021 14:38:28 +0000
From:   Alvin Šipraga <ALSI@...g-olufsen.dk>
To:     Vladimir Oltean <vladimir.oltean@....com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Florian Fainelli <f.fainelli@...il.com>,
        Andrew Lunn <andrew@...n.ch>
Subject: Re: DSA: some questions regarding TX forwarding offload

On 10/5/21 3:29 PM, Vladimir Oltean wrote:
> On Tue, Oct 05, 2021 at 12:06:38PM +0000, Alvin Šipraga wrote:
>> The FDB isolation mechanism in my switch seems to be pretty good. As
>> long as I can pass along *some* information from the switch driver to
>> the tagging driver - namely the "allowance port mask" for a given bridge
>> - I think I should be able to achieve full isolation between up to 7
>> VLAN-aware bridges and with no restrictions on the number of VLANs per
>> bridge, nor on the sharing of VLANs per bridge.
>>
>> Here is a quick summary of the relevant behaviour of the switch:
>>
>> VLANs programmed on the switch can be set to either SVL or IVL, on a
>> per-VLAN basis. This affects how learned MAC addresses are searched
>> for/saved in the hardware FDB:
>>
>>     - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
>>     - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.
>>
>> EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.
>>
>> Unlike the FID - which is programmed per-VLAN - the EFID is programmed
>> per-port. When a port has learning enabled and it receives an ingress
>> frame with a given VID and MAC SA, it will search in the hardware FDB
>> with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if
>> the entry is not found, it will create a new one. This allows the switch
>> to learn the same {VID, MAC SA} pair on two separate ports, provided
>> those ports have different EFIDs.
> 
> So you are basically saying that for FDB lookups, the EFID is like a
> 3-bit extension of the 12-bit VID, practically emulating a 32K VLAN space?

On a high level yes, you can look at it that way. In practice I think 
the picture is a bit more nuanced, and that enabling IVL on a particular 
VLAN just means that the ASIC will do the VLAN<->FID mapping 
automatically. However I was too eager to state "no restrictions on the 
number of VLANs [...]" because in fact the FIDs of this chip are only 
4-bit. So if I use IVL, the limit will be 16 VLANs. However, I think the 
EFID strategy for FDB isolation still holds for SVL. Most of my 
understanding of the IVL/SVL feature is empirical, based on pushing 
packets and observing how the switch forwards them, and dumping the 
hardware FDB.

It could be that my conclusions about "lookup by VID" as opposed to 
"lookup by FID" are wrong, but if it comes to that, I will just have to 
manually implement VID<->FID mapping in the driver.

> 
>>> When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
>>> is not taken into consideration? Are packets flooded towards the entire
>>> set of ports in the allowance port mask that are members of VLAN N?
>>
>> My choice of words was imprecise. I do not mean to say that the MAC DA
>> is ignored. If there exists a suitable entry in the hardware FDB for
>> that MAC DA and VLAN n, the switch will _not_ flood the entire set of
>> ports in the allowance port mask that are members of VLAN n. Instead, it
>> _will_ respect the information contained the hardware FDB and forward
>> the packet only on the given port(s) in the FDB. Note that the switch
>> will still respect the allowance port mask, so this is something like:
>> forwarding_portmask = (fdb_portmask & allowance_portmask).
>                           ^^^^^^^^^^^^
>                           and this portion should take the EFID into
>                           consideration, should it not?

Correct. I have verified this behaviour.

> 
>>> I don't want to answer any of these questions until I understand how
>>> does your hardware intend the FID and FID_EN bits from the DSA header to
>>> be used. The FID only has 2 bits, so it is clear to me that it doesn't
>>> have the same understanding of the term as mv88e6xxx, if the Realtek
>>> switch has up to 4 FIDs while Marvell up to 4K.
>>
>> I came to the same conclusion until I started to play around with this,
>> only to discover that the Realtek documentation is wrong.
>>
>> First, so that we are on the same page, here again is the relevant part
>> of the CPU tag we are talking about:
>>
>> 0                                  7|8                                 15
>> |-----------------------------------+-----------------------------------|
>> |                               (16-bit)                                |
>> |                       Realtek EtherType [0x8899]                      |
>> |-----------------------------------+-----------------------------------|
>> |              (8-bit)              |              (8-bit)              |
>> |          Protocol [0x04]          |              REASON               |
>> |-----------------------------------+-----------------------------------|
>> |   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
>> | FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
>> |-----------------------------------+-----------------------------------|
>>    ^^^^^^^^^^^^^^^^^^^^
>>        look here
>>
>> What actually appears to be the case - at least in the IVL case I
>> described above - is that the fields FID_EN and FID should rather be
>> named EFID_EN and EFID. Moreover, the reserved bit X between the two
>> fields is actually an extension of the newly-named EFID field. So things
>> look more like this:
>>
>> |-----------------------------------+-----------------------------------|
>> |   (1)   |    (3)   |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
>> | EFID_EN |   EFID   | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
>> |-----------------------------------+-----------------------------------|
>>    ^^^^^^^^^^^^^^^^^^^^
>>        look here
>>
>> If EFID_EN=1 and LEARN_DIS=0 and learning is enabled on the CPU port,
>> then the switch will learn the MAC SA of the frame and enter it into the
>> FDB with the corresponding VID (according to the 802.1Q tag) and the
>> corresponding EFID (according to the CPU tag and the EFID field). This
>> is super useful because it enables the strategy I outlined above, and
>> also avoids having to rely on the assisted_learning_on_cpu_port flag.
>>
>> What this doesn't do is help me with the actual forwarding decision of
>> the frame. I hoped that by setting EFID_EN=1, EFID=k, ALLOW=1, and a
>> "catch all" allowance port mask of 0 or ~0 (I tested both), the switch
>> would only consider forwarding the frame to ports with EFID == k. This
>> is not the case however, and it seems that the EFID_EN/EFID fields of
>> the CPU tag only affect how the switch learns from this frame. Hence my
>> need to "tune" the allowance port mask.
> 
> So a frame looks up the FDB twice, once on ingress for learning and once
> on egress for selecting the destination port mask.

This improves my mental model of the switch hardware, thanks.

> 
> Practically are you saying that the switch loses the EFID information
> between the ingress and the egress stage, since the destination port
> mask is selected based on a key constructed with "don't care" in the EFID bits?
> Strange.

Strange indeed - and wrong! I just checked this again. The switch 
actually _does_ preserve the EFID for the second lookup when selecting 
the destination port mask, and this behaves as you would expect. My 
observation to the contrary was specifically for the case where there is 
no hit for the destination address, in which case the switch will 
_flood_ according to the VLAN and MAC DA, without regard for the EFID. 
This kind of makes sense, since the EFID is just a searching/learning 
look-up-table concept and is not related to flooding. OTOH there are 
flooding port mask registers where one can set for 
{uni,multi,broad}cast, but this configuration is independent of VLAN.

This is OK, but I think it still puts me back in the same situation 
where I need to communicate a meaningful port allowance mask to avoid 
flooding the wrong ports.

> 
>> In case you are suspicious when I say the documentation is wrong: I
>> tested this behaviour quite heavily in order to come to this conclusion.
>> The EFID field is indeed 3 bits - and this matches with the definition
>> of EFID in the datasheet of the chip - and by setting it to some 3-bit
>> value like 7, I see this reflected in the hardware FDB after dumping it.
> 
> Ok, this sounds reasonable.
> 
>>> You should ask yourself not only how to prevent leakage, but also the
>>> flip side, how should I pass the packet to the switch in such a way that
>>> it will learn its MAC SA in the right FID, assuming that you go with FDB
>>> isolation first and figure that out. Once that question is answered, you
>>> can in premise specify an allowance port mask which is larger than
>>> needed (the entire mask of user ports) and the switch should only
>>> forward it towards the ports belonging to the same FID, which are
>>> roughly equivalent with the ports under a specific bridge. You can
>>> create a mapping between a FID and dp->bridge_num. Makes sense or am I
>>> completely off?
>>
>> Right! This was exactly my plan (save for s/FID/EFID/), but I wanted to
>> discuss the situation with you because I know that you have some planned
>> changes for net-next and I was not sure if this method is considered
>> acceptable in DSA land. I hope the explanation above clarifies the
>> situation a bit. I will go ahead now and try to implement this mapping
>> between bridge_num and EFID, so that the tagging driver can look up the
>> correct allowance port mask on xmit of a bridge frame.
> 
> So in DSA we have a feature which is not very well liked, and you might
> understand the reason once I explain it.
> 
> DSA tagging protocols are constructed to be agnostic of the switch they
> are paired with. The reasoning was that in the beginning, sending and
> receiving a packet was a very simple operation, on receive you just
> needed to extract the information present in the DSA tag and select a
> net device based on simple info, and on xmit you just need to construct
> a simple tag based on the info provided to you (which port must be
> targeted). So there simply was no reason to couple them.
> 
> DSA also is quite the wild west in terms of hardware implementations.
> We make almost everything optional and the result is that some switches
> go on and implement super advanced features, and some lag behind.
> Testing becomes very hard, sometimes different users don't even have the
> same experience if they use hardware from different vendors.
>  From maintainers' point of view there is a desire to at least keep
> tagging protocols testable on generic hardware, for more than one
> reason. First you might want to analyze how a given Ethernet controller
> will behave when coupled with a DSA switch you don't have (stuff like
> hardware parsers come to mind, with implications for RFS/RSS), and you
> might simply want to debug stuff (we've had issues with VLANs dropped by
> the RX filter of the DSA master, when the switch used a tail tagging
> protocol; lack of Ethernet frame padding in certain conditions; DSA
> inheriting some master->vlan_features it shouldn't have, like TX TCP
> checksum offloading with tail taggers). In these cases, we did not have
> the hardware where the problem was reported, but nonetheless we could
> simulate an environment with the problem. We did that by using the
> mock-up dsa_loop driver, and hacking dsa_loop_get_protocol() to return
> the tagging protocol we wanted to test.
> 
> So generally speaking, whatever a tagging protocol driver may do, it
> should do something sensible when paired with dsa_loop too. We have come
> to depend on that lack of coupling.
> 
> On the other hand, there are situations where decoupled drivers won't
> really cut it. You can see this especially with the sja1105 driver and
> its tagging protocol, there is even a state machine implemented for PTP
> frames, with storage provided by the switch driver. I've no idea how
> come exactly the switches I directly maintain are the most quirky in
> this regard. Anyway.
> 
> So there exists a "void * dp->priv" pointer which you can use to share
> information between the switch driver and the tagging protocol driver.
> Be careful though, from the tagger's perspective, dp->priv may be
> provided by the switch driver you expect, or not (true, you would need
> to hack up a driver like dsa_loop to return DSA_TAG_PROTO_RTL8_4, but
> some crazy people like maintainers might do that). The compromise I've
> found (not ideal by any means) is to access dp->priv from code paths
> that cannot be triggered with dsa_loop. For example, since dsa_loop does
> not declare TX forwarding offload for the bridge, then you could
> dereference dp->priv inside an "if (skb->offload_fwd_mark)" on xmit and
> you could be fairly sure you are coupled with your switch.

Okay, so far so good - this is what I was working towards already.

> 
> Okay, so what can you put inside dp->priv? Mostly anything, as long as
> you have a basic understanding of the data persistence rules.
> The fast path races with the slow path and there is no locking between
> them. However, the particular case of xmitting a packet on behalf of the
> Linux bridge is protected by RCU:
> 
> - if the packet is sent through a socket opened on the bridge device
>    itself, br_dev_xmit() holds rcu_read_lock() for the entire duration,
>    including during br_dev_queue_push_xmit() which calls dev_queue_xmit
>    for the DSA interface.
> 
> - if the packet is not terminated locally but forwarded, the execution
>    context is that of the NET_RX softirq of the ingress interface, in
>    which the bridge rx_handler runs (br_handle_frame). And the receive
>    data path is under rcu_read_lock() too - see
>    netif_receive_skb_list_internal().
> 
> So in short, I think dsa_slave_xmit() runs under rcu_read_lock() under
> all circumstances of transmitting a frame on behalf of the bridge.
> 
> So you can provide RCU-protected data to the tagger in dp->priv.
> For example, you can create a reference-counted bridge structure in
> which you keep the allocation port mask as an unsigned long.
> I'm not too good with arch and compiler stuff, but I think unsigned long
> assignments are guaranteed to be atomic on any arch, so when the fast
> path reads that unsigned long it is guaranteed to not see an in-between
> value even if the slow path is updating it simultaneously.
> You create this bridge structure for the first port that enters a
> bridge, and you destroy it for the last port that exits it. All ports
> under the same bridge have a dp->priv that points to the same bridge
> structure. But you need to access it using rcu_dereference_pointer from
> the fast path, of course. And you need to make sure that you free it
> under RCU as well, otherwise existing readers might see it disappear
> from under their feet when the last port leaves a bridge.
> 
> Here is where things get a bit fuzzy. DSA gets notified of leaving a
> bridge through del_nbp -> netdev_upper_dev_unlink -> ... ->
> dsa_port_bridge_leave. If you look through what else del_nbp does,
> you'll see a call to nbp_vlan_flush() -> synchronize_rcu(). So it will
> wait for at least one grace period for all existing RCU readers to
> finish (including the xmit and rcv data paths) before that function
> returns. So in turn, you will see that netdev_upper_dev_unlink is only
> called after an RCU grace period, at that point there should be no xmit
> from that bridge port anymore, so you shouldn't see any
> skb->offload_fwd_mark = true during xmit racing with the actual
> dsa_port_bridge_leave.
> 
> But! if CONFIG_BRIDGE_VLAN_FILTERING=n, the shim implementation of
> nbp_vlan_flush() will kick in, and that doesn't call synchronize_rcu(),
> so AFAICS nobody will wait for a grace period. So please don't rely on
> the bridge waiting for you, and test with VLAN filtering compiled out
> too. Just put a synchronize_net() somewhere before freeing the bridge
> structure, or just use call_rcu()/kfree_rcu() to free it asynchronously,
> and you should be fine.
> 
> As for why rely on the bridge holding the rcu_read_lock(), but not on
> waiting? Well, because we kind of need to be atomic with the bridge
> itself, otherwise if the bridge releases rcu_read_lock() before
> dev_queue_xmit() we are no longer in the same atomic section, yet we see
> skb->offload_fwd_mark set to true. In the worst case, RCU might even
> report quiescent states on all CPUs in that meantime, and somebody might
> decide to free your bridge private data. Not nice.
> 
> I would make this allowance port mask be part of the Realtek driver
> first. If it proves to be useful more generally we could consider moving
> it to the DSA core.
> 
> Hope this helps and is not overwhelming.
> 

It's not overwhelming but very helpful indeed. Not that I understand 
everything you wrote now, but I will check out the numerous references 
you have made and see if I can figure this out. As usual I will be back 
on the mailing list if I have some further questions. Thanks a lot!

	Alvin