Message-ID: <20211005132912.bkjogbbv7gpf6ue7@skbuf>
Date:   Tue, 5 Oct 2021 13:29:13 +0000
From:   Vladimir Oltean <vladimir.oltean@....com>
To:     Alvin Šipraga <ALSI@...g-olufsen.dk>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Florian Fainelli <f.fainelli@...il.com>,
        Andrew Lunn <andrew@...n.ch>
Subject: Re: DSA: some questions regarding TX forwarding offload

On Tue, Oct 05, 2021 at 12:06:38PM +0000, Alvin Šipraga wrote:
> The FDB isolation mechanism in my switch seems to be pretty good. As
> long as I can pass along *some* information from the switch driver to
> the tagging driver - namely the "allowance port mask" for a given bridge
> - I think I should be able to achieve full isolation between up to 7
> VLAN-aware bridges, with no restrictions on the number of VLANs per
> bridge nor on the sharing of VLANs across bridges.
>
> Here is a quick summary of the relevant behaviour of the switch:
>
> VLANs programmed on the switch can be set to either SVL or IVL, on a
> per-VLAN basis. This affects how learned MAC addresses are searched
> for/saved in the hardware FDB:
>
>    - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
>    - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.
>
> EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.
>
> Unlike the FID - which is programmed per-VLAN - the EFID is programmed
> per-port. When a port has learning enabled and it receives an ingress
> frame with a given VID and MAC SA, it will search in the hardware FDB
> with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if
> the entry is not found, it will create a new one. This allows the switch
> to learn the same {VID, MAC SA} pair on two separate ports, provided
> those ports have different EFIDs.

So you are basically saying that for FDB lookups, the EFID is like a
3-bit extension of the 12-bit VID, practically emulating a 32K VLAN space?
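
Just to make sure we mean the same thing, as a mental model only (not
necessarily how the silicon stores the key):

	/* The hardware FDB key {VID, MAC, EFID} behaves, for lookup
	 * purposes, like a {MAC, extended VID} key, where the extended
	 * VID has 12 + 3 = 15 bits.
	 */
	u16 ext_vid = ((u16)efid << 12) | (vid & 0xfff);	/* 0..32767 */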

> > When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
> > is not taken into consideration? Are packets flooded towards the entire
> > set of ports in the allowance port mask that are members of VLAN N?
>
> My choice of words was imprecise. I do not mean to say that the MAC DA
> is ignored. If there exists a suitable entry in the hardware FDB for
> that MAC DA and VLAN n, the switch will _not_ flood the entire set of
> ports in the allowance port mask that are members of VLAN n. Instead, it
> _will_ respect the information contained in the hardware FDB and forward
> the packet only on the given port(s) in the FDB. Note that the switch
> will still respect the allowance port mask, so this is something like:
> forwarding_portmask = (fdb_portmask & allowance_portmask).
                         ^^^^^^^^^^^^
                         and this portion should take the EFID into
                         consideration, should it not?

> > I don't want to answer any of these questions until I understand how
> > your hardware intends the FID and FID_EN bits from the DSA header to
> > be used. The FID only has 2 bits, so it is clear to me that it doesn't
> > have the same understanding of the term as mv88e6xxx, given that the
> > Realtek switch has up to 4 FIDs while Marvell has up to 4K.
>
> I came to the same conclusion until I started to play around with this,
> only to discover that the Realtek documentation is wrong.
>
> First, so that we are on the same page, here again is the relevant part
> of the CPU tag we are talking about:
>
> 0                                  7|8                                 15
> |-----------------------------------+-----------------------------------|
> |                               (16-bit)                                |
> |                       Realtek EtherType [0x8899]                      |
> |-----------------------------------+-----------------------------------|
> |              (8-bit)              |              (8-bit)              |
> |          Protocol [0x04]          |              REASON               |
> |-----------------------------------+-----------------------------------|
> |   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
> | FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
> |-----------------------------------+-----------------------------------|
>   ^^^^^^^^^^^^^^^^^^^^
>       look here
>
> What actually appears to be the case - at least in the IVL case I
> described above - is that the fields FID_EN and FID should rather be
> named EFID_EN and EFID. Moreover, the reserved bit X between the two
> fields is actually an extension of the newly-named EFID field. So things
> look more like this:
>
> |-----------------------------------+-----------------------------------|
> |   (1)   |    (3)   |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
> | EFID_EN |   EFID   | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
> |-----------------------------------+-----------------------------------|
>   ^^^^^^^^^^^^^^^^^^^^
>       look here
>
> If EFID_EN=1 and LEARN_DIS=0 and learning is enabled on the CPU port,
> then the switch will learn the MAC SA of the frame and enter it into the
> FDB with the corresponding VID (according to the 802.1Q tag) and the
> corresponding EFID (according to the CPU tag and the EFID field). This
> is super useful because it enables the strategy I outlined above, and
> also avoids having to rely on the assisted_learning_on_cpu_port flag.
>
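
If I am reading your corrected layout right, then on xmit the tagger
would fill in that third 16-bit word along these lines (just a sketch;
MSB-first bit numbering as in your diagram, and the names are mine):

	/* Third 16-bit word of the CPU tag (after EtherType and
	 * Protocol/REASON): EFID_EN in the MSB, then the 3-bit EFID.
	 * PRI_EN/PRI/KEEP/LEARN_DIS are left at 0 here, i.e. default
	 * priority and learning enabled.
	 */
	__be16 word2 = htons(BIT(15) | ((efid & 0x7) << 12));
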
> What this doesn't do is help me with the actual forwarding decision of
> the frame. I hoped that by setting EFID_EN=1, EFID=k, ALLOW=1, and a
> "catch all" allowance port mask of 0 or ~0 (I tested both), the switch
> would only consider forwarding the frame to ports with EFID == k. This
> is not the case, however, and it seems that the EFID_EN/EFID fields of
> the CPU tag only affect how the switch learns from this frame. Hence my
> need to "tune" the allowance port mask.

So the FDB is looked up twice per frame: once on ingress for learning
and once on egress for selecting the destination port mask.

Practically, are you saying that the switch loses the EFID information
between the ingress and the egress stage, since the destination port
mask is selected based on a key constructed with "don't care" in the EFID bits?
Strange.

> In case you are suspicious when I say the documentation is wrong: I
> tested this behaviour quite heavily in order to come to this conclusion.
> The EFID field is indeed 3 bits - and this matches the definition
> of EFID in the datasheet of the chip - and by setting it to some 3-bit
> value like 7, I see this reflected in the hardware FDB after dumping it.

Ok, this sounds reasonable.

> > You should ask yourself not only how to prevent leakage, but also the
> > flip side, how should I pass the packet to the switch in such a way that
> > it will learn its MAC SA in the right FID, assuming that you go with FDB
> > isolation first and figure that out. Once that question is answered, you
> > can in principle specify an allowance port mask which is larger than
> > needed (the entire mask of user ports) and the switch should only
> > forward it towards the ports belonging to the same FID, which are
> > roughly equivalent to the ports under a specific bridge. You can
> > create a mapping between a FID and dp->bridge_num. Makes sense or am I
> > completely off?
>
> Right! This was exactly my plan (save for s/FID/EFID/), but I wanted to
> discuss the situation with you because I know that you have some planned
> changes for net-next and I was not sure if this method is considered
> acceptable in DSA land. I hope the explanation above clarifies the
> situation a bit. I will go ahead now and try to implement this mapping
> between bridge_num and EFID, so that the tagging driver can look up the
> correct allowance port mask on xmit of a bridge frame.
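
(As for the bridge_num to EFID mapping itself, I wouldn't overthink it;
something as dumb as the sketch below should do, assuming EFID 0 stays
reserved for standalone ports - the name is made up:)

	/* Map a DSA bridge number onto one of the 7 usable non-zero
	 * EFIDs, keeping EFID 0 for standalone ports.  Purely
	 * illustrative - how bridge_num is allocated is DSA's business.
	 */
	static u8 my_rtl_bridge_efid(unsigned int bridge_num)
	{
		return (bridge_num % 7) + 1;
	}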

So in DSA we have a feature which is not very well liked, and you might
understand the reason once I explain it.

DSA tagging protocols are constructed to be agnostic of the switch they
are paired with. The reasoning was that, in the beginning, sending and
receiving a packet was a very simple operation: on receive you just
needed to extract the information present in the DSA tag and select a
net device based on it, and on xmit you just needed to construct a
simple tag from the info provided to you (which port must be targeted).
So there simply was no reason to couple them.

DSA is also quite the wild west in terms of hardware implementations.
We make almost everything optional, and the result is that some switches
go on to implement super advanced features while others lag behind.
Testing becomes very hard; sometimes different users don't even have the
same experience if they use hardware from different vendors.
From the maintainers' point of view there is a desire to at least keep
tagging protocols testable on generic hardware, for more than one
reason. First, you might want to analyze how a given Ethernet controller
will behave when coupled with a DSA switch you don't have (stuff like
hardware parsers comes to mind, with implications for RFS/RSS). Second,
you might simply want to debug stuff (we've had issues with VLANs dropped
by the RX filter of the DSA master when the switch used a tail tagging
protocol; lack of Ethernet frame padding in certain conditions; DSA
inheriting some master->vlan_features it shouldn't have, like TX TCP
checksum offloading with tail taggers). In these cases, we did not have
the hardware where the problem was reported, but nonetheless we could
simulate an environment with the problem. We did that by using the
mock-up dsa_loop driver, and hacking dsa_loop_get_protocol() to return
the tagging protocol we wanted to test.

So generally speaking, whatever a tagging protocol driver may do, it
should do something sensible when paired with dsa_loop too. We have come
to depend on that lack of coupling.

On the other hand, there are situations where decoupled drivers won't
really cut it. You can see this especially with the sja1105 driver and
its tagging protocol: there is even a state machine implemented for PTP
frames, with storage provided by the switch driver. I've no idea how
exactly the switches I directly maintain ended up being the quirkiest in
this regard. Anyway.

So there exists a "void * dp->priv" pointer which you can use to share
information between the switch driver and the tagging protocol driver.
Be careful though: from the tagger's perspective, dp->priv may or may
not be provided by the switch driver you expect (true, you would need
to hack up a driver like dsa_loop to return DSA_TAG_PROTO_RTL8_4, but
some crazy people like maintainers might do that). The compromise I've
found (not ideal by any means) is to access dp->priv only from code paths
that cannot be triggered with dsa_loop. For example, since dsa_loop does
not declare TX forwarding offload for the bridge, you could dereference
dp->priv inside an "if (skb->offload_fwd_mark)" on xmit and be fairly
sure you are coupled with your switch.
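
Concretely, I am thinking of something with this shape (a sketch only;
all the rtl8_4/my_rtl names are made up, only dsa_slave_to_port(),
dp->priv and skb->offload_fwd_mark are real things):

	/* Hypothetical helper for the tagger's xmit hook: returns the
	 * allowance port mask to put into the CPU tag, or an
	 * unrestricted mask when the frame is not TX-forwarded on
	 * behalf of the bridge (the only case dsa_loop can trigger).
	 * struct my_rtl_bridge is the per-bridge data described below.
	 */
	static unsigned long rtl8_4_allowance_mask(struct sk_buff *skb,
						   struct net_device *netdev)
	{
		struct dsa_port *dp = dsa_slave_to_port(netdev);
		struct my_rtl_bridge *br;

		if (!skb->offload_fwd_mark)
			return ~0UL;	/* not bridge TX offload: don't restrict */

		/* We are paired with the real switch driver here, since
		 * dsa_loop never offloads TX forwarding, and
		 * dsa_slave_xmit() runs under rcu_read_lock(), as
		 * discussed below.
		 */
		br = rcu_dereference(dp->priv);
		if (unlikely(!br))
			return ~0UL;

		return READ_ONCE(br->allow_mask);
	}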

Okay, so what can you put inside dp->priv? Mostly anything, as long as
you have a basic understanding of the data persistence rules.
The fast path races with the slow path and there is no locking between
them. However, the particular case of xmitting a packet on behalf of the
Linux bridge is protected by RCU:

- if the packet is sent through a socket opened on the bridge device
  itself, br_dev_xmit() holds rcu_read_lock() for the entire duration,
  including during br_dev_queue_push_xmit() which calls dev_queue_xmit
  for the DSA interface.

- if the packet is not terminated locally but forwarded, the execution
  context is that of the NET_RX softirq of the ingress interface, in
  which the bridge rx_handler runs (br_handle_frame). And the receive
  data path is under rcu_read_lock() too - see
  netif_receive_skb_list_internal().

So in short, I think dsa_slave_xmit() runs under rcu_read_lock() in all
circumstances where a frame is transmitted on behalf of the bridge.

So you can provide RCU-protected data to the tagger in dp->priv.
For example, you can create a reference-counted bridge structure in
which you keep the allowance port mask as an unsigned long.
I'm not too good with arch and compiler stuff, but I think unsigned long
assignments are guaranteed to be atomic on any arch, so when the fast
path reads that unsigned long it is guaranteed to not see an in-between
value even if the slow path is updating it simultaneously.
You create this bridge structure for the first port that enters a
bridge, and you destroy it for the last port that exits it. All ports
under the same bridge have a dp->priv that points to the same bridge
structure. But you need to access it using rcu_dereference() from
the fast path, of course. And you need to make sure that you free it
under RCU as well, otherwise existing readers might see it disappear
from under their feet when the last port leaves a bridge.
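
In code, roughly this shape (a sketch; all names are invented and error
handling is omitted):

	/* Per-bridge data shared between the switch driver (slow path)
	 * and the tagger (fast path) through dp->priv.
	 */
	struct my_rtl_bridge {
		unsigned long allow_mask;	/* allowance port mask */
		u8 efid;			/* 1..7; 0 = standalone */
		refcount_t refcount;		/* one ref per member port */
		struct rcu_head rcu;		/* for kfree_rcu() */
	};

	/* Slow path, under rtnl: the first member port allocates the
	 * structure, the following ones just take a reference.
	 */
	static struct my_rtl_bridge *my_rtl_bridge_get(struct my_rtl_bridge *br,
						       u8 efid)
	{
		if (br) {
			refcount_inc(&br->refcount);
			return br;
		}

		br = kzalloc(sizeof(*br), GFP_KERNEL);
		if (!br)
			return NULL;

		br->efid = efid;
		refcount_set(&br->refcount, 1);
		return br;
	}

Publishing it and updating the mask, also from the slow path, would be
something like:

	rcu_assign_pointer(dp->priv, br);
	WRITE_ONCE(br->allow_mask, new_mask);

(modulo the fact that dp->priv is a plain void pointer today and not
__rcu-annotated, so sparse won't help you much here). The WRITE_ONCE()
pairs with the READ_ONCE() in the tagger, on top of the natural
atomicity of word-sized assignments, so the fast path can never observe
a torn value.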

Here is where things get a bit fuzzy. DSA gets notified of leaving a
bridge through del_nbp -> netdev_upper_dev_unlink -> ... ->
dsa_port_bridge_leave. If you look through what else del_nbp does,
you'll see a call to nbp_vlan_flush() -> synchronize_rcu(). So it will
wait for at least one grace period for all existing RCU readers to
finish (including the xmit and rcv data paths) before that function
returns. In turn, you will see that netdev_upper_dev_unlink is only
called after an RCU grace period; at that point there should be no xmit
from that bridge port anymore, so you shouldn't see any
skb->offload_fwd_mark = true during xmit racing with the actual
dsa_port_bridge_leave.

But! If CONFIG_BRIDGE_VLAN_FILTERING=n, the shim implementation of
nbp_vlan_flush() will kick in, and that doesn't call synchronize_rcu(),
so AFAICS nobody will wait for a grace period. So please don't rely on
the bridge waiting for you, and test with VLAN filtering compiled out
too. Just put a synchronize_net() somewhere before freeing the bridge
structure, or just use call_rcu()/kfree_rcu() to free it asynchronously,
and you should be fine.
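
i.e. something like this on the way out (same invented names as above):

	/* Slow path, under rtnl: a port leaves its bridge. */
	static void my_rtl_bridge_put(struct dsa_port *dp)
	{
		struct my_rtl_bridge *br = dp->priv; /* rtnl vs other writers */

		/* Hide the bridge data from the tagger on this port. */
		RCU_INIT_POINTER(dp->priv, NULL);

		if (!refcount_dec_and_test(&br->refcount))
			return;

		/* Last member port gone: free only after all current
		 * rcu_read_lock() sections (xmit included) have finished,
		 * without relying on the bridge calling synchronize_rcu()
		 * for us.
		 */
		kfree_rcu(br, rcu);
	}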

As for why we rely on the bridge holding the rcu_read_lock(), but not on
it waiting for a grace period? Well, because we kind of need to be atomic
with the bridge itself: otherwise, if the bridge releases rcu_read_lock()
before dev_queue_xmit(), we are no longer in the same atomic section, yet
we see
skb->offload_fwd_mark set to true. In the worst case, RCU might even
report quiescent states on all CPUs in the meantime, and somebody might
decide to free your bridge private data. Not nice.

I would make this allowance port mask part of the Realtek driver first.
If it proves to be useful more generally, we could consider moving
it to the DSA core.

Hope this helps and is not overwhelming.
