linux-kernel - Re: understanding switchdev notifications

Message-ID: <d3244cef-9c6b-4bba-b184-4139f12224df@alliedtelesis.co.nz>
Date: Thu, 15 Aug 2024 10:18:23 +1200
From: Chris Packham <chris.packham@...iedtelesis.co.nz>
To: Tobias Waldekranz <tobias@...dekranz.com>, netdev
 <netdev@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: understanding switchdev notifications

Hi Tobias,

On 14/08/24 18:54, Tobias Waldekranz wrote:
> On Thu, Aug 08, 2024 at 12:48, Chris Packham <chris.packham@...iedtelesis.co.nz> wrote:
>> Hi,
>>
>> I'm trying to get to grips with how the switchdev notifications are
>> supposed to be used when developing a switchdev driver.
>>
>> I have been reading through
>> https://www.kernel.org/doc/html/latest/networking/switchdev.html which
>> covers a few things but doesn't go into detail around the notifiers that
>> one needs to implement for a new switchdev driver (which is probably
>> very dependent on what the hardware is capable of).
>>
>> Specifically right now I'm looking at having a switch port join a vlan
>> aware bridge. I have a configuration something like this
>>
>>       ip link add br0 type bridge vlan_filtering 1
>>       ip link set sw1p5 master br0
>>       ip link set sw1p1 master br0
>>       bridge vlan add vid 2 dev br0 self
>>       ip link add link br0 br0.2 type vlan id 2
>>       ip addr add dev br0.2 192.168.2.1/24
>>       bridge vlan add vid 2 dev sw1p5 pvid untagged
>>       bridge vlan add vid 2 dev sw1p1
>>       ip link set sw1p5 up
>>       ip link set sw1p1 up
>>       ip link set br0 up
>>       ip link set br0.2 up
>>
>> Then I'm testing by sending a ping to a nonexistent host on the
>> 192.168.2.0/24 subnet and looking at the traffic with tcpdump on another
>> device connected to sw1p5.
>>
>> I'm a bit confused about how I should be calling
>> switchdev_bridge_port_offload(). It takes two netdevs (brport_dev and
>> dev) but as far as I've been able to see all the callers end up passing
>> the same netdev for both of these (some create a driver specific brport
>> but this still ends up with brport->dev and dev being the same object).
> In the simple case when a switchport is directly attached to a bridge,
> brport_dev and dev will be the same. If the attachment is indirect, via
> a bond for example, they will differ:
>
>         br0
>         /
>      bond0
>     /    \
> sw1p1  sw1p5
>
> In the setup above, the bridge has no reference to any sw*p* interfaces,
> all generated notifications will reference "bond0". By including the
> switchdev port in the message back to the bridge, it can perform
> validation on the setup; e.g. that bond0 is not made up of interfaces
> from different hardware domains.

Ah that makes sense. I haven't got to bonds yet so I hadn't hit that case.
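
The validation described in the quoted reply can be pictured with a small
user-space model. This is only a sketch, not kernel code: struct model_dev,
its hw_domain field and model_same_hw_domain() are hypothetical names made up
for illustration, and the model assumes every lower device of the bridge port
is a switch port carrying a hardware domain number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a net_device; NOT the kernel structure. */
struct model_dev {
	const char *name;
	int hw_domain;                     /* which switch ASIC owns this port */
	const struct model_dev *lower[4];  /* members, if this is a bond */
	size_t n_lower;                    /* 0 for a plain switch port */
};

/*
 * Model of the check the bridge can perform once it is told which switch
 * ports sit behind brport_dev: all of them must share one hardware domain.
 */
static bool model_same_hw_domain(const struct model_dev *brport_dev)
{
	assert(brport_dev != NULL);

	/* Direct attachment: brport_dev and dev are the same device. */
	if (brport_dev->n_lower == 0)
		return true;

	/* Indirect attachment: compare every member against the first one. */
	int domain = brport_dev->lower[0]->hw_domain;

	for (size_t i = 1; i < brport_dev->n_lower; i++) {
		if (brport_dev->lower[i]->hw_domain != domain)
			return false;
	}
	return true;
}
```

In this model a bond0 made of sw1p1 and sw1p5 from the same switch passes,
while a bond mixing ports from two different switches is rejected.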

>> I've figured out that I need to set tx_fwd_offload=true so that the
>> bridge software only sends one packet to the hardware. That makes sense
>> as a way of saying that my hardware can take care of sending the packet
>> out the right ports.
>>
>> I do have a problem that what I get from the bridge has a vlan tag
>> inserted (which makes sense in sw when the packet goes from br0.2 to
>> br0). But I don't actually need it as the hardware will insert a tag for
>> me if the port is set up for egress tagging. I can shuffle the Ethernet
>> header up but I was wondering if there was a way of telling the bridge
>> not to insert the tag?
> Signaling tx_fwd_offload=true means assuming responsibility for
> delivering each packet to all ports that the bridge would otherwise have
> sent individual skbs for.
>
> Let's expand your setup slightly, and see why you need the tag:
>
>     br0.2 br0.3
>         \ /
>         br0
>        / |  \
>       /  |   \
> sw1p1 sw1p3  sw1p5
> (2U)  (3U)  (2T,3T)
>
> sw1p5 is now a trunk. We can trigger an ARP broadcast to be sent out
> either via br0.2 or br0.3, depending on the subnet we choose to target.
>
> Your driver will receive a single skb to transmit, and skb->dev can be
> set to any of sw1p{1,3,5} depending on config order, FDB entries
> (i.e. the order of previously received packets) etc., and is thus
> nondeterministic.
>
> So presumably, even though you might need to remove the 802.1Q tag from
> the frame, you need some way of tagging the packet with the correct VID
> in order for the hardware to do the right thing; possibly via a field in
> the vendor's hardware specific tag.

I did eventually find NETIF_F_HW_VLAN_CTAG_TX which stops the packet
data coming down to the switch driver with a vlan tag inserted. The
intended egress vlan is still available via skb_vlan_tag_get_id() so I
can add it to the hardware-specific tag (which for me is part of the TX
DMA descriptor) and I don't need to shuffle any bytes around, which is
great.

>> Finally I'm confused about the atomic_nb/blocking_nb parameters. Some
>> drivers just pass NULL and others pass the same notifier blocks that
>> they've already registered with
>> register_switchdev_notifier()/register_switchdev_blocking_notifier(). If
>> notifiers are registered why does switchdev_bridge_port_offload() take
>> them as parameters?
> Because when you add a port to the bridge, lots of stuff that you want
> to offload might already have been configured. E.g., imagine that you
> were to add vlan 2 to br0 before adding the switchports; then you
> probably need those events to be replayed to the new ports in order to
> add your CPU-facing switchport to vlan 2. However, we do not want to
> bother existing bridge members with duplicated events (and risk messing
> up any reference counters they might maintain for these
> objects). Therefore we bypass the standard notifier calls and "unicast"
> the replay events only to the driver for the port being added.

This part I still don't get. I understand that there may be scenarios 
where switchdev decides it needs to unicast events to a specific device. 
But why does the caller of switchdev_bridge_port_offload() need to make 
that distinction?
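
For reference, the difference between a chain notification and the "unicast"
replay from the quoted reply can be modelled in user space as below. This is
a sketch, not the kernel implementation: struct model_nb and the model_*()
functions are hypothetical, and treating a NULL block as "no replay wanted"
is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a notifier block. */
struct model_nb {
	int (*call)(struct model_nb *nb, unsigned long event);
	int events_seen;
	struct model_nb *next;
};

static struct model_nb *model_chain;

/* Registering puts the block on a chain shared by all drivers. */
static void model_register(struct model_nb *nb)
{
	assert(nb != NULL);
	nb->next = model_chain;
	model_chain = nb;
}

/* A normal notification reaches every registered block. */
static void model_broadcast(unsigned long event)
{
	for (struct model_nb *nb = model_chain; nb; nb = nb->next)
		nb->call(nb, event);
}

/*
 * A replay calls one block directly, so the caller has to be handed a
 * pointer to it; the chain alone does not say which block belongs to
 * the port being added. NULL is modelled as "skip the replay".
 */
static void model_replay_to(struct model_nb *nb, unsigned long event)
{
	if (nb)
		nb->call(nb, event);
}

/* Example handler: just count how many events were delivered. */
static int model_count_event(struct model_nb *nb, unsigned long event)
{
	(void)event;
	nb->events_seen++;
	return 0;
}
```

In the model, a broadcast increments the counter of every registered block,
whereas a replay increments only the counter of the block that was passed in,
which is why existing bridge members are not disturbed.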

>> Thanks,
>> Chris