[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <95419cc0-9ef7-9956-3fe3-b4304a45be33@blackwall.org>
Date: Mon, 16 May 2022 20:24:53 +0300
From: Nikolay Aleksandrov <razor@...ckwall.org>
To: Jonathan Toppins <jtoppins@...hat.com>, netdev@...r.kernel.org
Cc: toke@...hat.com, Long Xin <lxin@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Jonathan Corbet <corbet@....net>,
Jay Vosburgh <j.vosburgh@...il.com>,
Veaceslav Falico <vfalico@...il.com>,
Andy Gospodarek <andy@...yhouse.net>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v3] bond: add mac filter option for balance-xor
On 16/05/2022 20:22, Nikolay Aleksandrov wrote:
> On 16/05/2022 17:06, Jonathan Toppins wrote:
>> On 5/15/22 02:32, Nikolay Aleksandrov wrote:
>>> On 15/05/2022 00:41, Nikolay Aleksandrov wrote:
>>>> On 13/05/2022 20:43, Jonathan Toppins wrote:
>>>>> Implement a MAC filter that prevents duplicate frame delivery when
>>>>> handling BUM traffic. This attempts to partially replicate OvS SLB
>>>>> Bonding[1] like functionality without requiring significant change
>>>>> in the Linux bridging code.
>>>>>
>>>>> A typical network setup for this feature would be:
>>>>>
>>>>> .--------------------------------------------.
>>>>> | .--------------------. |
>>>>> | | | |
>>>>> .-------------------. | |
>>>>> | | Bond 0 | | | |
>>>>> | .--'---. .---'--. | | |
>>>>> .----|-| eth0 |-| eth1 |-|----. .-----+----. .----+------.
>>>>> | | '------' '------' | | | Switch 1 | | Switch 2 |
>>>>> | '---,---------------' | | +---+ |
>>>>> | / | '----+-----' '----+------'
>>>>> | .---'---. .------. | | |
>>>>> | | br0 |----| VM 1 | | ~~~~~~~~~~~~~~~~~~~~~
>>>>> | '-------' '------' | ( )
>>>>> | | .------. | ( Rest of Network )
>>>>> | '--------| VM # | | (_____________________)
>>>>> | '------' |
>>>>> | Host 1 |
>>>>> '-----------------------------'
>>>>>
>>>>> Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
>>>>> bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
>>>>> can assume bond0, br1, and hosts VM1 to VM# are all contained in a
>>>>> single box, as depicted. Interfaces eth0 and eth1 provide redundant
>>>>> connections to the data center with the requirement to use all bandwidth
>>>>> when the system is functioning normally. Switch 1 and Switch 2 are
>>>>> physical switches that do not implement any advanced L2 management
>>>>> features such as MLAG, Cisco's VPC, or LACP.
>>>>>
>>>>> Combining this feature with vlan+srcmac hash policy allows a user to
>>>>> create an access network without the need to use expensive switches that
>>>>> support features like Cisco's VCP.
>>>>>
>>>>> [1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding
>>>>>
>>>>> Co-developed-by: Long Xin <lxin@...hat.com>
>>>>> Signed-off-by: Long Xin <lxin@...hat.com>
>>>>> Signed-off-by: Jonathan Toppins <jtoppins@...hat.com>
>>>>> ---
>>>>>
>>>>> Notes:
>>>>> v2:
>>>>> * dropped needless abstraction functions and put code in module init
>>>>> * renamed variable "rc" to "ret" to stay consistent with most of the
>>>>> code
>>>>> * fixed parameter setting management, when arp-monitor is turned on
>>>>> this feature will be turned off similar to how miimon and arp-monitor
>>>>> interact
>>>>> * renamed bond_xor_recv to bond_mac_filter_recv for a little more
>>>>> clarity
>>>>> * it appears the implied default return code for any bonding recv probe
>>>>> must be `RX_HANDLER_ANOTHER`. Changed the default return code of
>>>>> bond_mac_filter_recv to use this return value to not break skb
>>>>> processing when the skb dev is switched to the bond dev:
>>>>> `skb->dev = bond->dev`
>>>>> v3: Nik's comments
>>>>> * clarified documentation
>>>>> * fixed inline and basic reverse Christmas tree formatting
>>>>> * zero'ed entry in mac_create
>>>>> * removed read_lock taking in bond_mac_filter_recv
>>>>> * made has_expired() atomic and removed critical sections
>>>>> surrounding calls to has_expired(), this also removed the
>>>>> use-after-free that would have occurred:
>>>>> spin_lock_irqsave(&entry->lock, flags);
>>>>> if (has_expired(bond, entry))
>>>>> mac_delete(bond, entry);
>>>>> spin_unlock_irqrestore(&entry->lock, flags); <---
>>>>> * moved init/destroy of mac_filter_tbl to bond_open/bond_close
>>>>> this removed the complex option dependencies, the only behavioural
>>>>> change the user will see is if the bond is up and mac_filter is
>>>>> enabled if they try and set arp_interval they will receive -EBUSY
>>>>> * in bond_changelink moved processing of mac_filter option just below
>>>>> mode processing
>>>>>
>>>>> Documentation/networking/bonding.rst | 20 +++
>>>>> drivers/net/bonding/Makefile | 2 +-
>>>>> drivers/net/bonding/bond_mac_filter.c | 201 ++++++++++++++++++++++++++
>>>>> drivers/net/bonding/bond_mac_filter.h | 37 +++++
>>>>> drivers/net/bonding/bond_main.c | 30 ++++
>>>>> drivers/net/bonding/bond_netlink.c | 13 ++
>>>>> drivers/net/bonding/bond_options.c | 81 +++++++++--
>>>>> drivers/net/bonding/bonding_priv.h | 1 +
>>>>> include/net/bond_options.h | 1 +
>>>>> include/net/bonding.h | 3 +
>>>>> include/uapi/linux/if_link.h | 1 +
>>>>> 11 files changed, 373 insertions(+), 17 deletions(-)
>>>>> create mode 100644 drivers/net/bonding/bond_mac_filter.c
>>>>> create mode 100644 drivers/net/bonding/bond_mac_filter.h
>>>>>
>>>>
>>> [snip]
>>>
>>> The same problem solved using a few nftables rules (in case you don't want to load eBPF):
>>> $ nft 'add table netdev nt'
>>> $ nft 'add chain netdev nt bond0EgressFilter { type filter hook egress device bond0 priority 0; }'
>>> $ nft 'add chain netdev nt bond0IngressFilter { type filter hook ingress device bond0 priority 0; }'
>>> $ nft 'add set netdev nt macset { type ether_addr; flags timeout; }'
>>> $ nft 'add rule netdev nt bond0EgressFilter set update ether saddr timeout 5s @macset'
>>> $ nft 'add rule netdev nt bond0IngressFilter ether saddr @macset counter drop'
>>>
>>
>> I get the following when trying to apply this on a fedora 35 install.
>>
>> root@...ora ~]# ip link add bond0 type bond mode balance-xor xmit_hash_policy vlan+srcmac
>> [root@...ora ~]# nft 'add table netdev nt'
>> [root@...ora ~]# nft 'add chain netdev nt bond0EgressFilter { type filter hook egress device bond0 priority 0; }'
>> Error: unknown chain hook
>> add chain netdev nt bond0EgressFilter { type filter hook egress device bond0 priority 0; }
>> ^^^^^^
>> [root@...ora ~]# uname -a
>> Linux fedora 5.17.5-200.fc35.x86_64 #1 SMP PREEMPT Thu Apr 28 15:41:41 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>>
>
> Well, take it up with the Fedora nftables package maintainer. :)
>
> Your nftables version is old (I'd guess <1.0.1):
> commit 510c4fad7e78
> Author: Lukas Wunner <lukas@...ner.de>
> Date: Wed Mar 11 13:20:06 2020 +0100
>
> src: Support netdev egress hook
>
> $ git tag --contains 510c4fad7e78f
> v1.0.1
> v1.0.2
>
> I just tested it[1] on Linux 5.16.18-200.fc35.x86_64 #1 SMP PREEMPT Mon Mar 28 14:10:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>
> Cheers,
> Nik
>
> [1] You can clearly see the dynamically learned mac on egress (52:54:00:23:5f:13) and traffic
> with that source is now blocked on ingress.
>
> $ nft -a list table netdev nt
> set macset { # handle 10
> type ether_addr
> size 65535
> flags timeout
> elements = { 52:54:00:23:5f:13 timeout 5s expires 4s192ms }
> }
>
> chain bond0EgressFilter { # handle 8
> type filter hook egress device "bond0" priority filter; policy accept;
> update @macset { ether saddr timeout 5s } # handle 11
> }
>
> chain bond0IngressFilter { # handle 9
> type filter hook ingress device "bond0" priority filter; policy accept;
> }
>
Oops, pasted an old list before adding the ingress rule:
chain bond0IngressFilter { # handle 9
type filter hook ingress device "bond0" priority filter; policy accept;
ether saddr @macset counter packets 120 bytes 7680 drop # handle 15
}
Powered by blists - more mailing lists