lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <88f37f14-54a5-4f9a-ca76-94a8e54769f2@blackwall.org>
Date:   Wed, 13 Jul 2022 07:06:40 +0300
From:   Nikolay Aleksandrov <razor@...ckwall.org>
To:     Jonathan Toppins <jtoppins@...hat.com>, netdev@...r.kernel.org
Cc:     Long Xin <lxin@...hat.com>, Jay Vosburgh <j.vosburgh@...il.com>,
        Veaceslav Falico <vfalico@...il.com>,
        Andy Gospodarek <andy@...yhouse.net>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Jonathan Corbet <corbet@....net>, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [RESEND PATCH net-next v4] bond: add mac filter option for
 balance-xor

On 7/11/22 15:07, Jonathan Toppins wrote:
> Implement a MAC filter that prevents duplicate frame delivery when
> handling BUM traffic. This attempts to partially replicate OvS SLB
> Bonding[1] like functionality without requiring significant change
> in the Linux bridging code.
> 
> A typical network setup for this feature would be:
> 
>              .--------------------------------------------.
>              |         .--------------------.             |
>              |         |                    |             |
>         .-------------------.               |             |
>         |    | Bond 0  |    |               |             |
>         | .--'---. .---'--. |               |             |
>    .----|-| eth0 |-| eth1 |-|----.    .-----+----.   .----+------.
>    |    | '------' '------' |    |    | Switch 1 |   | Switch 2  |
>    |    '---,---------------'    |    |          +---+           |
>    |       /                     |    '----+-----'   '----+------'
>    |  .---'---.    .------.      |         |              |
>    |  |  br0  |----| VM 1 |      |      ~~~~~~~~~~~~~~~~~~~~~
>    |  '-------'    '------'      |     (                     )
>    |      |        .------.      |     ( Rest of Network     )
>    |      '--------| VM # |      |     (_____________________)
>    |               '------'      |
>    |  Host 1                     |
>    '-----------------------------'
> 
> Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
> bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
> can assume bond0, br1, and hosts VM1 to VM# are all contained in a
> single box, as depicted. Interfaces eth0 and eth1 provide redundant
> connections to the data center with the requirement to use all bandwidth
> when the system is functioning normally. Switch 1 and Switch 2 are
> physical switches that do not implement any advanced L2 management
> features such as MLAG, Cisco's VPC, or LACP.
> 
> Combining this feature with vlan+srcmac hash policy allows a user to
> create an access network without the need to use expensive switches that
> support features like Cisco's VCP.
> 
> [1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding
> 
> Co-developed-by: Long Xin <lxin@...hat.com>
> Signed-off-by: Long Xin <lxin@...hat.com>
> Signed-off-by: Jonathan Toppins <jtoppins@...hat.com>
> ---
> 
> Notes:
>      v2:
>       * dropped needless abstraction functions and put code in module init
>       * renamed variable "rc" to "ret" to stay consistent with most of the
>         code
>       * fixed parameter setting management, when arp-monitor is turned on
>         this feature will be turned off similar to how miimon and arp-monitor
>         interact
>       * renamed bond_xor_recv to bond_mac_filter_recv for a little more
>         clarity
>       * it appears the implied default return code for any bonding recv probe
>         must be `RX_HANDLER_ANOTHER`. Changed the default return code of
>         bond_mac_filter_recv to use this return value to not break skb
>         processing when the skb dev is switched to the bond dev:
>           `skb->dev = bond->dev`
>      
>      v3: Nik's comments
>       * clarified documentation
>       * fixed inline and basic reverse Christmas tree formatting
>       * zero'ed entry in mac_create
>       * removed read_lock taking in bond_mac_filter_recv
>       * made has_expired() atomic and removed critical sections
>         surrounding calls to has_expired(), this also removed the
>         use-after-free that would have occurred:
>             spin_lock_irqsave(&entry->lock, flags);
>                 if (has_expired(bond, entry))
>                     mac_delete(bond, entry);
>             spin_unlock_irqrestore(&entry->lock, flags); <---
>       * moved init/destroy of mac_filter_tbl to bond_open/bond_close
>         this removed the complex option dependencies, the only behavioural
>         change the user will see is if the bond is up and mac_filter is
>         enabled if they try and set arp_interval they will receive -EBUSY
>       * in bond_changelink moved processing of mac_filter option just below
>         mode processing
>      
>      v4:
>       * rebase to latest net-next
>       * removed rcu_read_lock() call in bond_mac_insert()
>       * used specific spin_lock_{}() calls instead of the irqsave version
>       * Outstanding comments from Nik:
>         https://lore.kernel.org/netdev/4c9db6ac-aa24-2ca2-3e44-18cfb23ac1bc@blackwall.org/
>         - old version of the patch still under discussion
>           https://lore.kernel.org/netdev/d2696dab-2490-feb5-ccb2-96906fc652f0@redhat.com/
>           * response: it has been over a month now and no new comments have come in
>             so I am assuming there is nothing left to the discussion

Not really, I didn't have time to adjust my solution, but it's not hard
to solve the multicast ingress. More below.

>         - What if anyone decides at some point 5 seconds are not enough or too much?
>           * response: I think making that configurable at a later time should not
>             prevent the inclusion of the initial feature. >         - This bit is pointless, you still have races even though not 
critical
>           * response: if there are races please point them out simply making a
>             comment about an enum doesn't show the race.

You had race conditions about the flag, now that you always take the 
lock you no longer have them, but also the flag is useless.

>         - This is not scalable at all, you get the lock even to update the
>           expire on *every* packet, most users go for L3 or L3+L4 and will hit
>           this with different flows on different CPUs.
>           * response: we take the lock to update the value, so that receive
>             traffic is not dropped. Further any implementation, bpf or nft,
>             will also have to take locks to update MAC entries. If there is a
>             specific locking order that will be less bad, I would appreciate
>             that discussion. Concerning L3 or L3+L4 hashing this is not the
>             data I have, in fact the most utilized hash is layer2.
> 

Yes, there are more scalable alternatives that are already implemented
in nft and eBPF and you can get them for free if you go with some of the
options below.

To the point, tbh I'm amazed we're still discussing applying this change :)

anyway here are my suggestions:

a) if you insist on adding some new code then at least add only
what is missing, i.e. add a xor mode option to limit ingress mcast only
to the active slave, then use the nftables mac filtering solution that I
gave you and you'll have the best of both worlds, you'll be able to dump
the current macs, edit them if needed, control the filtering time and
have the full nft flexibility, also it will scale much better than this

b) write a 3-4 line bash script that allows mcast only through the 
active slave, and again use the nft solution I gave you for the rest

c) extend the eBPF program I wrote to do mcast filtering, again it'll 
scale much better

(tentative) d) figure out a way to solve the mcast ingress entirely with 
nft, I suspect it's possible but I don't have time to waste to prove it

I don't plan on reviewing the patch in this form because I don't think
it should be applied at all since all of this can be easily implemented
with the available tools, but I won't officially nack it either.
That would be the maintainers' decision.

Cheers,
  Nik

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ