Message-ID: <1845d832-e36c-fff1-0ea6-1d7aa290919f@linux.ibm.com>
Date: Thu, 31 Mar 2022 14:07:20 +0200
From: Alexandra Winter <wintera@...ux.ibm.com>
To: Nikolay Aleksandrov <razor@...ckwall.org>,
Jay Vosburgh <jay.vosburgh@...onical.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>,
Paolo Abeni <pabeni@...hat.com>,
Hangbin Liu <liuhangbin@...il.com>, netdev@...r.kernel.org,
linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>,
Roopa Prabhu <roopa@...dia.com>,
bridge@...ts.linux-foundation.org,
Ido Schimmel <idosch@...dia.com>, Jiri Pirko <jiri@...dia.com>
Subject: Re: [PATCH net-next v2] veth: Support bonding events
On 31.03.22 12:33, Nikolay Aleksandrov wrote:
> On 31/03/2022 12:59, Alexandra Winter wrote:
>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>> Bonding drivers generate specific events during failover that trigger
>>> switch updates. When a veth device is attached to a bridge with a
>>> bond interface, we want external switches to learn about the veth
>>> devices as well.
>>>
>>> Example:
>>>
>>>  | veth_a2 | veth_b2 | veth_c2 |
>>>  ------o-----------o----------o------
>>>         \          |         /
>>>          o         o        o
>>>       veth_a1   veth_b1   veth_c1
>>>      -------------------------
>>>      |         bridge        |
>>>      -------------------------
>>>                bond0
>>>               /     \
>>>            eth0     eth1
>>>
>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>> external hardware attached to eth1.
>>
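
(To make the intent of the patch concrete, here is a rough sketch only, not the actual v2 code;
veth_peer_dev() is a hypothetical helper standing in for the lookup of the veth peer device:)

	/* Rough sketch of the idea, not the actual patch.
	 * veth_peer_dev() is a placeholder helper for finding the peer.
	 */
	static void veth_handle_bonding_failover(struct net_device *bond,
						 struct net_device *veth)
	{
		struct net_device *peer;

		/* The bond and the veth must be ports of the same bridge. */
		if (!netif_is_bridge_port(bond) || !netif_is_bridge_port(veth))
			return;
		if (netdev_master_upper_dev_get(bond) !=
		    netdev_master_upper_dev_get(veth))
			return;

		peer = veth_peer_dev(veth);	/* placeholder helper */

		/* Don't propagate into another bridge (avoid loops/storms);
		 * netdev_notify_peers() makes the peer re-announce itself,
		 * e.g. via gratuitous ARP from the inet notifier handling.
		 */
		if (peer && !netif_is_bridge_port(peer))
			netdev_notify_peers(peer);
	}
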
>> On 30.03.22 21:15, Jay Vosburgh wrote:
>>> Jakub Kicinski <kuba@...nel.org> wrote:
>>>
>>>> On Wed, 30 Mar 2022 19:16:42 +0300 Nikolay Aleksandrov wrote:
>>>>>> Maybe opt-out? But assuming the event is only generated on
>>>>>> active/backup switch over - when would it be okay to ignore
>>>>>> the notification?
>>>>>
>>>>> Let me just clarify, so I'm sure I've not misunderstood you. Do you mean opt-out as in
>>>>> make it default on? IMO that would be a problem, large scale setups would suddenly
>>>>> start propagating it to upper devices which would cause a lot of unnecessary bcast.
>>>>> I meant enable it only if needed, and only on specific ports (second part is not
>>>>> necessary, could be global, I think it's ok either way). I don't think any setup
>>>>> which has many upper vlans/macvlans would ever enable this.
>>>>
>>>> That may be. I don't have a good understanding of scenarios in which
>>>> GARP is required and where it's not :) Goes without saying but the
>>>> default should follow the more common scenario.
>>>
>>> 	At least from the bonding failover perspective, the GARP is
>>> needed when there's a visible topology change (so peers learn the new
>>> path), a change in MAC address, or both. I don't think it's possible to
>>> determine from bonding which topology changes are visible, so any
>>> failover gets a GARP. The original intent as best I recall was to cover
>>> IP addresses configured on the bond itself or on VLANs above the bond.
>>>
>>> If I understand the original problem description correctly, the
>>> bonding failover causes the connectivity issue because the network
>>> segments beyond the bond interfaces don't share forwarding information
>>> (i.e., they are completely independent). The peer (end station or
>>> switch) at the far end of those network segments (where they converge)
>>> is unable to directly see that the "to bond eth0" port went down, and
>>> has no way to know that anything is awry, and thus won't find the new
>>> path until an ARP or forwarding entry for "veth_a2" (from the original
>>> diagram) times out at the peer out in the network.
>>>
>>>>>>> My concern was about Hangbin's alternative proposal to notify all
>>>>>>> bridge ports. I hope in my proposal I was able to avoid infinite loops.
>>>>>>
>>>>>> Possibly I'm confused as to where the notification for bridge master
>>>>>> gets sent..
>>>>>
>>>>> IIUC it bypasses the bridge and sends a notify peers for the veth peer so it would
>>>>> generate a grat arp (inetdev_event -> NETDEV_NOTIFY_PEERS).
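
(For reference, that path is roughly the following in net/ipv4/devinet.c; this is paraphrased from
memory, so the exact code may differ between kernel versions:)

	case NETDEV_NOTIFY_PEERS:
		/* Send gratuitous ARP to notify of link change */
		inetdev_send_gratuitous_arp(dev, in_dev);
		break;

	/* and the helper, roughly: */
	static void inetdev_send_gratuitous_arp(struct net_device *dev,
						struct in_device *in_dev)
	{
		const struct in_ifaddr *ifa;

		in_dev_for_each_ifa_rtnl(ifa, in_dev)
			arp_send(ARPOP_REQUEST, ETH_P_ARP, ifa->ifa_local, dev,
				 ifa->ifa_local, NULL, dev->dev_addr, NULL);
	}
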
>>>>
>>>> Ack, I was basically repeating the question of where does
>>>> the notification with dev == br get generated.
>>>>
>>>> There is a protection in this patch to make sure the other
>>>> end of the veth is not plugged into a bridge (i.e. is not
>>>> a bridge port) but there can be a macvlan on top of that
>>>> veth that is part of a bridge, so IIUC that check is either
>>>> insufficient or unnecessary.
>>>
>>> 	I'm a bit concerned this is becoming an interface plumbing
>>> topology change whack-a-mole.
>>>
>>> In the above, what if the veth is plugged into a bridge, and
>>> there's an end station on that bridge? If it's bridges all the way down,
>>> where does the need for some kind of TCN mechanism stop?
>>>
>>> 	Or instead of a veth it's a physical network hop (perhaps a
>>> tunnel; something through which notifiers do not propagate) to another
>>> host with another bridge, then what?
>>>
>>> -J
>>>
>>> ---
>>> -Jay Vosburgh, jay.vosburgh@...onical.com
>>
>> I see 3 technologies that are used for network virtualization in combination with bond for redundancy
>> (and I may be missing some):
>> (1) MACVTAP/MACVLAN over bond:
>>     MACVLAN propagates notifiers from bond to endpoints (same as VLAN)
>>     (drivers/net/macvlan.c:
>>         case NETDEV_NOTIFY_PEERS:
>>         case NETDEV_BONDING_FAILOVER:
>>         case NETDEV_RESEND_IGMP:
>>                 /* Propagate to all vlans */
>>                 list_for_each_entry(vlan, &port->vlans, list)
>>                         call_netdevice_notifiers(event, vlan->dev);
>>     })
>> (2) OpenVSwitch:
>>     OVS seems to have its own bond implementation, but sends out a Reverse ARP on active-backup failover
>> (3) User defined bridge over bond:
>>     propagates notifiers to the bridge device itself, but not to the devices attached to bridge ports.
>>     (net/bridge/br.c:
>>         case NETDEV_RESEND_IGMP:
>>                 /* Propagate to master device */
>>                 call_netdevice_notifiers(event, br->dev);)
>>
>> Active-backup may not be the best bonding mode, but it is a simple way to achieve redundancy and I've seen it being used.
>> I don't see a use case for MACVLAN over bridge over bond (?)
>
> If you're talking about this particular case (network virtualization) - sure. But macvlans over bridges
> are heavily used in Cumulus Linux and large scale setups. For example VRRP is implemented using macvlan
> devices. Any notification that propagates to the bridge and reaches these would cause a storm of broadcasts
> being sent down which would not scale and is extremely undesirable in general.
>
>> The external HW network does not need to be updated about the instances that are connected via a tunnel,
>> so I don't see an issue there.
>>
>> I had this idea of how to solve the failover issue for veth pairs attached to the user-defined bridge.
>> Does this need to be configurable? How? Per veth pair?
>
> That is not what I meant (if you were referring to my comment), I meant if it gets implemented in the
> bridge and it starts propagating the notify peers notifier - that _must_ be configurable.
>
>>
>> Of course a more general solution for how bridge over bond could handle notifications would be great,
>> but I'm running out of ideas. So I thought I'd address veth first.
>> Your help and ideas are highly appreciated, thank you.
>
> I'm curious why it must be done in the kernel at all? This can obviously be solved in user-space
> by sending grat arps towards the flapped port for fdbs on other ports (e.g. veths) based on a netlink notification.
> In fact, based on your description, propagating NETDEV_NOTIFY_PEERS to bridge ports wouldn't help
> because in that case the remote peer veth will not generate a grat arp. The notification will
> get propagated only to the local veth (bridge port), or to the bridge itself, depending on the implementation.
>
> So from bridge perspective, if you decide to pursue a kernel solution, I think you'll need
> a new bridge port option which acts on NOTIFY_PEERS and generates a grat arp for all fdbs
> on the port where it is enabled to the port which generated the NOTIFY_PEERS. Note that this
> is also fragile, as I'm sure some stacked device configs would not work, so I want to re-iterate
> how much easier it is to solve this in user-space, which has better visibility and can be
> changed much faster to accommodate new use cases.
>
> To illustrate:    bond
>                     \
>                    bridge
>                     /
>                  veth0
>                    |
>                  veth1
>
> When the bond generates NOTIFY_PEERS and you have this new option enabled on veth0, then
> the bridge should generate grat arps for all fdbs on veth0 towards the bond so the new
> path would learn them. Note that this is very dangerous, as veth1 can generate thousands
> of fdbs and you can potentially DDoS the whole network, so again I'd advise to do
> this in user-space where you can better control it.
>
> W.r.t. this patch, I think it will also work and will cause a single grat arp, which
> is ok. You just need to make sure loops are not possible; for example, I think you can loop
> your implementation with the following config (untested theory):
>          bond
>            \
>           bridge
>          /      \
>    veth2.10      veth0 - veth1
>       \                    \
>        \                veth1.10 (vlan)
>         \                  \
>          \                bridge2
>           \                /
>           veth2  -  veth3
>
>
> 1. bond generates NOTIFY_PEERS
> 2. bridge propagates to veth1 (through veth0 port)
> 3. veth1 propagates to its vlan (veth1.10)
> 4. bridge2 sees veth1.10 NOTIFY_PEERS and propagates to veth2 (through veth3 port)
> 5. veth2 propagates to its vlan (veth2.10)
> 6. veth2.10 propagates it back to bridge
> <loop>
>
> I'm sure similar setup, and maybe even simpler, can be constructed with other devices
> which can propagate or generate NOTIFY_PEERS.
>
> Cheers,
> Nik
>
Thank you very much, Nik, for your advice and your thorough explanations.
I think I could prevent the loop you describe above. But I agree that propagating
notifications through veth is more risky, because there is no upper-lower relationship.
Is there interest in a v3 of my patch, where I would also incorporate Jakub's comments?
Otherwise I would next explore a user-space solution like Nik proposed.
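If it comes to that, a minimal starting point could look roughly like this (just a sketch; I am
assuming the bond is named bond0, and the bridge FDB walk plus the actual gratuitous ARP
transmission are left as placeholders):

	/* Sketch of a user-space daemon: listen for link notifications over
	 * rtnetlink; on an event for the bond, re-announce the addresses
	 * learned behind the veth ports (FDB walk and ARP sending omitted).
	 */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <linux/netlink.h>
	#include <linux/rtnetlink.h>
	#include <net/if.h>

	int main(void)
	{
		struct sockaddr_nl sa = {
			.nl_family = AF_NETLINK,
			.nl_groups = RTMGRP_LINK,	/* link change events */
		};
		char buf[8192];
		int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

		if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
			perror("netlink");
			return 1;
		}

		for (;;) {
			int len = recv(fd, buf, sizeof(buf), 0);
			struct nlmsghdr *nh;

			if (len <= 0)
				break;

			for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
			     nh = NLMSG_NEXT(nh, len)) {
				struct ifinfomsg *ifi;
				char name[IF_NAMESIZE] = "";

				if (nh->nlmsg_type != RTM_NEWLINK)
					continue;
				ifi = NLMSG_DATA(nh);
				if (!if_indextoname(ifi->ifi_index, name))
					continue;
				if (strcmp(name, "bond0"))	/* assumed bond name */
					continue;

				/* Placeholder: dump the bridge FDB (RTM_GETNEIGH)
				 * and send gratuitous ARPs for entries learned on
				 * the veth ports, e.g. via a packet socket or by
				 * invoking "arping -U". */
				printf("link event on %s: re-announce FDB entries\n",
				       name);
			}
		}
		close(fd);
		return 0;
	}

A real daemon would of course have to scope and rate-limit the re-announcements, as Nik pointed out.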
Thank you all very much
Alexandra