Message-ID: <7f0bb43d-9943-2f0a-f4c2-3d4c0e55e69d@blackwall.org>
Date: Wed, 30 Mar 2022 14:25:24 +0300
From: Nikolay Aleksandrov <razor@...ckwall.org>
To: Alexandra Winter <wintera@...ux.ibm.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>,
Paolo Abeni <pabeni@...hat.com>,
Hangbin Liu <liuhangbin@...il.com>, netdev@...r.kernel.org,
linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>,
Roopa Prabhu <roopa@...dia.com>,
bridge@...ts.linux-foundation.org,
Ido Schimmel <idosch@...dia.com>, Jiri Pirko <jiri@...dia.com>,
Jay Vosburgh <j.vosburgh@...il.com>
Subject: Re: [PATCH net-next v2] veth: Support bonding events
On 30/03/2022 14:14, Alexandra Winter wrote:
>
>
> On 30.03.22 12:23, Nikolay Aleksandrov wrote:
>> On 30/03/2022 03:54, Jakub Kicinski wrote:
>>> Dropping the BPF people from CC and adding Hangbin, bridge and
>>> bond/team. Please exercise some judgment when sending patches.
> Thank you.
> I did 'scripts/get_maintainer.pl drivers/net/veth.c'
> but was a bit surprised by the outcome.
>
>>>
>>> On Tue, 29 Mar 2022 13:40:52 +0200 Alexandra Winter wrote:
>>>> Bonding drivers generate specific events during failover that trigger
>>>> switch updates. When a veth device is attached to a bridge with a
>>>> bond interface, we want external switches to learn about the veth
>>>> devices as well.
>>>>
>>>> Example:
>>>>
>>>>  | veth_a2 | veth_b2 | veth_c2 |
>>>>  ------o-----------o----------o------
>>>>         \          |         /
>>>>          o         o        o
>>>>       veth_a1   veth_b1  veth_c1
>>>>      -------------------------
>>>>      |        bridge         |
>>>>      -------------------------
>>>>                bond0
>>>>               /     \
>>>>            eth0     eth1
>>>>
>>>> In case of failover from eth0 to eth1, the netdev_notifier needs to be
>>>> propagated, so e.g. veth_a2 can re-announce its MAC address to the
>>>> external hardware attached to eth1.
>>>>
>>>> Without this patch we have seen cases where recovery after bond failover
>>>> took an unacceptable amount of time (depending on timeout settings in the
>>>> network).
>>>>
>>>> Due to the symmetric nature of veth, special care is required to avoid
>>>> endless notification loops. Therefore we only notify from a veth
>>>> bridgeport to a peer that is not a bridgeport.
>>>>
>>>> References:
>>>> Same handling as for macvlan:
>>>> commit 4c9912556867 ("macvlan: Support bonding events")
>>>> and vlan:
>>>> commit 4aa5dee4d999 ("net: convert resend IGMP to notifier event")
>>>>
>>>> Alternatives:
>>>> Propagate notifier events to all ports of a bridge. IIUC, this was
>>>> rejected in https://www.spinics.net/lists/netdev/msg717292.html
>>>
>>> My (likely flawed) reading of Nik's argument was that (1) he was
>>> concerned about GARP storms; (2) he didn't want the GARP to be
>>> broadcast to all ports, just the bond that originated the request.
>>>
>>
>> Yes, that would be ideal. We are trying to avoid unnecessary bcasts; that
>> is especially important for large setups with lots of devices.
>
> One way to target the bond that originated the request would be for the
> bridge itself to do GARPs/RARPs/... on this bond port for all MACs
> that are in its FDB. What do you think about that?
>
That's a hack, and you can already do it easily in user-space; you don't
need anything special in the bridge. It is also very specific and should
only happen in certain situations (e.g. an a/b flap) which the bridge
doesn't really know about, but user-space does, because it can see the
notifications and the bond mode.
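
For illustration only, a rough user-space sketch (untested; the interface
name, MAC and IP would come from whatever daemon watches rtnetlink for the
failover event and walks the FDB):

/* Rough, untested sketch: emit one gratuitous ARP for (mac, ip) out of
 * ifname via an AF_PACKET socket.
 */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int send_garp(const char *ifname, const unsigned char mac[6],
		     in_addr_t ip)
{
	struct sockaddr_ll sll = { .sll_family = AF_PACKET };
	unsigned char frame[42], *arp = frame + 14;
	int fd, ret;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
	if (fd < 0)
		return -1;

	memset(frame, 0xff, 6);		/* dst MAC: broadcast */
	memcpy(frame + 6, mac, 6);	/* src MAC: address we announce */
	frame[12] = 0x08;		/* ethertype 0x0806: ARP */
	frame[13] = 0x06;

	arp[0] = 0x00; arp[1] = 0x01;	/* htype: Ethernet */
	arp[2] = 0x08; arp[3] = 0x00;	/* ptype: IPv4 */
	arp[4] = 6; arp[5] = 4;		/* hlen, plen */
	arp[6] = 0x00; arp[7] = 0x01;	/* op: request */
	memcpy(arp + 8, mac, 6);	/* sender MAC */
	memcpy(arp + 14, &ip, 4);	/* sender IP */
	memset(arp + 18, 0x00, 6);	/* target MAC: unused in a request */
	memcpy(arp + 24, &ip, 4);	/* target IP == sender IP -> GARP */

	sll.sll_protocol = htons(ETH_P_ARP);
	sll.sll_ifindex = if_nametoindex(ifname);
	sll.sll_halen = 6;
	memset(sll.sll_addr, 0xff, 6);

	ret = sendto(fd, frame, sizeof(frame), 0,
		     (struct sockaddr *)&sll, sizeof(sll));
	close(fd);
	return ret < 0 ? -1 : 0;
}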
>>
>>> I'm not sure I follow (1), as Hangbin said the event is rare, plus
>>> GARP only comes from interfaces that have an IP addr, which IIUC
>>> most bridge ports will not have.
>>>
>>
>> Indeed, such setups are not the most common ones.
>>
>>> This patch in no way addresses (2). But then, again, if we put
>>> a macvlan on top of a bridge master, it will shotgun its GARPs all
>>> the same. So it's not like veth would be special in that regard.
>>>
>>> Nik, what am I missing?
>>>
>>
>> If we're talking about macvlan -> bridge -> bond then the bond flap's
>> notify peers shouldn't reach the macvlan. Generally broadcast traffic
>> is quite expensive for the bridge; I have patches that improve the
>> technical side (consider only ports in the same bcast domain), but you
>> also wouldn't want unnecessary bcast packets being sent around. :)
>> There are setups with tens of bond devices, and propagating that to all
>> would be very expensive, but most of all unnecessary. It would also hurt
>> setups with a lot of vlan devices on the bridge. There are setups with
>> hundreds of vlans and hundreds of macvlans on top; propagating it up
>> would send it to all of them, and that wouldn't scale at all. These
>> mostly have IP addresses, too.
>>
>> Perhaps we can enable propagation on a per-port or per-bridge basis;
>> then we can avoid these walks. That is, make it opt-in.
>>
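
To sketch what opt-in could look like (hypothetical: BR_PROPAGATE_NOTIFY
and the helper name below are made up, nothing like them exists today):

/* Hypothetical per-port opt-in flag; the value is a placeholder. */
#define BR_PROPAGATE_NOTIFY	BIT(24)

static bool brport_wants_peer_notify(const struct net_device *dev)
{
	return netif_is_bridge_port(dev) &&
	       br_port_flag_is_set(dev, BR_PROPAGATE_NOTIFY);
}

veth_device_event() in the patch below could then test
brport_wants_peer_notify(lower) before calling veth_notify_peer(), so the
walk only touches ports where the admin asked for it.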
>>>> It also seems difficult to avoid re-bouncing the notifier.
>>>
>>> syzbot will make short work of this patch; I think the potential
>>> for infinite loops has to be addressed somehow. IIUC this is the
>>> first instance of forwarding those notifiers to a peer rather
>>> than within an upper <> lower device hierarchy, which is a DAG.
>
> My concern was about Hangbin's alternative proposal to notify all
> bridge ports. I hope that in my proposal I was able to avoid infinite loops.
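
To make the bounce concern concrete, the problematic shape is roughly a
veth pair whose two ends sit in different bridges:

    br0 -- veth_x1 <-peer-> veth_x2 -- br1

If an event reaching a bridge were fanned out to all of its ports (the
rejected alternative) and each veth port forwarded it to its peer, br0's
notification would reach veth_x2, be fanned out by br1, and come straight
back. The netif_is_bridge_port(peer) check in the patch breaks that cycle
by never notifying a peer that is itself a bridge port.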
>
>>>
>>>> Signed-off-by: Alexandra Winter <wintera@...ux.ibm.com>
>>>> ---
>>>> drivers/net/veth.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 53 insertions(+)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index d29fb9759cc9..74b074453197 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -1579,6 +1579,57 @@ static void veth_setup(struct net_device *dev)
>>>> dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>>>> }
>>>>
>>>> +static bool netif_is_veth(const struct net_device *dev)
>>>> +{
>>>> + return (dev->netdev_ops == &veth_netdev_ops);
>>>
>>> brackets unnecessary
>>>
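
That would simply be:

	return dev->netdev_ops == &veth_netdev_ops;
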
>>>> +}
>>>> +
>>>> +static void veth_notify_peer(unsigned long event, const struct net_device *dev)
>>>> +{
>>>> + struct net_device *peer;
>>>> + struct veth_priv *priv;
>>>> +
>>>> + priv = netdev_priv(dev);
>>>> + peer = rtnl_dereference(priv->peer);
>>>> + /* avoid re-bounce between 2 bridges */
>>>> + if (!netif_is_bridge_port(peer))
>>>> + call_netdevice_notifiers(event, peer);
>>>> +}
>>>> +
>>>> +/* Called under rtnl_lock */
>>>> +static int veth_device_event(struct notifier_block *unused,
>>>> + unsigned long event, void *ptr)
>>>> +{
>>>> + struct net_device *dev, *lower;
>>>> + struct list_head *iter;
>>>> +
>>>> + dev = netdev_notifier_info_to_dev(ptr);
>>>> +
>>>> + switch (event) {
>>>> + case NETDEV_NOTIFY_PEERS:
>>>> + case NETDEV_BONDING_FAILOVER:
>>>> + case NETDEV_RESEND_IGMP:
>>>> + /* propagate to peer of a bridge attached veth */
>>>> + if (netif_is_bridge_master(dev)) {
>>>
>>> Having veth sift thru bridge ports seems strange.
>>> In fact it could be beneficial to filter the event based on
>>> port state (whether it's forwarding, vlan etc). But looking
>>> at details of port state outside the bridge would be even stranger.
>>>
>>>> + iter = &dev->adj_list.lower;
>>>> + lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>> + while (lower) {
>>>> + if (netif_is_veth(lower))
>>>> + veth_notify_peer(event, lower);
>>>> + lower = netdev_next_lower_dev_rcu(dev, &iter);
>>>
>>> let's add netdev_for_each_lower_dev_rcu() rather than open-coding
>>>
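
Something along these lines, modeled on the existing
netdev_for_each_lower_dev() in include/linux/netdevice.h (untested sketch;
iter stays a struct list_head * cursor, as in the patch):

#define netdev_for_each_lower_dev_rcu(dev, ldev, iter)			\
	for (iter = &(dev)->adj_list.lower,				\
	     ldev = netdev_next_lower_dev_rcu(dev, &(iter));		\
	     ldev;							\
	     ldev = netdev_next_lower_dev_rcu(dev, &(iter)))

The open-coded while loop would then collapse to:

	netdev_for_each_lower_dev_rcu(dev, lower, iter)
		if (netif_is_veth(lower))
			veth_notify_peer(event, lower);
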
>>>> + }
>>>> + }
>>>> + break;
>>>> + default:
>>>> + break;
>>>> + }
>>>> + return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block veth_notifier_block __read_mostly = {
>>>> + .notifier_call = veth_device_event,
>>>
>>> extra tab here
>>>
>>>> +};
>>>> +
>>>> /*
>>>> * netlink interface
>>>> */
>>>> @@ -1824,12 +1875,14 @@ static struct rtnl_link_ops veth_link_ops = {
>>>>
>>>> static __init int veth_init(void)
>>>> {
>>>> + register_netdevice_notifier(&veth_notifier_block);
>>>
>>> this can fail
>>>
>>>> return rtnl_link_register(&veth_link_ops);
>>>> }
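
The usual unwind shape (untested sketch) would be:

static __init int veth_init(void)
{
	int ret;

	/* Register the notifier first and bail out if that fails. */
	ret = register_netdevice_notifier(&veth_notifier_block);
	if (ret)
		return ret;

	/* Undo the notifier registration if link ops can't register. */
	ret = rtnl_link_register(&veth_link_ops);
	if (ret)
		unregister_netdevice_notifier(&veth_notifier_block);

	return ret;
}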
>>>>
>>>> static __exit void veth_exit(void)
>>>> {
>>>> rtnl_link_unregister(&veth_link_ops);
>>>> + unregister_netdevice_notifier(&veth_notifier_block);
>>>> }
>>>>
>>>> module_init(veth_init);
>>>
>>