Message-ID: <1d9c3666-4b29-17e6-1b65-8c64c5eed726@nvidia.com>
Date: Tue, 26 Oct 2021 15:20:03 +0300
From: Nikolay Aleksandrov <nikolay@...dia.com>
To: Vladimir Oltean <vladimir.oltean@....com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Roopa Prabhu <roopa@...dia.com>,
Ido Schimmel <idosch@...dia.com>,
Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Andrew Lunn <andrew@...n.ch>,
Florian Fainelli <f.fainelli@...il.com>,
Vivien Didelot <vivien.didelot@...il.com>,
Vladimir Oltean <olteanv@...il.com>,
Jiri Pirko <jiri@...dia.com>
Subject: Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del
from switchdev to the bridge
On 26/10/2021 14:25, Vladimir Oltean wrote:
> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>> Hi,
>> Interesting way to work around the asynchronous notifiers. :) I went over
>> the patch set, and given that we'd have to support and maintain this fragile
>> solution (e.g. playing with locking, possible races with fdb changes, etc.), I'm
>> inclined to go with Ido's earlier proposal: convert the hash_lock into a mutex,
>> with delayed learning from the fast path, to get a sleepable context where we can
>> make synchronous switchdev calls and get feedback immediately.
>
> Delayed learning means that we'll receive a sequence of packets like this:
>
>            br0--------\
>           /    \       \
>          /      \       \
>         /        \       \
>       swp0      swp1     swp2
>        |          |        |
>   station A   station B   station C
>
> Station A sends a request to B, and station B sends a reply to A.
> Since the learning of station A's MAC SA races with the reply sent by
> station B, it now becomes theoretically possible for the reply packet to
> be flooded to station C as well, right? And that was not possible before
> (at least assuming an ageing time longer than the round-trip time of these packets).
>
> And that will happen regardless of whether switchdev is used or not.
> I don't want to outright dismiss this (maybe I don't fully understand
> this either), but it seems like a pretty heavy-handed change.
>
It will, depending on lock contention. I plan to add a fast/uncontended case that
uses trylock from the fast path and queues the fdb only if that fails. But yes, in
general you are correct: in the queued case, the traffic could get flooded before the
delayed learning processes the entry. That is the trade-off if we want a sleepable
learning context. Ido noted privately that this is usually how hardware behaves
anyway, and if people want a guarantee that the reply won't get flooded, there are
other ways to achieve that (disabling unicast flooding, firewall rules, etc.).
Note that today the reply can get flooded as well if the entry can't be programmed,
e.g. if the atomic allocation fails we'll flood it again; granted, that's much less
likely, but there has never been such a guarantee. I think it's generally a good
improvement and will remove a lot of processing complexity. We can bite the bullet
and get the underlying delayed infrastructure correct once, now; then the locking
rules and other use cases will be easier to enforce and reason about in the future.
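
To make the trade-off concrete, here is a rough sketch of the deferred learning
path I have in mind (leaving the trylock fast path aside). Every name below is
made up for illustration, e.g. struct br_deferred_learn, the deferred_learn_*
fields on net_bridge and br_fdb_update_and_notify(); none of this is actual
patch code:

/* Fast path (softirq): cannot sleep, so just queue the address and kick
 * a worker. A real version would also need to hold a reference on the
 * port (or re-look it up) so it can't go away before the work runs.
 */
struct br_deferred_learn {
        struct llist_node node;
        struct net_bridge_port *port;
        unsigned char addr[ETH_ALEN];
        u16 vid;
};

static void br_fdb_learn_fast_path(struct net_bridge *br,
                                   struct net_bridge_port *p,
                                   const unsigned char *addr, u16 vid)
{
        struct br_deferred_learn *dl;

        dl = kmalloc(sizeof(*dl), GFP_ATOMIC);
        if (!dl)
                return; /* best-effort, same as learning is today */

        dl->port = p;
        ether_addr_copy(dl->addr, addr);
        dl->vid = vid;
        llist_add(&dl->node, &br->deferred_learn_list);
        queue_work(system_long_wq, &br->deferred_learn_work);
}

/* Process context: the hash_lock, now a mutex, can be taken and held
 * across synchronous switchdev calls, so we get success/failure back
 * immediately instead of via asynchronous notifiers.
 */
static void br_fdb_deferred_learn_work(struct work_struct *work)
{
        struct net_bridge *br = container_of(work, struct net_bridge,
                                             deferred_learn_work);
        struct br_deferred_learn *dl, *tmp;
        struct llist_node *entries;

        entries = llist_del_all(&br->deferred_learn_list);
        llist_for_each_entry_safe(dl, tmp, entries, node) {
                mutex_lock(&br->hash_lock);
                br_fdb_update_and_notify(br, dl->port, dl->addr, dl->vid);
                mutex_unlock(&br->hash_lock);
                kfree(dl);
        }
}

The window between llist_add() and the worker taking the mutex is exactly
where the flooding you describe can happen; that's the price of the
sleepable context.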
>> That would be the
>> cleanest and most straightforward solution; it'd be less error-prone and easier
>> to maintain long term. I plan to convert the bridge hash_lock to a mutex, and then
>> you can do the synchronous switchdev change, if you don't mind and agree of course.
>
> I agree that there are races and implications I haven't fully thought of
> with this temporary dropping of the br->hash_lock. It doesn't appear ideal.
>
> For example,
>
> /* Delete an FDB entry and notify switchdev. */
> static int __br_fdb_delete(struct net_bridge *br,
>                            const struct net_bridge_port *p,
>                            const u8 *addr, u16 vlan,
>                            struct netlink_ext_ack *extack)
> {
>         struct br_switchdev_fdb_wait_ctx wait_ctx;
>         struct net_bridge_fdb_entry *fdb;
>         int err;
>
>         br_switchdev_fdb_wait_ctx_init(&wait_ctx);
>
>         spin_lock_bh(&br->hash_lock);
>
>         fdb = br_fdb_find(br, addr, vlan);
>         if (!fdb || READ_ONCE(fdb->dst) != p) {
>                 spin_unlock_bh(&br->hash_lock);
>                 return -ENOENT;
>         }
>
>         br_fdb_notify_async(br, fdb, RTM_DELNEIGH, extack, &wait_ctx);
>
>         spin_unlock_bh(&br->hash_lock);
>
>         err = br_switchdev_fdb_wait(&wait_ctx); <- at this stage (more comments below)
>         if (err)
>                 return err;
>
>         /* We've notified rtnl and switchdev once, don't do it again,
>          * just delete.
>          */
>         return fdb_delete_by_addr_and_port(br, p, addr, vlan, false);
> }
>
> the software FDB still contains the entry, while the hardware doesn't.
> And we are no longer holding the lock, so somebody can either add or
> delete that entry.
>
> If somebody else tries to concurrently add that entry, it should not
> notify switchdev again because it will see that the FDB entry exists,
> and we should finally end up deleting it, resulting in a consistent
> state.
>
> If somebody else tries to concurrently delete that entry, it will
> probably be from a code path that ignores errors (because the code paths
> that don't are serialized by the rtnl_mutex). Switchdev will say "hey,
> but I don't have this FDB entry, you've just deleted it", but that will
> again be fine.
>
> There seems to be a problem if somebody concurrently deletes that entry,
> _and_then_ it gets added back again, all that before we call
> fdb_delete_by_addr_and_port(). Because we don't notify switchdev the
> second time around, we'll end up with an address in hardware but no
> software counterpart.
>
> I don't really know how to cleanly deal with that.
>
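Just to make sure we're reading the race the same way, the problematic
interleaving is roughly this (my shorthand, not code from your RFC):

CPU 0 (__br_fdb_delete)                    CPU 1
-----------------------                    -----
br_fdb_find() -> entry found
br_fdb_notify_async(RTM_DELNEIGH)
spin_unlock_bh(&br->hash_lock)
                                           deletes the entry (sw and hw)
                                           re-adds the entry (sw and hw)
br_switchdev_fdb_wait() -> success
fdb_delete_by_addr_and_port(..., false)
  -> deletes CPU 1's software entry
     without notifying switchdev, so
     CPU 1's hardware entry is left behind
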
Right, I'm sure new cases will come up, and it won't be easy to reason about these
races when changes need to be made. I'd rather stick to a more straightforward
and simpler approach.
>> By the way, patches 1-6 can stand on their own; feel free to send them separately.
>
> Thanks, I will.
>