Message-ID: <09fb8f2e-7a8b-4dc9-63f7-5255d5c6bd30@nvidia.com>
Date:   Wed, 27 Oct 2021 12:36:02 +0300
From:   Nikolay Aleksandrov <nikolay@...dia.com>
To:     Vladimir Oltean <vladimir.oltean@....com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Roopa Prabhu <roopa@...dia.com>,
        Ido Schimmel <idosch@...dia.com>,
        Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Vladimir Oltean <olteanv@...il.com>,
        Jiri Pirko <jiri@...dia.com>
Subject: Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del
 from switchdev to the bridge

On 27/10/2021 12:20, Nikolay Aleksandrov wrote:
> On 27/10/2021 01:27, Nikolay Aleksandrov wrote:
>> On 27/10/2021 00:51, Vladimir Oltean wrote:
>>> On Tue, Oct 26, 2021 at 10:56:59PM +0300, Nikolay Aleksandrov wrote:
>>>> On 26/10/2021 22:01, Vladimir Oltean wrote:
>>>>> On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
>>>>>> On 26/10/2021 19:54, Vladimir Oltean wrote:
>>>>>>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
>>>>>>>> On 26/10/2021 14:25, Vladimir Oltean wrote:
>>>>>>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> Interesting way to work around the asynchronous notifiers. :) I went over
>>>>>>>>>> the patch-set, and given that we'd have to support and maintain this fragile
>>>>>>>>>> solution (e.g. playing with locking, possible races with fdb changes, etc.), I'm
>>>>>>>>>> inclined to go with Ido's previous proposition: convert the hash_lock into a mutex
>>>>>>>>>> with delayed learning from the fast-path, to get a sleepable context where we can
>>>>>>>>>> use synchronous switchdev calls and get feedback immediately.
>>>>>>>>>
>>>>>>>>> Delayed learning means that we'll receive a sequence of packets like this:
>>>>>>>>>
>>>>>>>>>             br0--------\
>>>>>>>>>           /    \        \
>>>>>>>>>          /      \        \
>>>>>>>>>         /        \        \
>>>>>>>>>      swp0         swp1    swp2
>>>>>>>>>       |            |        |
>>>>>>>>>    station A   station B  station C
>>>>>>>>>
>>>>>>>>> station A sends request to B, station B sends reply to A.
>>>>>>>>> Since the learning of station A's MAC SA races with the reply sent by
>>>>>>>>> station B, it now becomes theoretically possible for the reply packet to
>>>>>>>>> be flooded to station C as well, right? And that was not possible before
>>>>>>>>> (at least assuming an ageing time longer than the round-trip time of these packets).
>>>>>>>>>
>>>>>>>>> And that will happen regardless of whether switchdev is used or not.
>>>>>>>>> I don't want to outright dismiss this (maybe I don't fully understand
>>>>>>>>> this either), but it seems like a pretty heavy-handed change.
>>>>>>>>>
>>>>>>>>
>>>>>>>> It will, depending on lock contention. I plan to add a fast/uncontended case with
>>>>>>>> trylock from the fast-path, and if that fails then queue the fdb, but yes - in general
>>>>>>>
>>>>>>> I wonder why mutex_trylock has this comment?
>>>>>>>
>>>>>>>  * This function must not be used in interrupt context. The
>>>>>>>  * mutex must be released by the same task that acquired it.
>>>>>>>
>>>>>>>> you are correct that the traffic could get flooded in the queue case before the delayed
>>>>>>>> learning processes the entry; it's a trade-off if we want a sleepable learning context.
>>>>>>>> Ido noted privately that's usually how hw acts anyway. Also, if people want guarantees
>>>>>>>> that the reply won't get flooded, there are other methods to achieve that (ucast flood
>>>>>>>> disable, firewall rules, etc.).
>>>>>>>
>>>>>>> Not all hardware is like that, the switches I'm working with, which
>>>>>>> perform autonomous learning, all complete the learning process for a
>>>>>>> frame strictly before they start the forwarding process. The software
>>>>>>> bridge also behaves like that. My only concern is that we might start
>>>>>>> building on top of some fundamental bridge changes like these, which
>>>>>>> might risk a revert a few months down the line, when somebody notices
>>>>>>> and comes with a use case where that is not acceptable.
>>>>>>>
>>>>>>
>>>>>> I should've clarified I was talking about Spectrum as Ido did in a reply.
>>>>>
>>>>> Ido also said "I assume Spectrum is not special in this regard" and I
>>>>> just wanted to say this such that we don't end with the wrong impression.
>>>>> Special or not, to the software bridge it would be new behavior, which
>>>>> I can only hope will be well received.
>>>>>
>>>>>>>> Today the reply could get flooded if the entry can't be programmed as well, e.g. the
>>>>>>>> atomic allocation might fail and we'll flood it again; granted, that's much less likely,
>>>>>>>> but still, there have never been any such guarantees. I think it's generally a good improvement and
>>>>>>>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
>>>>>>>> infrastructure correct once now; then the locking rules and other use cases would be easier to enforce
>>>>>>>> and reason about in the future.
>>>>>>>
>>>>>>> You're the maintainer, I certainly won't complain if we go down this path.
>>>>>>> It would be nice if br->lock can also be transformed into a mutex, it
>>>>>>> would make all of switchdev much simpler.
>>>>>>>
>>>>>>
>>>>>> That is why we are discussing possible solutions; I don't want to force anything,
>>>>>> but rather reach a consensus about the way forward. There are complexities involved with
>>>>>> moving to delayed learning too. One is that the queue won't be a simple linked list
>>>>>> but probably a structure that allows fast lookups (e.g. an rbtree) to avoid duplicating entries;
>>>>>> we also may end up doing a two-stage lookup if the main hash table doesn't find an entry,
>>>>>> to close the above scenario's window as much as possible. But as I said, I think that we can get
>>>>>> these correct and well defined; after that we'll be able to reason about future changes and
>>>>>> use cases more easily. I'm still thinking about the various corner cases with delayed learning, so
>>>>>> any feedback is welcome. I'll start working on it in a few days and will get a clearer
>>>>>> view of the issues once I start converting the bridge, but having straightforward locking
>>>>>> rules and known deterministic behaviour sounds like a better long-term plan.
>>>>>
>>>>> Sorry, I might have to read the code, I don't want to misinterpret your
>>>>> idea. With what you're describing here, I think that what you are trying
>>>>> to achieve is to both have it our way, and preserve the current behavior
>>>>> for the common case, where learning still happens from the fast path.
>>>>> But I'm not sure that this is the correct goal, especially if you also
>>>>> add "straightforward locking rules" to the mix.
>>>>>
>>>>
>>>> The trylock was only an optimization idea, but yes, you'd need both synchronous
>>>> and asynchronous notifiers. I don't think you need a sleepable context when
>>>> learning from softirq - who would you propagate the error to? It is important
>>>> when entries are being added from user-space, but if you'd like to have a veto
>>>> option from softirq then we can scratch the trylock idea altogether.
>>>
>>> I'll let Ido answer here. As I said, the model I'm working with is that
>>> of autonomous learning, so for me, no. Whereas the Spectrum model is
>>> that of secure learning. I expect that it'd be pretty useless to set up
>>> software-assisted secure learning if you're just going to say yes and
>>> learn all addresses anyway. I've never seen Spectrum documentation, but
>>> I would be shocked if it couldn't be configured to operate in
>>> the bare-bones autonomous learning mode too.
>>>
>>
>> Ack, got it.
>>
>>>> Let's not focus on the trylock, it's a minor detail.
>>>>
>>>>> I think you'd have to explain what is the purpose of the fast path trylock
>>>>> logic you've mentioned above. So in the uncontended br->hash_lock case,
>>>>> br_fdb_update() could take that mutex and then what? It would create and
>>>>> add the FDB entry, and call fdb_notify(), but that still can't sleep.
>>>>> So if switchdev drivers still need to privately defer the offloading
>>>>> work, we still need some crazy completion-based mechanism to report
>>>>> errors like the one I posted here, because in the general sense,
>>>>> SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.
>>>>
>>>> We do not need one if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC,
>>>
>>> Just when I was about to say that I'm happy to get rid of some of those
>>> workqueues, and now you're telling me not only we might not get rid of
>>> them, but we might also end up with a second, separate implementation :)
>>>
>>> Anyway, let's not put the cart before the horses, let's see what the
>>> realities of the bridge data path learning and STP flushing will teach
>>> us about what can and can't be done.
>>>
>>
>> We will get rid of some workqueues (I hope). I was saying that if we do the trylock we
>> might have to add both sync and async, otherwise we just need a single sync one.
>>
>>>> but that optimization is probably not worth the complexity of
>>>> maintaining both so we can just drop the trylock.
>>>>
>>>>>
>>>>> And how do you queue a deletion, do you delete the FDB from the pending
>>>>> and the main hash table, or do you just create a deletion command to be
>>>>> processed in deferred context?
>>>>>
>>>>
>>>> All commands which cannot take the mutex directly will be executed from a delayed
>>>> queue. We also need a delayed flush operation because we need to flush from STP code
>>>> and we can't sleep there due to the STP spinlock. The pending table will be considered
>>>> only if we decide to do a 2-stage lookup; it won't be used or consulted in any other
>>>> case, so user-space adds and deletes go only to the main table.
>>>>
>>>>> And since you'd be operating on the hash table concurrently from the
>>>>> deferred work and from the fast path, doesn't this mean you'll need to
>>>>> use some sort of spin_lock_bh from the deferred work, which again would
>>>>> incur atomic context when you want to notify switchdev? Because with the
>>>>> mutex_trylock described above, you'd gain exclusivity to the main hash
>>>>> table, but if you lose, what you need is exclusivity to the pending hash
>>>>> table. So the deferred context also needs to be atomic when it dequeues
>>>>> the pending FDB entries and notifies them. I just don't see what we're
>>>>> winning, how the rtnetlink functions will be any different for the better.
>>>>>
>>>>
>>>> The rbtree can be fully taken by the delayed action and swapped with a new one,
>>>> so no exclusivity needed for the notifications. It will take the spinlock, get
>>>> the whole tree and release the lock, same if it was a simple queue.
>>>
>>> I'm quite unfamiliar with this technique, atomically swapping a queue
>>> pointer with a new empty one, and emptying that queue while unlocked.
>>> Do you have any reference implementation for this kind of technique?
>>>
>>
>> the delayed work would be doing something similar to:
>>
>> spin_lock(delay_lock);
>> record tree root
>> rcu change tree root with empty
>> spin_unlock(delay_lock);
>>
>> mutex_lock(br->hash_lock);
>> process recorded (old) tree and free items via rcu
>> mutex_unlock(br->hash_lock);
>>
>> We have the same pattern with queues all around the kernel.
>>
>>>> The spinlock for the rbtree for the delayed learning is necessary in all cases,
>>>
>>> "in all cases" means "regardless of whether we try from the fast path to
>>> make fdb_create() insert directly into &br->fdb_hash_tbl, or if we
>>> insert into a temporary rbtree for pending entries"? I don't understand
>>> this: why would you take the rbtree spinlock if you've inserted into the
>>> main hash table already?
>>>
>>
>> No, it means that we need the spinlock to protect the delayed queue.
>>
>>> I'm only concerned about spin locks we'd have to hold while calling
>>> fdb_notify().
>>>
>>
>> There won't be any spinlocks for fdb_notify(), we'll be doing all notifications from
>> sleepable context with the mutex held, that's the goal at least.
>>
>>>> if we used the trylock fast learn then we could directly insert the entry if we win, but
>>>> again let's just always use delayed ops from atomic contexts as a start.
>>>>
>>>> All entries from atomic contexts will be added to an rbtree which will be processed
>>>> from a delayed work, it will be freed by RCU so lookups can be done if we decide to
>>>> do a 2-stage lookup for the fast Rx path to reduce the flooding case window that you
>>>> described above.
>>>>
>>>> We win a sleepable context for user-space calls and the ability to do synchronous
>>>> calls to the drivers and wait for the errors.
>>>
>>> I think I'd really have to see the code at this point, sorry.
>>>
>>
>> Sure, I'll prepare an RFC version this week.
>>
>> Thanks for the comments.
>>
> 
> OK, I've started working on converting the bridge, but there are a few more downsides to
> delaying notifications that I really don't like, and there is no good, clean way around them.
> 
> a) after conversion we need to queue an event (read: take a spinlock) for each fdb roam and flag change,
>    which are currently lockless
> 
> b) we either lose transient fdb changes and send only the last notification for that fdb, or we have to snapshot
>    the whole fdb on every change
> 
> c) we need to serialize and order actions, so in addition to the rbtree for fdb lookup, we'll need an ordered list of events
> 
> Now b) is the most worrisome and really problematic: anything from user-space which is following that fdb's changes
> would miss events in the first case (not an option), or we'll have to deal with the complexity of snapshotting
> entries, which in itself has a few problems and doesn't really extend well to other objects (snapshotting an mdb
> could be a nightmare). I don't see a clean way forward to fix this, as I don't want to do the async completion
> waiting either; that would impose future race and locking issues and could make future changes harder to get right.
> 
> We have to explore alternative options; until we have something viable which doesn't play locking games,
> we'll have to live with the current situation.
> 
> I'll keep thinking about this and will try a few ideas.
> 
> Thanks,
>  Nik
> 

To add - I'll be experimenting with turning hash_lock into a semaphore, so we can use while(down_trylock()) from
the fast-path and down() from process context to be able to sleep. For this we'll still need to support
two notifier flavors, obviously - atomic and blocking - but the locking rules are kept simple. We won't
have error feedback for the atomic cases, but I think that's the best we can do in the current situation.
