netdev - Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e0517da5-359b-7d2b-39a5-10e1f681749a@nvidia.com>
Date:   Wed, 27 Oct 2021 01:27:47 +0300
From:   Nikolay Aleksandrov <nikolay@...dia.com>
To:     Vladimir Oltean <vladimir.oltean@....com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Roopa Prabhu <roopa@...dia.com>,
        Ido Schimmel <idosch@...dia.com>,
        Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Vladimir Oltean <olteanv@...il.com>,
        Jiri Pirko <jiri@...dia.com>
Subject: Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del
 from switchdev to the bridge

On 27/10/2021 00:51, Vladimir Oltean wrote:
> On Tue, Oct 26, 2021 at 10:56:59PM +0300, Nikolay Aleksandrov wrote:
>> On 26/10/2021 22:01, Vladimir Oltean wrote:
>>> On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
>>>> On 26/10/2021 19:54, Vladimir Oltean wrote:
>>>>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
>>>>>> On 26/10/2021 14:25, Vladimir Oltean wrote:
>>>>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>>>>>>>> Hi,
>>>>>>>> Interesting way to work around the asynchronous notifiers. :) I went over
>>>>>>>> the patch-set and given that we'll have to support and maintain this fragile
>>>>>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
>>>>>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
>>>>>>>> with delayed learning from the fast-path to get a sleepable context where we can
>>>>>>>> use synchronous switchdev calls and get feedback immediately.
>>>>>>>
>>>>>>> Delayed learning means that we'll receive a sequence of packets like this:
>>>>>>>
>>>>>>>             br0--------\
>>>>>>>           /    \        \
>>>>>>>          /      \        \
>>>>>>>         /        \        \
>>>>>>>      swp0         swp1    swp2
>>>>>>>       |            |        |
>>>>>>>    station A   station B  station C
>>>>>>>
>>>>>>> station A sends request to B, station B sends reply to A.
>>>>>>> Since the learning of station A's MAC SA races with the reply sent by
>>>>>>> station B, it now becomes theoretically possible for the reply packet to
>>>>>>> be flooded to station C as well, right? And that was not possible before
>>>>>>> (at least assuming an ageing time longer than the round-trip time of these packets).
>>>>>>>
>>>>>>> And that will happen regardless of whether switchdev is used or not.
>>>>>>> I don't want to outright dismiss this (maybe I don't fully understand
>>>>>>> this either), but it seems like a pretty heavy-handed change.
>>>>>>>
>>>>>>
>>>>>> It will depending on lock contention, I plan to add a fast/uncontended case with
>>>>>> trylock from fast-path and if that fails then queue the fdb, but yes - in general
>>>>>
>>>>> I wonder why mutex_trylock has this comment?
>>>>>
>>>>>  * This function must not be used in interrupt context. The
>>>>>  * mutex must be released by the same task that acquired it.
>>>>>
>>>>>> you are correct that the traffic could get flooded in the queue case before the delayed
>>>>>> learning processes the entry, it's a trade off if we want sleepable learning context.
>>>>>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
>>>>>> that the reply won't get flooded there are other methods to achieve that (ucast flood
>>>>>> disable, firewall rules etc).
>>>>>
>>>>> Not all hardware is like that, the switches I'm working with, which
>>>>> perform autonomous learning, all complete the learning process for a
>>>>> frame strictly before they start the forwarding process. The software
>>>>> bridge also behaves like that. My only concern is that we might start
>>>>> building on top of some fundamental bridge changes like these, which
>>>>> might risk a revert a few months down the line, when somebody notices
>>>>> and comes with a use case where that is not acceptable.
>>>>>
>>>>
>>>> I should've clarified I was talking about Spectrum as Ido did in a reply.
>>>
>>> Ido also said "I assume Spectrum is not special in this regard" and I
>>> just wanted to say this such that we don't end with the wrong impression.
>>> Special or not, to the software bridge it would be new behavior, which
>>> I can only hope will be well received.
>>>
>>>>>> Today the reply could get flooded if the entry can't be programmed
>>>>>> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
>>>>>> but still there haven't been any such guarantees. I think it's generally a good improvement and
>>>>>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
>>>>>> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
>>>>>> and reason about in the future.
>>>>>
>>>>> You're the maintainer, I certainly won't complain if we go down this path.
>>>>> It would be nice if br->lock can also be transformed into a mutex, it
>>>>> would make all of switchdev much simpler.
>>>>>
>>>>
>>>> That is why we are discussing possible solutions, I don't want to force anything
>>>> but rather reach a consensus about the way forward. There are complexities involved with
>>>> moving to delayed learning too, one would be that the queue won't be a simple linked list
>>>> but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries,
>>>> we also may end up doing two stage lookup if the main hash table doesn't find an entry
>>>> to close the above scenario's window as much as possible. But as I said I think that we can get
>>>> these correct and well defined, after that we'll be able to reason about future changes and
>>>> use cases easier. I'm still thinking about the various corner cases with delayed learning, so
>>>> any feedback is welcome. I'll start working on it in a few days and will get a clearer
>>>> view of the issues once I start converting the bridge, but having straight-forward locking
>>>> rules and known deterministic behaviour sounds like a better long term plan.
>>>
>>> Sorry, I might have to read the code, I don't want to misinterpret your
>>> idea. With what you're describing here, I think that what you are trying
>>> to achieve is to both have it our way, and preserve the current behavior
>>> for the common case, where learning still happens from the fast path.
>>> But I'm not sure that this is the correct goal, especially if you also
>>> add "straightforward locking rules" to the mix.
>>>
>>
>> The trylock was only an optimization idea, but yes you'd need both synchronous
>> and asynchronous notifiers. I don't think you need sleepable context when
>> learning from softirq, who would you propagate the error to? It is important
>> when entries are being added from user-space, but if you'd like to have veto
>> option from softirq then we can scratch the trylock idea altogether.
> 
> I'll let Ido answer here. As I said, the model I'm working with is that
> of autonomous learning, so for me, no. Whereas the Spectrum model is
> that of secure learning. I expect that it'd be pretty useless to set up
> software assisted secure learning if you're just going to say yes and
> learn all addresses anyway. I've never seen Spectrum documentation, but
> I would be shocked if it wouldn't be able to be configured to operate in
> the bare-bones autonomous learning mode too.
> 

Ack, got it.

>> Let's not focus on the trylock, it's a minor detail.
>>
>>> I think you'd have to explain what is the purpose of the fast path trylock
>>> logic you've mentioned above. So in the uncontended br->hash_lock case,
>>> br_fdb_update() could take that mutex and then what? It would create and
>>> add the FDB entry, and call fdb_notify(), but that still can't sleep.
>>> So if switchdev drivers still need to privately defer the offloading
>>> work, we still need some crazy completion-based mechanism to report
>>> errors like the one I posted here, because in the general sense,
>>> SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.
>>
>> We do not if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC,
> 
> Just when I was about to say that I'm happy to get rid of some of those
> workqueues, and now you're telling me not only we might not get rid of
> them, but we might also end up with a second, separate implementation :)
> 
> Anyway, let's not put the cart before the horses, let's see what the
> realities of the bridge data path learning and STP flushing will teach
> us about what can and can't be done.
> 

We will get rid of some workqueues (I hope), I was saying that if do the trylock we
might have to add both sync and async, otherwise we just need a single sync one.

>> but that optimization is probably not worth the complexity of
>> maintaining both so we can just drop the trylock.
>>
>>>
>>> And how do you queue a deletion, do you delete the FDB from the pending
>>> and the main hash table, or do you just create a deletion command to be
>>> processed in deferred context?
>>>
>>
>> All commands which cannot take the mutex directly will be executed from a delayed
>> queue. We also need a delayed flush operation because we need to flush from STP code
>> and we can't sleep there due to the STP spinlock. The pending table can be considered
>> only if we decide to do a 2-stage lookup, it won't be used or consulted in any other
>> case, so user-space adds and deletes go only the main table.
>>
>>> And since you'd be operating on the hash table concurrently from the
>>> deferred work and from the fast path, doesn't this mean you'll need to
>>> use some sort of spin_lock_bh from the deferred work, which again would
>>> incur atomic context when you want to notify switchdev? Because with the
>>> mutex_trylock described above, you'd gain exclusivity to the main hash
>>> table, but if you lose, what you need is exclusivity to the pending hash
>>> table. So the deferred context also needs to be atomic when it dequeues
>>> the pending FDB entries and notifies them. I just don't see what we're
>>> winning, how the rtnetlink functions will be any different for the better.
>>>
>>
>> The rbtree can be fully taken by the delayed action and swapped with a new one,
>> so no exclusivity needed for the notifications. It will take the spinlock, get
>> the whole tree and release the lock, same if it was a simple queue.
> 
> I'm quite unfamiliar with this technique, atomically swapping a queue
> pointer with a new empty one, and emptying that queue while unlocked.
> Do you have any reference implementation for this kind of technique?
> 

the delayed work would be doing something similar to:

spin_lock(delay_lock);
record tree root
rcu change tree root with empty
spin_unlock(delay_lock);

mutex_lock(br->hash_lock);
process recorded (old) tree and free items via rcu
mutex_unlock(br->hash_lock);

We have the same pattern with queues all around the kernel.

>> The spinlock for the rbtree for the delayed learning is necessary in all cases,
> 
> "in all cases" means "regardless of whether we try from the fast path to
> make fdb_create() insert directly into &br->fdb_hash_tbl, or if we
> insert into a temporary rbtree for pending entries"? I don't understand
> this: why would you take the rbtree spinlock if you've inserted into the
> main hash table already?
> 

No, it means that we need the spinlock to protect the delayed queue.

> I'm only concerned about spin locks we'd have to hold while calling
> fdb_notify().
> 

There won't be any spinlocks for fdb_notify(), we'll be doing all notifications from
sleepable context with the mutex held, that's the goal at least.

>> if we used the trylock fast learn then we could directly insert the entry if we win, but
>> again lets just always use delayed ops from atomic contexts as a start.
>>
>> All entries from atomic contexts will be added to an rbtree which will be processed
>> from a delayed work, it will be freed by RCU so lookups can be done if we decide to
>> do a 2-stage lookup for the fast Rx path to reduce the flooding case window that you
>> described above.
>>
>> We win having sleepable context for user-space calls and being able to do synchronous
>> calls to the drivers to wait for the errors.
> 
> I think I'd really have to see the code at this point, sorry.
> 

Sure, I'll prepare an RFC version this week.

Thanks for the comments.