netdev - Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210822174449.nby7gmrjhwv3walp@skbuf>
Date:   Sun, 22 Aug 2021 20:44:49 +0300
From:   Vladimir Oltean <olteanv@...il.com>
To:     Ido Schimmel <idosch@...sch.org>
Cc:     Nikolay Aleksandrov <nikolay@...dia.com>,
        Vladimir Oltean <vladimir.oltean@....com>,
        netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Roopa Prabhu <roopa@...dia.com>, Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Vadym Kochan <vkochan@...vell.com>,
        Taras Chornyi <tchornyi@...vell.com>,
        Jiri Pirko <jiri@...dia.com>, Ido Schimmel <idosch@...dia.com>,
        UNGLinuxDriver@...rochip.com,
        Grygorii Strashko <grygorii.strashko@...com>,
        Marek Behun <kabel@...ckhole.sk>,
        DENG Qingfang <dqfext@...il.com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Hauke Mehrtens <hauke@...ke-m.de>,
        Woojung Huh <woojung.huh@...rochip.com>,
        Sean Wang <sean.wang@...iatek.com>,
        Landen Chao <Landen.Chao@...iatek.com>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Alexandre Belloni <alexandre.belloni@...tlin.com>,
        George McCollister <george.mccollister@...il.com>,
        Ioana Ciornei <ioana.ciornei@....com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Leon Romanovsky <leon@...nel.org>,
        Lars Povlsen <lars.povlsen@...rochip.com>,
        Steen Hegelund <Steen.Hegelund@...rochip.com>,
        Julian Wiedmann <jwi@...ux.ibm.com>,
        Karsten Graul <kgraul@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Ivan Vecera <ivecera@...hat.com>,
        Vlad Buslov <vladbu@...dia.com>,
        Jianbo Liu <jianbol@...dia.com>,
        Mark Bloch <mbloch@...dia.com>, Roi Dayan <roid@...dia.com>,
        Tobias Waldekranz <tobias@...dekranz.com>,
        Vignesh Raghavendra <vigneshr@...com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>
Subject: Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
 blocking

On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> >    deferred by drivers even from code paths that are initially blocking
> >    (are running in process context):
> > 
> > br_fdb_add
> > -> __br_fdb_add
> >    -> fdb_add_entry
> >       -> fdb_notify
> >          -> br_switchdev_fdb_notify
> > 
> >     It seems fairly trivial to move the fdb_notify call outside of the
> >     atomic section of fdb_add_entry, but with switchdev offering only an
> >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> >     still have to defer these events and are unable to provide
> >     synchronous feedback to user space (error codes, extack).
> > 
> > The above issues would warrant an attempt to fix a central problem, and
> > make switchdev expose an API that is easier to consume rather than
> > having drivers implement lateral workarounds.
> > 
> > In this case, we must notice that
> > 
> > (a) switchdev already has the concept of notifiers emitted from the fast
> >     path that are still processed by drivers from blocking context. This
> >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > 
> > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> >     So if we could hook into that, we could have a chance that the
> >     bridge simply waits for our FDB entry offloading procedure to finish
> >     before it calls netdev_upper_dev_unlink() - which is almost
> >     immediately afterwards, and also when switchdev drivers typically
> >     break their stateful associations between the bridge upper and
> >     private data.
> > 
> > So it is in fact possible to use switchdev's generic
> > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > from there we can call_switchdev_blocking_notifiers().
> > 
> > To address all requirements:
> > 
> > - drivers that are unconverted from atomic to blocking still work
> > - drivers that currently have a private workqueue are not worse off
> > - drivers that want the bridge to wait for their deferred work can use
> >   the bridge's defer mechanism
> > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> >   parties does not get deferred for no reason, because this takes the
> >   rtnl_mutex and schedules a worker thread for nothing
> > 
> > it looks like we can in fact start off by emitting
> > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > interested party can set this to true.
> > 
> > This way:
> > 
> > - unconverted drivers do their work (i.e. schedule their private work
> >   item) based on the atomic notifier, and do not set "needs_defer"
> > - converted drivers only mark "needs_defer" and treat a separate
> >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > 
> > Additionally, code paths that are blocking right not, like br_fdb_replay,
> > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > consumers of the replayed FDB events support that (right now, that is
> > DSA and dpaa2-switch).
> > 
> > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > needs_defer as appropriate, then the notifiers emitted from process
> > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > directly, and we would also have fully blocking context all the way
> > down, with the opportunity for error propagation and extack.
> 
> IIUC, at this stage all the FDB notifications drivers get are blocking,
> either from the work queue (because they were deferred) or directly from
> process context. If so, how do we synchronize the two and ensure drivers
> get the notifications at the correct order?

What does 'at this stage' mean? Does it mean 'assuming the patch we're
discussing now gets accepted'? If that's what it means, then 'at this
stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
then would set needs_defer, then would receive the blocking
FDB_ADD_TO_DEVICE.

Thinking a bit more - this two-stage notification process ends up being
more efficient for br_fdb_replay too. We don't queue up FDB entries
except if the driver tells us that it wants to process them in blocking
context.

> I was thinking of adding all the notifications to the 'deferred' list
> when 'hash_lock' is held and then calling switchdev_deferred_process()
> directly in process context. It's not very pretty (do we return an error
> only for the entry the user added or for any other entry we flushed from
> the list?), but I don't have a better idea right now.

I was thinking to add a switchdev_fdb_add_to_device_now(). As opposed to
the switchdev_fdb_add_to_device() which defers, this does not defer at
all but just call_blocking_switchdev_notifiers(). So it would not go
through switchdev_deferred_enqueue.  For the code path I talked above,
we would temporarily drop the spin_lock, then call
switchdev_fdb_add_to_device_now(), then if that fails, take the
spin_lock again and delete the software fdb entry we've just added.

So as long as we use a _now() variant and don't resynchronize with the
deferred work, we shouldn't have any ordering issues, or am I
misunderstanding your question?

> 
> > 
> > Some disadvantages of this solution though:
> > 
> > - A driver now needs to check whether it is interested in an event
> >   twice: first on the atomic call chain, then again on the blocking call
> >   chain (because it is a notifier chain, it is potentially not the only
> >   driver subscribed to it, it may be listening to another driver's
> >   needs_defer request). The flip side: on sistems with mixed switchdev
> >   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB
> >   entries on foreign interfaces), there are some "synergies", and the
> >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as
> >   opposed to what would happen if each driver scheduled its own private
> >   work item.
> > 
> > - Right now drivers take rtnl_lock() as soon as their private work item
> >   runs. They need the rtnl_lock for the call to
> >   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 
> 
> I think RCU is enough?

Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set
would notify back rtnetlink, but it looks like it doesn't.

> >   But it doesn't really seem necessary for them to perform the actual
> >   hardware manipulation (adding the FDB entry) with the rtnl_lock held
> >   (anyway most do that). But with the new option of servicing
> >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken
> >   top-level by switchdev, so even if these drivers wanted to be more
> >   self-conscious, they couldn't.
> 
> Yes, I want to remove this dependency in mlxsw assuming notifications
> remain atomic. The more pressing issue is actually removing it from the
> learning path.

Bah, I understand where you're coming from, but it would be tricky to
remove the rtnl_lock from switchdev_deferred_process_work (that's what
it boils down to). My switchdev_handle_fdb_add_to_device helper currently
assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE,
it would still need to traverse the netdev adjacency lists, so it would
need the rtnl_mutex for that. If we remove the rtnl_lock from
switchdev_deferred_process_work we'd have to add it back in DSA and to
any other callers of switchdev_handle_fdb_add_to_device.