netdev - Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YSORsKDOwklF19Gm@shredder>
Date:   Mon, 23 Aug 2021 15:16:48 +0300
From:   Ido Schimmel <idosch@...sch.org>
To:     Vladimir Oltean <olteanv@...il.com>
Cc:     Nikolay Aleksandrov <nikolay@...dia.com>,
        Vladimir Oltean <vladimir.oltean@....com>,
        netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Roopa Prabhu <roopa@...dia.com>, Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Vadym Kochan <vkochan@...vell.com>,
        Taras Chornyi <tchornyi@...vell.com>,
        Jiri Pirko <jiri@...dia.com>, Ido Schimmel <idosch@...dia.com>,
        UNGLinuxDriver@...rochip.com,
        Grygorii Strashko <grygorii.strashko@...com>,
        Marek Behun <kabel@...ckhole.sk>,
        DENG Qingfang <dqfext@...il.com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Hauke Mehrtens <hauke@...ke-m.de>,
        Woojung Huh <woojung.huh@...rochip.com>,
        Sean Wang <sean.wang@...iatek.com>,
        Landen Chao <Landen.Chao@...iatek.com>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Alexandre Belloni <alexandre.belloni@...tlin.com>,
        George McCollister <george.mccollister@...il.com>,
        Ioana Ciornei <ioana.ciornei@....com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Leon Romanovsky <leon@...nel.org>,
        Lars Povlsen <lars.povlsen@...rochip.com>,
        Steen Hegelund <Steen.Hegelund@...rochip.com>,
        Julian Wiedmann <jwi@...ux.ibm.com>,
        Karsten Graul <kgraul@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Ivan Vecera <ivecera@...hat.com>,
        Vlad Buslov <vladbu@...dia.com>,
        Jianbo Liu <jianbol@...dia.com>,
        Mark Bloch <mbloch@...dia.com>, Roi Dayan <roid@...dia.com>,
        Tobias Waldekranz <tobias@...dekranz.com>,
        Vignesh Raghavendra <vigneshr@...com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>
Subject: Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
 blocking

On Mon, Aug 23, 2021 at 02:00:46PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:
> > On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> > > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> > > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> > > > >    deferred by drivers even from code paths that are initially blocking
> > > > >    (are running in process context):
> > > > >
> > > > > br_fdb_add
> > > > > -> __br_fdb_add
> > > > >    -> fdb_add_entry
> > > > >       -> fdb_notify
> > > > >          -> br_switchdev_fdb_notify
> > > > >
> > > > >     It seems fairly trivial to move the fdb_notify call outside of the
> > > > >     atomic section of fdb_add_entry, but with switchdev offering only an
> > > > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> > > > >     still have to defer these events and are unable to provide
> > > > >     synchronous feedback to user space (error codes, extack).
> > > > >
> > > > > The above issues would warrant an attempt to fix a central problem, and
> > > > > make switchdev expose an API that is easier to consume rather than
> > > > > having drivers implement lateral workarounds.
> > > > >
> > > > > In this case, we must notice that
> > > > >
> > > > > (a) switchdev already has the concept of notifiers emitted from the fast
> > > > >     path that are still processed by drivers from blocking context. This
> > > > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> > > > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > > > >
> > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> > > > >     So if we could hook into that, we could have a chance that the
> > > > >     bridge simply waits for our FDB entry offloading procedure to finish
> > > > >     before it calls netdev_upper_dev_unlink() - which is almost
> > > > >     immediately afterwards, and also when switchdev drivers typically
> > > > >     break their stateful associations between the bridge upper and
> > > > >     private data.
> > > > >
> > > > > So it is in fact possible to use switchdev's generic
> > > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > > > > from there we can call_switchdev_blocking_notifiers().
> > > > >
> > > > > To address all requirements:
> > > > >
> > > > > - drivers that are unconverted from atomic to blocking still work
> > > > > - drivers that currently have a private workqueue are not worse off
> > > > > - drivers that want the bridge to wait for their deferred work can use
> > > > >   the bridge's defer mechanism
> > > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> > > > >   parties does not get deferred for no reason, because this takes the
> > > > >   rtnl_mutex and schedules a worker thread for nothing
> > > > >
> > > > > it looks like we can in fact start off by emitting
> > > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > > > > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > > > > interested party can set this to true.
> > > > >
> > > > > This way:
> > > > >
> > > > > - unconverted drivers do their work (i.e. schedule their private work
> > > > >   item) based on the atomic notifier, and do not set "needs_defer"
> > > > > - converted drivers only mark "needs_defer" and treat a separate
> > > > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> > > > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > >
> > > > > Additionally, code paths that are blocking right not, like br_fdb_replay,
> > > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > > > > consumers of the replayed FDB events support that (right now, that is
> > > > > DSA and dpaa2-switch).
> > > > >
> > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > > > > needs_defer as appropriate, then the notifiers emitted from process
> > > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > > directly, and we would also have fully blocking context all the way
> > > > > down, with the opportunity for error propagation and extack.
> > > >
> > > > IIUC, at this stage all the FDB notifications drivers get are blocking,
> > > > either from the work queue (because they were deferred) or directly from
> > > > process context. If so, how do we synchronize the two and ensure drivers
> > > > get the notifications at the correct order?
> > >
> > > What does 'at this stage' mean? Does it mean 'assuming the patch we're
> > > discussing now gets accepted'? If that's what it means, then 'at this
> > > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
> > > then would set needs_defer, then would receive the blocking
> > > FDB_ADD_TO_DEVICE.
> >
> > I meant after:
> >
> > "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > needs_defer as appropriate, then the notifiers emitted from process
> > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > directly, and we would also have fully blocking context all the way
> > down, with the opportunity for error propagation and extack."
> >
> > IIUC, after the conversion the 'needs_defer' is gone and all the FDB
> > events are blocking? Either from syscall context or the workqueue.
> 
> We would not delete 'needs_defer'. It still offers a useful preliminary
> filtering mechanism for the fast path (and for br_fdb_replay). In
> retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'
> instead of jumping to blocking context (if we care so much about performance).
> 
> If a FDB event does not need to be processed by anyone (dynamically
> learned entry on a switchdev port), the bridge notifies the atomic call
> chain for the sake of it, but not the blocking chain.
> 
> > If so, I'm not sure how we synchronize the two. That is, making sure
> > that an event from syscall context does not reach drivers before an
> > earlier event that was added to the 'deferred' list.
> >
> > I mean, in syscall context we are holding RTNL so whatever is already on
> > the 'deferred' list cannot be dequeued and processed.
> 
> So switchdev_deferred_process() has ASSERT_RTNL. If we call
> switchdev_deferred_process() right before adding the blocking FDB entry
> in process context (and we already hold rtnl_mutex), I though that would
> be enough to ensure we have a synchronization point: Everything that was
> scheduled before is flushed now, everything that is scheduled while we
> are running will run after we unlock the rtnl_mutex. Is that not the
> order we expect? I mean, if there is a fast path FDB entry being learned
> / deleted while user space say adds that same FDB entry as static, how
> is the relative ordering ensured between the two?

I was thinking about the following case:

t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
     response to STP state. Notifications are added to 'deferred' list
t2 - switchdev_deferred_process() is called in syscall context
t3 - <MAC1,VID1,P1> is notified as blocking

Updates to the SW FDB are protected by 'hash_lock', but updates to the
HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
it will exist in HW.

Another case assuming switchdev_deferred_process() is called first:

t0 - switchdev_deferred_process() is called in syscall context
t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added
     to 'deferred' list
t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
     <MAC1,VID1,P2>
t3 - <MAC1,VID1,P2> is notified as blocking
t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
     list is processed)

In this case, the HW will have <MAC1,VID1,P1>, but SW will have
<MAC1,VID1,P2>