netdev - Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO

Message-ID: <20210823110046.xuuo37kpsxdbl6c2@skbuf>
Date:   Mon, 23 Aug 2021 14:00:46 +0300
From:   Vladimir Oltean <olteanv@...il.com>
To:     Ido Schimmel <idosch@...sch.org>
Cc:     Nikolay Aleksandrov <nikolay@...dia.com>,
        Vladimir Oltean <vladimir.oltean@....com>,
        netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Roopa Prabhu <roopa@...dia.com>, Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        Vivien Didelot <vivien.didelot@...il.com>,
        Vadym Kochan <vkochan@...vell.com>,
        Taras Chornyi <tchornyi@...vell.com>,
        Jiri Pirko <jiri@...dia.com>, Ido Schimmel <idosch@...dia.com>,
        UNGLinuxDriver@...rochip.com,
        Grygorii Strashko <grygorii.strashko@...com>,
        Marek Behun <kabel@...ckhole.sk>,
        DENG Qingfang <dqfext@...il.com>,
        Kurt Kanzenbach <kurt@...utronix.de>,
        Hauke Mehrtens <hauke@...ke-m.de>,
        Woojung Huh <woojung.huh@...rochip.com>,
        Sean Wang <sean.wang@...iatek.com>,
        Landen Chao <Landen.Chao@...iatek.com>,
        Claudiu Manoil <claudiu.manoil@....com>,
        Alexandre Belloni <alexandre.belloni@...tlin.com>,
        George McCollister <george.mccollister@...il.com>,
        Ioana Ciornei <ioana.ciornei@....com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Leon Romanovsky <leon@...nel.org>,
        Lars Povlsen <lars.povlsen@...rochip.com>,
        Steen Hegelund <Steen.Hegelund@...rochip.com>,
        Julian Wiedmann <jwi@...ux.ibm.com>,
        Karsten Graul <kgraul@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Ivan Vecera <ivecera@...hat.com>,
        Vlad Buslov <vladbu@...dia.com>,
        Jianbo Liu <jianbol@...dia.com>,
        Mark Bloch <mbloch@...dia.com>, Roi Dayan <roid@...dia.com>,
        Tobias Waldekranz <tobias@...dekranz.com>,
        Vignesh Raghavendra <vigneshr@...com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>
Subject: Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
 blocking

On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> > > >    deferred by drivers even from code paths that are initially blocking
> > > >    (are running in process context):
> > > >
> > > > br_fdb_add
> > > > -> __br_fdb_add
> > > >    -> fdb_add_entry
> > > >       -> fdb_notify
> > > >          -> br_switchdev_fdb_notify
> > > >
> > > >     It seems fairly trivial to move the fdb_notify call outside of the
> > > >     atomic section of fdb_add_entry, but with switchdev offering only an
> > > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> > > >     still have to defer these events and are unable to provide
> > > >     synchronous feedback to user space (error codes, extack).
> > > >
> > > > The above issues would warrant an attempt to fix a central problem, and
> > > > make switchdev expose an API that is easier to consume rather than
> > > > having drivers implement lateral workarounds.
> > > >
> > > > In this case, we must notice that
> > > >
> > > > (a) switchdev already has the concept of notifiers emitted from the fast
> > > >     path that are still processed by drivers from blocking context. This
> > > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> > > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > > >
> > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> > > >     So if we could hook into that, we could have a chance that the
> > > >     bridge simply waits for our FDB entry offloading procedure to finish
> > > >     before it calls netdev_upper_dev_unlink() - which is almost
> > > >     immediately afterwards, and also when switchdev drivers typically
> > > >     break their stateful associations between the bridge upper and
> > > >     private data.
> > > >
> > > > So it is in fact possible to use switchdev's generic
> > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > > > from there we can call_switchdev_blocking_notifiers().
> > > >
> > > > To address all requirements:
> > > >
> > > > - drivers that are unconverted from atomic to blocking still work
> > > > - drivers that currently have a private workqueue are not worse off
> > > > - drivers that want the bridge to wait for their deferred work can use
> > > >   the bridge's defer mechanism
> > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> > > >   parties does not get deferred for no reason, because this takes the
> > > >   rtnl_mutex and schedules a worker thread for nothing
> > > >
> > > > it looks like we can in fact start off by emitting
> > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > > > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > > > interested party can set this to true.
> > > >
> > > > This way:
> > > >
> > > > - unconverted drivers do their work (i.e. schedule their private work
> > > >   item) based on the atomic notifier, and do not set "needs_defer"
> > > > - converted drivers only mark "needs_defer" and treat a separate
> > > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> > > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > >
> > > > Additionally, code paths that are blocking right now, like br_fdb_replay,
> > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > > > consumers of the replayed FDB events support that (right now, that is
> > > > DSA and dpaa2-switch).
> > > >
> > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > > > needs_defer as appropriate, then the notifiers emitted from process
> > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > directly, and we would also have fully blocking context all the way
> > > > down, with the opportunity for error propagation and extack.
> > >
> > > IIUC, at this stage all the FDB notifications drivers get are blocking,
> > > either from the work queue (because they were deferred) or directly from
> > > process context. If so, how do we synchronize the two and ensure drivers
> > > get the notifications in the correct order?
> >
> > What does 'at this stage' mean? Does it mean 'assuming the patch we're
> > discussing now gets accepted'? If that's what it means, then 'at this
> > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
> > then would set needs_defer, then would receive the blocking
> > FDB_ADD_TO_DEVICE.
>
> I meant after:
>
> "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> needs_defer as appropriate, then the notifiers emitted from process
> context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> directly, and we would also have fully blocking context all the way
> down, with the opportunity for error propagation and extack."
>
> IIUC, after the conversion the 'needs_defer' is gone and all the FDB
> events are blocking? Either from syscall context or the workqueue.

We would not delete 'needs_defer'. It still offers a useful preliminary
filtering mechanism for the fast path (and for br_fdb_replay). In
retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'
instead of jumping to blocking context (if we care so much about performance).

If an FDB event does not need to be processed by anyone (dynamically
learned entry on a switchdev port), the bridge notifies the atomic call
chain for the sake of it, but not the blocking chain.
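
As a rough illustration of the filtering described above, here is a
minimal userspace sketch, not kernel code. Everything in it is a
hypothetical stand-in: struct fdb_info models only two fields of
struct switchdev_notifier_fdb_info, the listener functions and counters
are invented for the example, and the policy "a converted driver asks
for deferral only for user-added entries" is an assumption chosen to
make the two cases visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of struct switchdev_notifier_fdb_info. */
struct fdb_info {
	bool added_by_user;
	bool needs_defer;	/* set by any interested atomic listener */
};

typedef void (*fdb_listener_t)(struct fdb_info *info);

int atomic_work_items;	/* work done straight from the atomic notifier */
int deferred_events;	/* follow-up events queued for blocking context */

/* Unconverted driver: acts on the atomic event, never sets needs_defer. */
void unconverted_listener(struct fdb_info *info)
{
	(void)info;
	atomic_work_items++;	/* stands in for scheduling a private work item */
}

/* Converted driver: only marks interest; the real work happens later. */
void converted_listener(struct fdb_info *info)
{
	if (info->added_by_user)	/* illustrative policy, an assumption */
		info->needs_defer = true;
}

/*
 * Model of the bridge side: always walk the atomic chain, and queue a
 * deferred follow-up only if some listener asked for it.
 * Returns true when a deferred event was queued.
 */
bool notify_fdb_add(struct fdb_info *info, fdb_listener_t *chain, size_t n)
{
	assert(info != NULL);

	info->needs_defer = false;
	for (size_t i = 0; i < n; i++)
		chain[i](info);

	if (!info->needs_defer)
		return false;	/* nobody interested: no worker, no rtnl_mutex */

	deferred_events++;	/* stands in for switchdev_deferred_enqueue() */
	return true;
}
```

In this model a dynamically learned entry walks the atomic chain and
stops there, while a user-added entry additionally produces exactly one
deferred follow-up.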

> If so, I'm not sure how we synchronize the two. That is, making sure
> that an event from syscall context does not reach drivers before an
> earlier event that was added to the 'deferred' list.
>
> I mean, in syscall context we are holding RTNL so whatever is already on
> the 'deferred' list cannot be dequeued and processed.

So switchdev_deferred_process() has ASSERT_RTNL. If we call
switchdev_deferred_process() right before adding the blocking FDB entry
in process context (and we already hold rtnl_mutex), I thought that would
be enough to ensure we have a synchronization point: Everything that was
scheduled before is flushed now, everything that is scheduled while we
are running will run after we unlock the rtnl_mutex. Is that not the
order we expect? I mean, if there is a fast path FDB entry being learned
/ deleted while user space, say, adds that same FDB entry as static, how
is the relative ordering ensured between the two?
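
To make the ordering I have in mind concrete, here is a single-threaded
userspace sketch, under the assumption that the deferred list is a plain
FIFO drained only with rtnl held. All names are hypothetical models:
rtnl_held stands in for rtnl_mutex, deferred_process() for
switchdev_deferred_process() (the assert plays the role of ASSERT_RTNL),
and deliver() for a driver receiving the blocking notifier. It says
nothing about races between two fast path events for the same address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 16

bool rtnl_held;				/* models rtnl_mutex ownership */
int deferred[MAX_EVENTS];		/* FIFO of deferred event ids */
size_t deferred_len;
int delivered[MAX_EVENTS];		/* order in which drivers saw events */
size_t delivered_len;

void rtnl_lock_model(void)   { assert(!rtnl_held); rtnl_held = true; }
void rtnl_unlock_model(void) { assert(rtnl_held);  rtnl_held = false; }

/* Fast path: queue an event; does not need rtnl. */
void deferred_enqueue(int id)
{
	assert(deferred_len < MAX_EVENTS);
	deferred[deferred_len++] = id;
}

/* A driver receiving an event in blocking context. */
void deliver(int id)
{
	assert(delivered_len < MAX_EVENTS);
	delivered[delivered_len++] = id;
}

/* Drain the deferred list in FIFO order; caller must hold rtnl. */
void deferred_process(void)
{
	assert(rtnl_held);		/* plays the role of ASSERT_RTNL() */
	for (size_t i = 0; i < deferred_len; i++)
		deliver(deferred[i]);
	deferred_len = 0;
}

/* Process-context FDB add: flush older events, then notify directly. */
void blocking_fdb_add(int id)
{
	assert(rtnl_held);
	deferred_process();		/* the synchronization point */
	deliver(id);
}

/* Deferred worker: can only run once it gets rtnl. */
void deferred_worker(void)
{
	rtnl_lock_model();
	deferred_process();
	rtnl_unlock_model();
}
```

With two events queued from the fast path, then a process-context add,
then one more fast path event arriving before rtnl is dropped, the model
delivers them in exactly that order.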