Message-ID: <20210823110046.xuuo37kpsxdbl6c2@skbuf>
Date: Mon, 23 Aug 2021 14:00:46 +0300
From: Vladimir Oltean <olteanv@...il.com>
To: Ido Schimmel <idosch@...sch.org>
Cc: Nikolay Aleksandrov <nikolay@...dia.com>,
Vladimir Oltean <vladimir.oltean@....com>,
netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Roopa Prabhu <roopa@...dia.com>, Andrew Lunn <andrew@...n.ch>,
Florian Fainelli <f.fainelli@...il.com>,
Vivien Didelot <vivien.didelot@...il.com>,
Vadym Kochan <vkochan@...vell.com>,
Taras Chornyi <tchornyi@...vell.com>,
Jiri Pirko <jiri@...dia.com>, Ido Schimmel <idosch@...dia.com>,
UNGLinuxDriver@...rochip.com,
Grygorii Strashko <grygorii.strashko@...com>,
Marek Behun <kabel@...ckhole.sk>,
DENG Qingfang <dqfext@...il.com>,
Kurt Kanzenbach <kurt@...utronix.de>,
Hauke Mehrtens <hauke@...ke-m.de>,
Woojung Huh <woojung.huh@...rochip.com>,
Sean Wang <sean.wang@...iatek.com>,
Landen Chao <Landen.Chao@...iatek.com>,
Claudiu Manoil <claudiu.manoil@....com>,
Alexandre Belloni <alexandre.belloni@...tlin.com>,
George McCollister <george.mccollister@...il.com>,
Ioana Ciornei <ioana.ciornei@....com>,
Saeed Mahameed <saeedm@...dia.com>,
Leon Romanovsky <leon@...nel.org>,
Lars Povlsen <lars.povlsen@...rochip.com>,
Steen Hegelund <Steen.Hegelund@...rochip.com>,
Julian Wiedmann <jwi@...ux.ibm.com>,
Karsten Graul <kgraul@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ibm.com>,
Ivan Vecera <ivecera@...hat.com>,
Vlad Buslov <vladbu@...dia.com>,
Jianbo Liu <jianbol@...dia.com>,
Mark Bloch <mbloch@...dia.com>, Roi Dayan <roid@...dia.com>,
Tobias Waldekranz <tobias@...dekranz.com>,
Vignesh Raghavendra <vigneshr@...com>,
Jesse Brandeburg <jesse.brandeburg@...el.com>
Subject: Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
blocking
On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> > > > deferred by drivers even from code paths that are initially blocking
> > > > (are running in process context):
> > > >
> > > > br_fdb_add
> > > > -> __br_fdb_add
> > > > -> fdb_add_entry
> > > > -> fdb_notify
> > > > -> br_switchdev_fdb_notify
> > > >
> > > > It seems fairly trivial to move the fdb_notify call outside of the
> > > > atomic section of fdb_add_entry, but with switchdev offering only an
> > > > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> > > > still have to defer these events and are unable to provide
> > > > synchronous feedback to user space (error codes, extack).
> > > >
> > > > The above issues would warrant an attempt to fix a central problem, and
> > > > make switchdev expose an API that is easier to consume rather than
> > > > having drivers implement lateral workarounds.
> > > >
> > > > In this case, we must notice that
> > > >
> > > > (a) switchdev already has the concept of notifiers emitted from the fast
> > > > path that are still processed by drivers from blocking context. This
> > > > is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> > > > e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > > >
> > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> > > > So if we could hook into that, we could have a chance that the
> > > > bridge simply waits for our FDB entry offloading procedure to finish
> > > > before it calls netdev_upper_dev_unlink() - which is almost
> > > > immediately afterwards, and also when switchdev drivers typically
> > > > break their stateful associations between the bridge upper and
> > > > private data.
> > > >
> > > > So it is in fact possible to use switchdev's generic
> > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > > > from there we can call_switchdev_blocking_notifiers().
> > > >
> > > > To address all requirements:
> > > >
> > > > - drivers that are unconverted from atomic to blocking still work
> > > > - drivers that currently have a private workqueue are not worse off
> > > > - drivers that want the bridge to wait for their deferred work can use
> > > > the bridge's defer mechanism
> > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> > > > parties does not get deferred for no reason, because this takes the
> > > > rtnl_mutex and schedules a worker thread for nothing
> > > >
> > > > it looks like we can in fact start off by emitting
> > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > > > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > > > interested party can set this to true.
> > > >
> > > > This way:
> > > >
> > > > - unconverted drivers do their work (i.e. schedule their private work
> > > > item) based on the atomic notifier, and do not set "needs_defer"
> > > > - converted drivers only mark "needs_defer" and treat a separate
> > > > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> > > > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > >
> > > > Additionally, code paths that are blocking right now, like br_fdb_replay,
> > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > > > consumers of the replayed FDB events support that (right now, that is
> > > > DSA and dpaa2-switch).
> > > >
> > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > > > needs_defer as appropriate, then the notifiers emitted from process
> > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > directly, and we would also have fully blocking context all the way
> > > > down, with the opportunity for error propagation and extack.
> > >
> > > IIUC, at this stage all the FDB notifications drivers get are blocking,
> > > either from the work queue (because they were deferred) or directly from
> > > process context. If so, how do we synchronize the two and ensure drivers
> > > get the notifications at the correct order?
> >
> > What does 'at this stage' mean? Does it mean 'assuming the patch we're
> > discussing now gets accepted'? If that's what it means, then 'at this
> > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
> > then would set needs_defer, then would receive the blocking
> > FDB_ADD_TO_DEVICE.
>
> I meant after:
>
> "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> needs_defer as appropriate, then the notifiers emitted from process
> context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> directly, and we would also have fully blocking context all the way
> down, with the opportunity for error propagation and extack."
>
> IIUC, after the conversion the 'needs_defer' is gone and all the FDB
> events are blocking? Either from syscall context or the workqueue.
We would not delete 'needs_defer'. It still offers a useful preliminary
filtering mechanism for the fast path (and for br_fdb_replay). In
retrospect, SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'
instead of jumping straight to blocking context (if we care that much about
performance). If an FDB event has no interested party (e.g. a dynamically
learned entry on a switchdev port), the bridge notifies only the atomic
call chain, essentially for nothing, and never touches the blocking chain.
> If so, I'm not sure how we synchronize the two. That is, making sure
> that an event from syscall context does not reach drivers before an
> earlier event that was added to the 'deferred' list.
>
> I mean, in syscall context we are holding RTNL so whatever is already on
> the 'deferred' list cannot be dequeued and processed.
So switchdev_deferred_process() has ASSERT_RTNL. If we call
switchdev_deferred_process() right before notifying the blocking FDB entry
in process context (where we already hold rtnl_mutex), I thought that would
be enough to ensure a synchronization point: everything that was scheduled
before is flushed now, and everything that gets scheduled while we are
running will run after we unlock the rtnl_mutex. Is that not the order we
expect? I mean, if a fast path FDB entry is being learned / deleted while
user space, say, adds that same FDB entry as static, how is the relative
ordering ensured between the two?