[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20200213163221.26bd882f@kicinski-fedora-pc1c0hjn.DHCP.thefacebook.com>
Date: Thu, 13 Feb 2020 16:32:21 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Or Gerlitz <gerlitz.or@...il.com>
Cc: Or Gerlitz <ogerlitz@...lanox.com>,
Tariq Toukan <tariqt@...lanox.com>,
Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: [PATCH net] net/tls: Act on going down event
On Thu, 13 Feb 2020 22:07:19 +0200 Or Gerlitz wrote:
> On Thu, Feb 13, 2020 at 8:34 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > On Thu, 13 Feb 2020 16:54:07 +0000 Or Gerlitz wrote:
> > > By the time of the down event, the netdevice stop ndo was
> > > already called and the nic driver is likely to destroy the HW
> > > objects/constructs which are used for the tls_dev_resync op.
>
> >> @@ -1246,7 +1246,7 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
> > > - case NETDEV_DOWN:
> > > + case NETDEV_GOING_DOWN:
>
> > Now we'll have two race conditions. 1. Traffic is still running while
> > we remove the connection state. 2. We clean out the state and then
> > close the device. Between the two new connections may be installed.
> >
> > I think it's an inherently racy to only be able to perform clean up
> > while the device is still up.
>
> good points, I have to think on both of them and re/sync (..) with
> the actual design details here, I came across this while working
> on something else which is still more of a research so just throwing
> a patch here was moving too fast.
>
> Repeating myself -- by the time of the down event, the netdevice
> stop ndo was already called and the nic driver is likely to destroy
> HW objects/constructs which are used for the tls_dev_resync op.
>
> For example suppose a NIC driver uses some QP/ring to post resync
> request to the HW and these rings are part of the driver's channels and
> the channels are destroyed when the stop ndo is called - the tls code
> here uses RW synchronization between invoking the resync driver op
> to the down event handling. But the down event comes only after these
> resources were destroyed, too late. If we will safely/stop the offload
> at the going down stage, we can avoid a much more complex and per nic
> driver locking which I guess would be needed to protect against that race.
>
> thoughts?
I'm not sure what happens on the device side, so take this with
a pinch of salt.
The device flushing some state automatically on down sounds like
a pretty error prone design, I don't think you'd do that :)
So I think you're just using datapath components associated with a
vNIC/function to speed up communication with the device? That does
get painful quite quickly if the device does not have a idea of
control QPs/functions/vNICs with independent state :S
We could slice the downing in the stack into one more stage where
device is considered down but resources (mem, irq, qps) are not freed,
but I think that's not a complete fix, because one day you may want
to do BPF offloads or use QPs for devlink and those are independent of
vNIC/function state, anyway..
> - so here's
> the point -- the driver resync op may use some HW
Powered by blists - more mailing lists