[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHS8izNUi1v3sjODWYm4jNEb6uOq44RD0d015G=7aXEYMvrinQ@mail.gmail.com>
Date: Wed, 16 Apr 2025 08:47:14 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Stanislav Fomichev <stfomichev@...il.com>, Taehee Yoo <ap420073@...il.com>, davem@...emloft.net,
pabeni@...hat.com, edumazet@...gle.com, andrew+netdev@...n.ch,
horms@...nel.org, asml.silence@...il.com, dw@...idwei.uk, sdf@...ichev.me,
skhawaja@...gle.com, simona.vetter@...ll.ch, kaiyuanz@...gle.com,
netdev@...r.kernel.org
Subject: Re: [PATCH net] net: devmem: fix kernel panic when socket close after
module unload
On Tue, Apr 15, 2025 at 7:59 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Tue, 15 Apr 2025 11:59:40 -0700 Stanislav Fomichev wrote:
> > > commit 42f342387841 ("net: fix use-after-free in the
> > > netdev_nl_sock_priv_destroy()") and rolling back a few fixes, it's
> > > really introduced by commit 1d22d3060b9b ("net: drop rtnl_lock for
> > > queue_mgmt operations").
> > >
> > > My first question, does this issue still reproduce if you remove the
> > > per netdev locking and go back to relying on rtnl_locking? Or do we
> > > crash somewhere else in net_devmem_unbind_dmabuf? If so, where?
> > > Looking through the rest of the unbinding code, it's not clear to me
> > > any of it actually uses dev, so it may just be the locking...
> >
> > A proper fix, most likely, will involve resetting binding->dev to NULL
> > when the device is going away.
>
> Right, tho a bit of work and tricky handling will be necessary to get
> that right. We're not holding a ref on binding->dev.
>
> I think we need to invert the socket mutex vs instance lock ordering.
> Make the priv mutex protect the binding->list and binding->dev.
> For that to work the binding needs to also store a pointer to its
> owning socket?
>
> Then in both uninstall paths (from socket and from netdev unreg) we can
> take the socket mutex, delete from list, clear the ->dev pointer,
> unlock, release the ref on the binding.
>
I don't like that the ref obtained by the socket can be released by
both the socket and the netdev unreg :( It creates a weird mental
model where the ref owned by the socket can actually be dropped by the
netdev unreg path and then the socket close needs to detect that
something else dropped its ref. It also creates a weird scenario where
the device got unregistered and reregistered (I assume that's
possible? Or no?) and the socket is alive and the device is registered
but actually the binding is not active.
> The socket close path would probably need to lock the socket, look at
> the first entry, if entry has ->dev call netdev_hold(), release the
> socket, lock the netdev, lock the socket again, look at the ->dev, if
> NULL we raced - done. If not NULL release the socket, call unbind.
> netdev_put(). Restart this paragraph.
>
> I can't think of an easier way.
>
How about, roughly:
- the binding holds a ref on dev, making sure that the dev is alive
until the last ref on the binding is dropped and the binding is freed.
- The ref owned by the socket is only ever dropped by the socket close.
- When we netdev_lock(binding->dev) to later do a
net_devmem_dmabuf_unbind, we must first grab another ref on the
binding->dev, so that it doesn't get freed if the unbind drops the
last ref.
I think that would work too?
Can you remind me why we do a dev_memory_provider_uninstall on a
device unregister? If the device gets unregistered then re-registered
(again, I'm kinda assuming that is possible, I'm not sure) I expect it
to still be memory provider bound, because the netlink socket is still
alive and the userspace is still expecting a live binding. Maybe
delete the dev_memory_provider_uninstall code I added on unregister,
and sorry I put it there...? Or is there some reason I'm forgetting
that we have to uninstall the memory provider on unregister?
--
Thanks,
Mina
Powered by blists - more mailing lists