Message-ID: <20201123112130.759b9487@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date: Mon, 23 Nov 2020 11:21:30 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Leon Romanovsky <leon@...nel.org>,
Jason Wang <jasowang@...hat.com>,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Petr Mladek <pmladek@...e.com>,
John Ogness <john.ogness@...utronix.de>,
virtualization@...ts.linux-foundation.org,
Amit Shah <amit@...nel.org>, Itay Aveksis <itayav@...dia.com>,
Ran Rozenstein <ranro@...dia.com>,
netdev <netdev@...r.kernel.org>
Subject: Re: netconsole deadlock with virtnet

On Mon, 23 Nov 2020 14:09:34 -0500 Steven Rostedt wrote:
> On Mon, 23 Nov 2020 10:52:52 -0800
> Jakub Kicinski <kuba@...nel.org> wrote:
>
> > On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > > On Mon, 23 Nov 2020 13:08:55 +0200
> > > Leon Romanovsky <leon@...nel.org> wrote:
> > >
> > >
> > > > [ 10.028024] Chain exists of:
> > > > [ 10.028025] console_owner --> target_list_lock --> _xmit_ETHER#2
> > >
> > > Note, the problem is that we have a location that grabs the xmit_lock while
> > > holding target_list_lock (and possibly console_owner).
> >
> > Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> >
> > (not that I condone the shenanigans that are going on here)
>
> Does it?
>
> 	virtnet_poll_tx() {
> 		__netif_tx_lock() {
> 			spin_lock(&txq->_xmit_lock);

Umpf. Right. I was looking at virtnet_poll_cleantx()

> That looks like we can have:
>
>
>        CPU0                            CPU1
>        ----                            ----
>    lock(xmit_lock)
>
>                                    lock(console)
>                                    lock(target_list_lock)
>                                    __netif_tx_lock()
>                                        lock(xmit_lock);
>
>                                        [BLOCKED]
>
>    <interrupt>
>    lock(console)
>
>    [BLOCKED]
>
>
>
>  DEADLOCK.
>
>
> So where is the trylock here?
>
> Perhaps you need the trylock in virtnet_poll_tx()?

That could work. Best if we used the normal lock when budget is non-zero,
and trylock when budget is 0. But maybe that's too hairy.
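
To make the budget-based idea concrete, here is a minimal userspace sketch,
not kernel code. It assumes budget == 0 marks the netpoll/netconsole
caller, and it uses a pthread mutex as a stand-in for txq->_xmit_lock. The
names txq_model, txq_lock_for_poll() and poll_tx_model() are hypothetical
and do not exist in virtio_net.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for a TX queue; xmit_lock models txq->_xmit_lock. */
struct txq_model {
	pthread_mutex_t xmit_lock;
	int cleaned;	/* how many times completion cleanup ran */
};

/*
 * Hypothetical helper: block on the lock when budget is non-zero, but
 * only try it when budget is 0 (assumed here to be the netpoll case).
 */
static bool txq_lock_for_poll(struct txq_model *txq, int budget)
{
	if (budget)
		return pthread_mutex_lock(&txq->xmit_lock) == 0;
	return pthread_mutex_trylock(&txq->xmit_lock) == 0;
}

/* Model of a TX poll routine; returns true if cleanup actually ran. */
static bool poll_tx_model(struct txq_model *txq, int budget)
{
	if (!txq_lock_for_poll(txq, budget))
		return false;	/* lock is held elsewhere: skip this round */

	txq->cleaned++;		/* stand-in for reaping completed buffers */
	pthread_mutex_unlock(&txq->xmit_lock);
	return true;
}
```

With budget == 0 and the lock already held, poll_tx_model() returns without
blocking, which is what would break the cycle in the trace above. Skipped
completions would have to be reaped by a later poll.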

I'm assuming all this trickiness comes from virtqueue_get_buf() needing
locking vs the TX path? It's pretty unusual for the completion path to
need locking vs xmit path.