[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3329047.e9J7NaK4W3@saruman>
Date: Sun, 02 Oct 2022 11:56:28 +0100
From: James Hogan <jhogan@...nel.org>
To: Vinicius Costa Gomes <vinicius.gomes@...el.com>
Cc: Paul Menzel <pmenzel@...gen.mpg.de>,
Tony Nguyen <anthony.l.nguyen@...el.com>,
Jesse Brandeburg <jesse.brandeburg@...el.com>,
netdev@...r.kernel.org, intel-wired-lan@...ts.osuosl.org,
Sasha Neftin <sasha.neftin@...el.com>,
Aleksandr Loktionov <aleksandr.loktionov@...el.com>
Subject: Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
On Monday, 29 August 2022 09:16:33 BST James Hogan wrote:
> On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> > On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > > James Hogan <jhogan@...nel.org> writes:
> > > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > > >> It was reported a RTNL deadlock in the igc driver that was causing
> > > >> problems during suspend/resume.
> > > >>
> > > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > > >> caused by taking RTNL in RPM resume path").
> > > >>
> > > >> Reported-by: James Hogan <jhogan@...nel.org>
> > > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> > > >> ---
> > > >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> > > >> enabled.
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > This is identical to the patch I've been running for the last week.
> > > > The
> > > > deadlock is avoided, however I now occasionally see an assertion from
> > > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > > cases
> > > > via the runtime_resume path, and a suspicious
> > > > rcu_dereference_protected()
> > > > warning (presumably due to the same issue of the lock not being
> > > > taken).
> > > > See here for details:
> > > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> > >
> > > Oh, sorry. I missed the part that the rtnl assert splat was already
> > > using similar/identical code to what I got/copied from igb.
> > >
> > > So what this seems to be telling us is that the "fix" from igb is only
> > > hiding the issue,
> >
> > I suppose the patch just changes the assumption from "lock will never be
> > held on runtime resume path" (incorrect, deadlock) to "lock will always be
> > held on runtime resume path" (also incorrect, probably racy).
> >
> > > and we would need to remove the need for taking the
> > > RTNL for the suspend/resume paths in igc and igb? (as someone else said
> > > in that igb thread, iirc)
> >
> > (I'll defer to others on this. I'm pretty unfamiliar with networking code
> > and this particular lock.)
>
> I'd be great to have this longstanding issue properly fixed rather than
> having to carry a patch locally that may not be lock safe.
>
> Also, any tips for diagnosing the issue of the network link not coming back
> up after resume? I sometimes have to unload and reload the driver module to
> get it back again.
Any thoughts on this from anybody?
Cheers
James
Powered by blists - more mailing lists