[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3186253.aeNJFYEL58@saruman>
Date: Mon, 29 Aug 2022 09:16:33 +0100
From: James Hogan <jhogan@...nel.org>
To: Vinicius Costa Gomes <vinicius.gomes@...el.com>
Cc: Paul Menzel <pmenzel@...gen.mpg.de>,
Tony Nguyen <anthony.l.nguyen@...el.com>,
Jesse Brandeburg <jesse.brandeburg@...el.com>,
netdev@...r.kernel.org, intel-wired-lan@...ts.osuosl.org,
Sasha Neftin <sasha.neftin@...el.com>,
Aleksandr Loktionov <aleksandr.loktionov@...el.com>
Subject: Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > James Hogan <jhogan@...nel.org> writes:
> > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > >> It was reported a RTNL deadlock in the igc driver that was causing
> > >> problems during suspend/resume.
> > >>
> > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > >> caused by taking RTNL in RPM resume path").
> > >>
> > >> Reported-by: James Hogan <jhogan@...nel.org>
> > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> > >> ---
> > >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> > >> enabled.
> > >
> > > Thanks for looking into this.
> > >
> > > This is identical to the patch I've been running for the last week. The
> > > deadlock is avoided, however I now occasionally see an assertion from
> > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > cases
> > > via the runtime_resume path, and a suspicious
> > > rcu_dereference_protected()
> > > warning (presumably due to the same issue of the lock not being taken).
> > > See here for details:
> > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> >
> > Oh, sorry. I missed the part that the rtnl assert splat was already
> > using similar/identical code to what I got/copied from igb.
> >
> > So what this seems to be telling us is that the "fix" from igb is only
> > hiding the issue,
>
> I suppose the patch just changes the assumption from "lock will never be
> held on runtime resume path" (incorrect, deadlock) to "lock will always be
> held on runtime resume path" (also incorrect, probably racy).
>
> > and we would need to remove the need for taking the
> > RTNL for the suspend/resume paths in igc and igb? (as someone else said
> > in that igb thread, iirc)
>
> (I'll defer to others on this. I'm pretty unfamiliar with networking code
> and this particular lock.)
I'd be great to have this longstanding issue properly fixed rather than having
to carry a patch locally that may not be lock safe.
Also, any tips for diagnosing the issue of the network link not coming back up
after resume? I sometimes have to unload and reload the driver module to get
it back again.
Cheers
James
Powered by blists - more mailing lists