lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3186253.aeNJFYEL58@saruman>
Date:   Mon, 29 Aug 2022 09:16:33 +0100
From:   James Hogan <jhogan@...nel.org>
To:     Vinicius Costa Gomes <vinicius.gomes@...el.com>
Cc:     Paul Menzel <pmenzel@...gen.mpg.de>,
        Tony Nguyen <anthony.l.nguyen@...el.com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>,
        netdev@...r.kernel.org, intel-wired-lan@...ts.osuosl.org,
        Sasha Neftin <sasha.neftin@...el.com>,
        Aleksandr Loktionov <aleksandr.loktionov@...el.com>
Subject: Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path

On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > James Hogan <jhogan@...nel.org> writes:
> > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > >> It was reported a RTNL deadlock in the igc driver that was causing
> > >> problems during suspend/resume.
> > >> 
> > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > >> caused by taking RTNL in RPM resume path").
> > >> 
> > >> Reported-by: James Hogan <jhogan@...nel.org>
> > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> > >> ---
> > >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> > >> enabled.
> > > 
> > > Thanks for looking into this.
> > > 
> > > This is identical to the patch I've been running for the last week. The
> > > deadlock is avoided, however I now occasionally see an assertion from
> > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > cases
> > > via the runtime_resume path, and a suspicious
> > > rcu_dereference_protected()
> > > warning (presumably due to the same issue of the lock not being taken).
> > > See here for details:
> > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> > 
> > Oh, sorry. I missed the part that the rtnl assert splat was already
> > using similar/identical code to what I got/copied from igb.
> > 
> > So what this seems to be telling us is that the "fix" from igb is only
> > hiding the issue,
> 
> I suppose the patch just changes the assumption from "lock will never be
> held on runtime resume path" (incorrect, deadlock) to "lock will always be
> held on runtime resume path" (also incorrect, probably racy).
> 
> > and we would need to remove the need for taking the
> > RTNL for the suspend/resume paths in igc and igb? (as someone else said
> > in that igb thread, iirc)
> 
> (I'll defer to others on this. I'm pretty unfamiliar with networking code
> and this particular lock.)

I'd be great to have this longstanding issue properly fixed rather than having 
to carry a patch locally that may not be lock safe.

Also, any tips for diagnosing the issue of the network link not coming back up 
after resume? I sometimes have to unload and reload the driver module to get 
it back again.

Cheers
James


Powered by blists - more mailing lists