lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 04 Aug 2022 22:41:01 +0100
From:   James Hogan <jhogan@...nel.org>
To:     Paul Menzel <pmenzel@...gen.mpg.de>
Cc:     Tony Nguyen <anthony.l.nguyen@...el.com>,
        Jesse Brandeburg <jesse.brandeburg@...el.com>,
        Vinicius Costa Gomes <vinicius.gomes@...el.com>,
        intel-wired-lan@...ts.osuosl.org,
        Sasha Neftin <sasha.neftin@...el.com>,
        Aleksandr Loktionov <aleksandr.loktionov@...el.com>,
        netdev@...r.kernel.org
Subject: Re: [Intel-wired-lan] I225-V (igc driver) hangs after resume in igc_resume/igc_tsn_reset

On Thursday, 4 August 2022 14:27:24 BST Paul Menzel wrote:
> Am 04.08.22 um 15:03 schrieb James Hogan:
> > On Thursday, 28 July 2022 18:36:31 BST James Hogan wrote:
> >> On Wednesday, 27 July 2022 15:37:09 BST Vinicius Costa Gomes wrote:
> >>> Yeah, I agree that it seems like the issue is something else. I would
> >>> suggest start with the "simple" things, enabling 'CONFIG_PROVE_LOCKING'
> >>> and looking at the first splat, it could be that what you are seeing is
> >>> caused by a deadlock somewhere else.
> >> 
> >> This is revealing I think (re-enabled PCIE_PTM and enabled
> >> PROVE_LOCKING).
> >> 
> >> In this case it happened within minutes of boot, but a few previous
> >> attempts with several suspend cycles with the same kernel didn't detect
> >> the same thing.
> > 
> > I hate to nag, but any thoughts on the lockdep recursive locking warning
> > below? It seems to indicate a recursive taking of rtnl_mutex in
> > dev_ethtool
> > and igc_resume, which would certainly seem to point the finger squarely
> > back at the igc driver.
> 
> I hope, the developers will respond quickly. If it is indeed a
> regression, and you do not want to wait for the developers, you could
> try to bisect the issue. To speed up the test cycles, I recommend to try
> to try to reproduce the issue in QEMU/KVM and passing through the
> network device.

Unfortunately its new hardware for me, so I don't know if there's a good 
working version of the driver. I've only had constant pain with it so far. 
Frequent failed resumes, hangs on shutdown.

However I just did a bit more research and found these dead threads from a 
year ago which appear to pinpoint the issue:
https://lore.kernel.org/all/20210420075406.64105-1-acelan.kao@canonical.com/
https://lore.kernel.org/all/20210809032809.1224002-1-acelan.kao@canonical.com/

They would appear to point to this commit as the one which triggered the 
breakage:
bd869245a3dc ("net: core: try to runtime-resume detached device in 
__dev_open")

Cc'ing Sasha and Aleks (they were added to the latter thread).
Also cc'ing netdev, since it seems to relate to interaction with common code?

Cheers
James


> >> NetworkManager[857]: <info>  [1659028974.1752] device (enp6s0): state
> >> change: activated -> unavailable (reason 'carrier-changed',
> >> sys-iface-state: 'managed')
> >> 
> >> ============================================
> >> WARNING: possible recursive locking detected
> >> 5.18.12-arch1-1 #2 Not tainted
> >> --------------------------------------------
> >> NetworkManager/857 is trying to acquire lock:
> >> ffffffff9f9e9048 (rtnl_mutex){+.+.}-{3:3}, at: igc_resume+0xf6/0x1d0
> >> [igc]
> >> 
> >> but task is already holding lock:
> >> ffffffff9f9e9048 (rtnl_mutex){+.+.}-{3:3}, at: dev_ethtool+0xaf/0x3080
> >> 
> >> other info that might help us debug this:
> >>   Possible unsafe locking scenario:
> >>         CPU0
> >>         ----
> >>    
> >>    lock(rtnl_mutex);
> >>    lock(rtnl_mutex);
> >>   
> >>   *** DEADLOCK ***
> >>   May be due to missing lock nesting notation
> >> 
> >> 1 lock held by NetworkManager/857:
> >>   #0: ffffffff9f9e9048 (rtnl_mutex){+.+.}-{3:3}, at:
> >>   dev_ethtool+0xaf/0x3080
> >> 
> >> stack backtrace:
> >> CPU: 0 PID: 857 Comm: NetworkManager Not tainted 5.18.12-arch1-1 #2
> >> 369425cead7bf2331cd4c5d2279465ad4a0fc21f Hardware name: Micro-Star
> >> International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.40
> >> 
> >> 05/17/2022 Call Trace:
> >>   <TASK>
> >>   dump_stack_lvl+0x5f/0x78
> >>   __lock_acquire.cold+0xd4/0x2e5
> >>   ? __lock_acquire+0x3b2/0x1fd0
> >>   lock_acquire+0xc8/0x2d0
> >>   ? igc_resume+0xf6/0x1d0 [igc beed6d83546b18fcf82fbbfaeea59871823bceeb]
> >>   ? lock_is_held_type+0xaa/0x120
> >>   __mutex_lock+0xb6/0x830
> >>   ? igc_resume+0xf6/0x1d0 [igc beed6d83546b18fcf82fbbfaeea59871823bceeb]
> >>   ? lockdep_hardirqs_on_prepare+0xdd/0x180
> >>   ? _raw_spin_unlock_irqrestore+0x34/0x50
> >>   ? igc_resume+0xf6/0x1d0 [igc beed6d83546b18fcf82fbbfaeea59871823bceeb]
> >>   ? igc_resume+0xf6/0x1d0 [igc beed6d83546b18fcf82fbbfaeea59871823bceeb]
> >>   igc_resume+0xf6/0x1d0 [igc beed6d83546b18fcf82fbbfaeea59871823bceeb]
> >>   pci_pm_runtime_resume+0xab/0xd0
> >>   ? pci_pm_freeze_noirq+0xe0/0xe0
> >>   __rpm_callback+0x41/0x160
> >>   rpm_callback+0x35/0x70
> >>   ? pci_pm_freeze_noirq+0xe0/0xe0
> >>   rpm_resume+0x5eb/0x820
> >>   __pm_runtime_resume+0x4b/0x80
> >>   dev_ethtool+0x128/0x3080
> >>   ? lock_is_held_type+0xaa/0x120
> >>   ? find_held_lock+0x2b/0x80
> >>   ? dev_load+0x57/0x140
> >>   ? lock_release+0xd4/0x2d0
> >>   dev_ioctl+0x155/0x560
> >>   sock_do_ioctl+0xd7/0x120
> >>   sock_ioctl+0x103/0x360
> >>   ? __fget_files+0xd2/0x170
> >>   __x64_sys_ioctl+0x8e/0xc0
> >>   do_syscall_64+0x5c/0x90
> >>   ? do_syscall_64+0x6b/0x90
> >>   ? lockdep_hardirqs_on_prepare+0xdd/0x180
> >>   ? do_syscall_64+0x6b/0x90
> >>   ? lockdep_hardirqs_on_prepare+0xdd/0x180
> >>   ? do_syscall_64+0x6b/0x90
> >>   ? asm_sysvec_apic_timer_interrupt+0xe/0x20
> >>   ? rcu_read_lock_sched_held+0x40/0x80
> >>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> 
> >> RIP: 0033:0x7f2c35d077af
> >> Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89
> >> 44
> >> 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0
> >> ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 RSP:
> >> 002b:00007ffd7319afd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX:
> >> ffffffffffffffda RBX: 00007ffd7319b2c0 RCX: 00007f2c35d077af
> >> RDX: 00007ffd7319b0f0 RSI: 0000000000008946 RDI: 0000000000000012
> >> RBP: 00007ffd7319b270 R08: 0000000000000000 R09: 00007ffd7319b2c8
> >> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >> R13: 00007ffd7319b0f0 R14: 00007ffd7319b0d0 R15: 00007ffd7319b0d0
> >> 
> >>   </TASK>




Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ