lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <e2ffd3d06fad236ea900d4fb439b2240@aosc.io>
Date: Tue, 05 Nov 2024 15:17:54 +0800
From: Mingcong Bai <jeffbai@...c.io>
To: Linux regressions mailing list <regressions@...ts.linux.dev>
Cc: Frederic Weisbecker <frederic@...nel.org>, LKML
 <linux-kernel@...r.kernel.org>, "Paul E. McKenney" <paulmck@...nel.org>, rcu
 <rcu@...r.kernel.org>, sakiiily@...c.io
Subject: Re: [Regression] wifi problems since tg3 started throwing rcu stall
 warnings

Hi Frederic and Thorston,

(CC-ing the laptop's owner so that she might help with further 
testing...)

在 2024-10-23 18:22,Linux regression tracking (Thorsten Leemhuis) 写道:
> On 23.10.24 12:09, Frederic Weisbecker wrote:
>> Le Wed, Oct 23, 2024 at 10:27:18AM +0200, Linux regression tracking 
>> (Thorsten Leemhuis) a écrit :
>>> Hi, Thorsten here, the Linux kernel's regression tracker.
>>> 
>>> Frederic, I noticed a report about a regression in 
>>> bugzilla.kernel.org
>>> that appears to be caused by the following change of yours:
>>> 
>>> 55d4669ef1b768 ("rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU
>>> invocation")
> 
>> Are you sure about the commit? Below it says:
> 
> Not totally, but...
> 
>>> As many (most?) kernel developers don't keep an eye on the bug 
>>> tracker,
>>> I decided to write this mail. To quote from
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390:
>>> 
>>>>  Mingcong Bai 2024-10-15 13:32:35 UTC
>>>> 
>>>> Since aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 between v6.10.4 and
>> 
>> Now that's aa162aa4aa383a0a714b1c36e8fcc77612ddd1a2 which I can't find 
>> in vanilla
>> tree.
> 
> ...quite, as that is the commit-id of the backport to v6.10.5; and the
> reporter reverted it there. Ideally of course that would have happened
> on recent mainline. If you doubt, ask Mingcong Bai to check if a revert
> there helps, too.

Do we need any further information/testing on this issue? Please let me 
know if there's anything we can do as the issue still persists in 6.12.

Best Regards,
Mingcong Bai

> 
> HTH, Ciao, Thorsten
> 
>> Also I'm failing to see an immediate issue between the below 
>> stacktrace
>> and the above commit. So are we sure about that reference?
>> 
>> Thanks.
>> 
>> 
>>>> v6.10.5, the Broadcom Tigon3 Ethernet interface (tg3) found on Apple
>>>> MacBook Pro (15'', Mid 2010) would throw many rcu stall errors 
>>>> during
>>>> boot up, causing peripherals such as the wireless card to misbehave:
>>>> 
>>>> [   24.153855] rcu: INFO: rcu_preempt detected expedited stalls on 
>>>> CPUs/tasks: { 2-.... } 21 jiffies s: 973 root: 0x4/.
>>>> [   24.166938] rcu: blocking rcu_node structures (internal RCU 
>>>> debug):
>>>> [   24.177800] Sending NMI from CPU 3 to CPUs 2:
>>>> [   24.183113] NMI backtrace for cpu 2
>>>> [   24.183119] CPU: 2 PID: 1049 Comm: NetworkManager Not tainted 
>>>> 6.10.5-aosc-main #1
>>>> [   24.183123] Hardware name: Apple Inc. MacBookPro6,2/Mac-F22586C8, 
>>>> BIOS    MBP61.88Z.005D.B00.1804100943 04/10/18
>>>> [   24.183125] RIP: 0010:__this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183135] Code: c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 
>>>> 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 89 f6 48 03 77 
>>>> 30 8b 06 <31> f6 31 ff c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 
>>>> 90 90 90
>>>> [   24.183138] RSP: 0018:ffffbf1a011d75e8 EFLAGS: 00000082
>>>> [   24.183141] RAX: 0000000000000000 RBX: ffffa04ec78f8a00 RCX: 
>>>> 0000000000000000
>>>> [   24.183143] RDX: 0000000000000000 RSI: ffffbf1a00fb007c RDI: 
>>>> ffffa04ec78f8a00
>>>> [   24.183145] RBP: 0000000000000b50 R08: 0000000000000000 R09: 
>>>> 0000000000000000
>>>> [   24.183147] R10: 0000000000000000 R11: 0000000000000000 R12: 
>>>> 0000000000000216
>>>> [   24.183148] R13: ffffbf1a011d7624 R14: ffffa04ec78f8a08 R15: 
>>>> ffffa04ec78f8b40
>>>> [   24.183151] FS:  00007f4c524b2140(0000) GS:ffffa05007d00000(0000) 
>>>> knlGS:0000000000000000
>>>> [   24.183153] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [   24.183155] CR2: 00007f7025eae3e8 CR3: 00000001040f8000 CR4: 
>>>> 00000000000006f0
>>>> [   24.183157] Call Trace:
>>>> [   24.183162]  <NMI>
>>>> [   24.183167]  ? nmi_cpu_backtrace+0xbf/0x140
>>>> [   24.183175]  ? nmi_cpu_backtrace_handler+0x11/0x20
>>>> [   24.183181]  ? nmi_handle+0x61/0x160
>>>> [   24.183186]  ? default_do_nmi+0x42/0x110
>>>> [   24.183191]  ? exc_nmi+0x1bd/0x290
>>>> [   24.183194]  ? end_repeat_nmi+0xf/0x53
>>>> [   24.183203]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183207]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183210]  ? __this_module+0x2d3d1/0x4f310 [tg3]
>>>> [   24.183213]  </NMI>
>>>> [   24.183214]  <TASK>
>>>> [   24.183215]  __this_module+0x31828/0x4f310 [tg3]
>>>> [   24.183218]  ? __this_module+0x2d390/0x4f310 [tg3]
>>>> [   24.183221]  __this_module+0x398e6/0x4f310 [tg3]
>>>> [   24.183225]  __this_module+0x3baf8/0x4f310 [tg3]
>>>> [   24.183229]  __this_module+0x4733f/0x4f310 [tg3]
>>>> [   24.183233]  ? _raw_spin_unlock_irqrestore+0x25/0x70
>>>> [   24.183237]  ? __this_module+0x398e6/0x4f310 [tg3]
>>>> [   24.183241]  __this_module+0x4b943/0x4f310 [tg3]
>>>> [   24.183244]  ? delay_tsc+0x89/0xf0
>>>> [   24.183249]  ? preempt_count_sub+0x51/0x60
>>>> [   24.183254]  __this_module+0x4be4b/0x4f310 [tg3]
>>>> [   24.183258]  __dev_open+0x103/0x1c0
>>>> [   24.183265]  __dev_change_flags+0x1bd/0x230
>>>> [   24.183269]  ? rtnl_getlink+0x362/0x400
>>>> [   24.183276]  dev_change_flags+0x26/0x70
>>>> [   24.183280]  do_setlink+0xe16/0x11f0
>>>> [   24.183286]  ? __nla_validate_parse+0x61/0xd40
>>>> [   24.183295]  __rtnl_newlink+0x63d/0x9f0
>>>> [   24.183301]  ? kmem_cache_alloc_node_noprof+0x12b/0x360
>>>> [   24.183308]  ? kmalloc_trace_noprof+0x11e/0x350
>>>> [   24.183312]  ? rtnl_newlink+0x2e/0x70
>>>> [   24.183316]  rtnl_newlink+0x47/0x70
>>>> [   24.183320]  rtnetlink_rcv_msg+0x152/0x400
>>>> [   24.183324]  ? __netlink_sendskb+0x68/0x90
>>>> [   24.183329]  ? netlink_unicast+0x237/0x290
>>>> [   24.183333]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
>>>> [   24.183336]  netlink_rcv_skb+0x5b/0x110
>>>> [   24.183343]  netlink_unicast+0x1a4/0x290
>>>> [   24.183347]  netlink_sendmsg+0x222/0x4a0
>>>> [   24.183350]  ? proc_get_long.constprop.0+0x116/0x210
>>>> [   24.183358]  ____sys_sendmsg+0x379/0x3b0
>>>> [   24.183363]  ? copy_msghdr_from_user+0x6d/0xb0
>>>> [   24.183368]  ___sys_sendmsg+0x86/0xe0
>>>> [   24.183372]  ? addrconf_sysctl_forward+0xf3/0x270
>>>> [   24.183378]  ? _copy_from_iter+0x8b/0x570
>>>> [   24.183384]  ? __pfx_addrconf_sysctl_forward+0x10/0x10
>>>> [   24.183388]  ? _raw_spin_unlock+0x19/0x50
>>>> [   24.183392]  ? proc_sys_call_handler+0xf3/0x2f0
>>>> [   24.183397]  ? trace_hardirqs_on+0x29/0x90
>>>> [   24.183401]  ? __fdget+0xc2/0xf0
>>>> [   24.183405]  __sys_sendmsg+0x5b/0xc0
>>>> [   24.183410]  ? syscall_trace_enter+0x110/0x1b0
>>>> [   24.183416]  do_syscall_64+0x64/0x150
>>>> [   24.183423]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>> 
>>>> I have bisected the error to this commit. Reverting it caused no new 
>>>> or
>>>> perceivable issues on both the MacBook and a Zen4-based laptop.
>>> 
>>> [...]
>>> 
>>>>> Ohh, and when you say "causing peripherals such as the wireless 
>>>>> card to
>>>>> misbehave" what exactly do you mean?
>>>> 
>>>> When the kernel throws rcu stall messages, the wireless card on the
>>>> MacBook may fail to discover and/or connect to wireless networks - 
>>>> not a
>>>> consistent behaviour but I suppose that something in the kernel got 
>>>> stuck.
>>> 
>>> See the ticket for more details and dmesg logs; the problem still
>>> happens with 6.12-rc. The reporter is CCed.
>>> 
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' 
>>> hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>> 
>>> P.S.: let me use this mail to also add the report to the list of 
>>> tracked
>>> regressions to ensure it's doesn't fall through the cracks:
>>> 
>>> #regzbot introduced: 55d4669ef1b76823083caecfab12a8bd2ccdcf64
>>> #regzbot from: Mingcong Bai <jeffbai@...c.io>
>>> #regzbot duplicate: 
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219390
>>> #regzbot title: rcu: wifi problems since tg3 started throwing rcu 
>>> stall
>>> warnings
>>> #regzbot ignore-activity
>>> 
>> 
>> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ