[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZdWKUcK+fESrcLaH@xpf.sh.intel.com>
Date: Wed, 21 Feb 2024 13:29:53 +0800
From: Pengfei Xu <pengfei.xu@...el.com>
To: Jason Xing <kerneljasonxing@...il.com>
CC: "kuniyu@...zon.com" <kuniyu@...zon.com>, "davem@...emloft.net"
<davem@...emloft.net>, "dumazet@...gle.com" <dumazet@...gle.com>,
"kuba@...nel.org" <kuba@...nel.org>, "pabeni@...hat.com" <pabeni@...hat.com>,
"kuni1840@...il.com" <kuni1840@...il.com>, "netdev@...r.kernel.org"
<netdev@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, lkp <lkp@...el.com>
Subject: Re: [Syzkaller & bisect] There is BUG: soft lockup after sendmsg
syscall in v6.8-rc4
Hi Jsason,
On 2024-02-21 at 11:07:44 +0800, Jason Xing wrote:
> On Wed, Feb 21, 2024 at 10:35 AM Pengfei Xu <pengfei.xu@...el.com> wrote:
> >
> > Hi Kuniyuki Iwashima and kernel experts,
> >
> > Greeting!
> > There is BUG: soft lockup after sendmsg syscall in v6.8-rc4 in guest.
> >
> > All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240220_161242_softlockup
> > Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/240220_161242_softlockup/repro.c
> > Syzkaller syscall reproduced steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240220_161242_softlockup/repro.prog
> > Kconfig(need make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240220_161242_softlockup/kconfig_origin
> > Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240220_161242_softlockup/bisect_info.log
> > v6.8-rc4 issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240220_161242_softlockup/841c35169323cd833294798e58b9bf63fa4fa1de_dmesg.log
> > bzImage_v6.8-rc4: https://github.com/xupengfe/syzkaller_logs/raw/main/240220_161242_softlockup/bzImage_v6.8-rc4.tar.gz
> >
> > Bisected and found first bad commit:
> > "
> > 1279f9d9dec2 af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.
> > "
> >
> > After reverted above commit on top of v6.8-rc4 kernel, this issue was gone.
> >
> > Syzkaller repro.report: https://github.com/xupengfe/syzkaller_logs/blob/main/240220_161242_softlockup/repro.report
> > "
> > watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [gdbus:349]
> > Modules linked in:
> > irq event stamp: 25162
> > hardirqs last enabled at (25161): [<ffffffff855dff2e>] irqentry_exit+0x3e/0xa0 kernel/entry/common.c:351
> > hardirqs last disabled at (25162): [<ffffffff855dded4>] sysvec_apic_timer_interrupt+0x14/0xc0 arch/x86/kernel/apic/apic.c:1076
> > softirqs last enabled at (25140): [<ffffffff8127fcc8>] invoke_softirq kernel/softirq.c:427 [inline]
> > softirqs last enabled at (25140): [<ffffffff8127fcc8>] __irq_exit_rcu+0xa8/0x110 kernel/softirq.c:632
> > softirqs last disabled at (25135): [<ffffffff8127fcc8>] invoke_softirq kernel/softirq.c:427 [inline]
> > softirqs last disabled at (25135): [<ffffffff8127fcc8>] __irq_exit_rcu+0xa8/0x110 kernel/softirq.c:632
> > CPU: 0 PID: 349 Comm: gdbus Not tainted 6.8.0-rc4-2024-02-12-intel-next-e92619743fcb+ #1
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:__sanitizer_cov_trace_pc+0x38/0x70 kernel/kcov.c:207
> > Code: 65 8b 05 73 89 a0 7e 48 89 e5 48 8b 75 08 a9 00 01 ff 00 74 0f f6 c4 01 74 35 8b 82 e4 1d 00 00 85 c0 74 2b 8b 82 c0 1d 00 00 <83> f8 02 75 20 48 8b 8a c8 1d 00 00 8b 92 c4 1d 00 00 48 8b 01 48
> > RSP: 0018:ffff88800b48f7b0 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffff888013018000 RCX: 1ffffffff12150bb
> > RDX: ffff888011ab8000 RSI: ffffffff8515e235 RDI: ffff888013018770
> > RBP: ffff88800b48f7b0 R08: 0000000000000001 R09: fffffbfff120f86e
> > R10: ffffffff8907c377 R11: 0000000000000001 R12: ffff888013018000
> > R13: dffffc0000000000 R14: ffff888013018630 R15: ffff88800b48f840
> > FS: 0000000000000000(0000) GS:ffff88806ca00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007fffcc5fd648 CR3: 0000000006c82000 CR4: 0000000000750ef0
> > PKRU: 55555554
> > Call Trace:
> > <IRQ>
> > </IRQ>
> > <TASK>
> > unix_gc+0x465/0xfd0 net/unix/garbage.c:319
> > unix_release_sock+0xb8c/0xe80 net/unix/af_unix.c:683
> > unix_release+0x9c/0x100 net/unix/af_unix.c:1064
> > __sock_release+0xb6/0x280 net/socket.c:659
> > sock_close+0x27/0x40 net/socket.c:1421
> > __fput+0x284/0xb70 fs/file_table.c:376
> > ____fput+0x1f/0x30 fs/file_table.c:404
> > task_work_run+0x19d/0x2b0 kernel/task_work.c:180
> > exit_task_work include/linux/task_work.h:38 [inline]
> > do_exit+0xb51/0x28c0 kernel/exit.c:871
> > do_group_exit+0xe5/0x2c0 kernel/exit.c:1020
> > get_signal+0x2715/0x27d0 kernel/signal.c:2893
> > arch_do_signal_or_restart+0x8e/0x7e0 arch/x86/kernel/signal.c:311
> > exit_to_user_mode_loop kernel/entry/common.c:105 [inline]
> > exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
> > __syscall_exit_to_user_mode_work kernel/entry/common.c:201 [inline]
> > syscall_exit_to_user_mode+0x129/0x190 kernel/entry/common.c:212
> > do_syscall_64+0x82/0x150 arch/x86/entry/common.c:89
> > entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > RIP: 0033:0x7fe8f8b4296f
> > Code: Unable to access opcode bytes at 0x7fe8f8b42945.
> > RSP: 002b:00007fe8d7dff9c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
> > RAX: fffffffffffffdfc RBX: 00007fe8f8f73071 RCX: 00007fe8f8b4296f
> > RDX: 00000000ffffffff RSI: 0000000000000002 RDI: 00007fe8a801f3f0
> > RBP: 00007fe8a801f3f0 R08: 0000000000000000 R09: 00007fe8d7dff850
> > R10: 00007ffec9196080 R11: 0000000000000293 R12: 0000000000000002
> > R13: 0000000000000002 R14: 00007fe8d7dffa30 R15: 00007fe8a801c4c0
> > </TASK>
> > "
>
> Hello, did you try this one:
>
> commit 25236c91b5ab4a26a56ba2e79b8060cf4e047839
> Author: Kuniyuki Iwashima <kuniyu@...zon.com>
> Date: Fri Feb 9 14:04:53 2024 -0800
>
> af_unix: Fix task hung while purging oob_skb in GC.
>
> syzbot reported a task hung; at the same time, GC was looping infinitely
> in list_for_each_entry_safe() for OOB skb. [0]
>
> syzbot demonstrated that the list_for_each_entry_safe() was not actually
> safe in this case.
>
> A single skb could have references for multiple sockets. If we free such
> a skb in the list_for_each_entry_safe(), the current and next sockets could
> be unlinked in a single iteration.
>
> It should work, I think :)
Thanks for your information!
I tried the v6.8-rc5 kernel which contains the commit 25236c91b5ab4 and
the changed code in file net/unix/garbage.c was same.
The problem can still be reproduced in v6.8-rc5, seems there is still some
problem left.
"
[ 192.656972] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [GUsbEventThread:601]
[ 192.657524] Modules linked in:
[ 192.657714] irq event stamp: 23074
[ 192.657914] hardirqs last enabled at (23073): [<ffffffff85584fce>] irqentry_exit+0x3e/0xa0
[ 192.658446] hardirqs last disabled at (23074): [<ffffffff85582dc3>] sysvec_apic_timer_interrupt+0x13/0xd0
[ 192.659023] softirqs last enabled at (23058): [<ffffffff855cd1a0>] __do_softirq+0x4f0/0x791
[ 192.659542] softirqs last disabled at (23053): [<ffffffff8126ab41>] irq_exit_rcu+0xc1/0x110
[ 192.660058] CPU: 0 PID: 601 Comm: GUsbEventThread Not tainted 6.8.0-rc5-kvm #15
[ 192.660490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 192.661163] RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x70
[ 192.661485] Code: 48 8b 80 d8 1d 00 00 c3 cc cc cc cc 31 c0 5d c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 8
[ 192.662586] RSP: 0018:ffff888007967798 EFLAGS: 00000246
[ 192.662909] RAX: ffff88800fe12500 RBX: ffffed1000f67eee RCX: ffffffff84b2f34f
[ 192.663337] RDX: 0000000000000000 RSI: ffffffff85103edf RDI: 0000000000000005
[ 192.663763] RBP: ffff888007967918 R08: ffff88800fe12500 R09: ffffed10035c57d5
[ 192.664187] R10: ffffed10035c57d4 R11: ffff88801ae2bea7 R12: ffff888007b3f630
[ 192.664611] R13: ffff888007b3f770 R14: dffffc0000000000 R15: 0000000000000000
[ 192.665061] FS: 0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
[ 192.665549] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 192.665896] CR2: 00007f34b0ccd274 CR3: 0000000006a7e000 CR4: 0000000000750ef0
[ 192.666350] PKRU: 55555554
[ 192.666532] Call Trace:
[ 192.666690] <IRQ>
[ 192.666824] ? show_regs+0xa2/0xb0
[ 192.667056] ? watchdog_timer_fn+0x507/0x6a0
[ 192.667325] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 192.667620] ? __hrtimer_run_queues+0x61a/0xba0
[ 192.667920] ? __pfx___hrtimer_run_queues+0x10/0x10
[ 192.668229] ? ktime_get_update_offsets_now+0x25b/0x340
[ 192.668553] ? hrtimer_interrupt+0x326/0x7b0
[ 192.668832] ? __sysvec_apic_timer_interrupt+0x101/0x390
[ 192.669163] ? sysvec_apic_timer_interrupt+0x9c/0xd0
[ 192.669466] </IRQ>
[ 192.669605] <TASK>
[ 192.669745] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 192.670092] ? kfree_skb_reason+0x2af/0x2f0
[ 192.670360] ? unix_gc+0x98f/0x1120
[ 192.670591] ? __pfx___sanitizer_cov_trace_pc+0x10/0x10
[ 192.670926] ? unix_gc+0x9ae/0x1120
[ 192.671155] ? __pfx_unix_gc+0x10/0x10
[ 192.671402] ? mark_held_locks+0xb7/0x140
[ 192.671662] ? skb_dequeue+0x168/0x1e0
[ 192.671901] ? __this_cpu_preempt_check+0x20/0x30
[ 192.672213] ? fib_table_lookup+0xaf0/0x2590
[ 192.672482] ? skb_dequeue+0x168/0x1e0
[ 192.672723] unix_release_sock+0xb7a/0xfd0
[ 192.672988] ? __pfx_unix_release_sock+0x10/0x10
[ 192.673281] ? down_write+0x183/0x1e0
[ 192.673510] ? __pfx_down_write+0x10/0x10
[ 192.673766] unix_release+0x9c/0x100
[ 192.674009] __sock_release+0xba/0x280
[ 192.674250] ? __pfx_sock_close+0x10/0x10
[ 192.674499] sock_close+0x25/0x30
[ 192.674713] __fput+0x273/0xb70
[ 192.674933] ____fput+0x1e/0x30
[ 192.675140] task_work_run+0x1a4/0x2d0
[ 192.675380] ? __pfx_task_work_run+0x10/0x10
[ 192.675644] ? free_nsproxy+0x3c5/0x4f0
[ 192.675885] ? switch_task_namespaces+0xf7/0x130
[ 192.676174] do_exit+0xae1/0x3000
[ 192.676391] ? __this_cpu_preempt_check+0x20/0x30
[ 192.676675] ? lock_release+0x3f8/0x770
[ 192.676925] ? __this_cpu_preempt_check+0x20/0x30
[ 192.677221] ? get_signal+0xa87/0x2740
[ 192.677466] ? __pfx_do_exit+0x10/0x10
[ 192.677704] ? _raw_spin_unlock_irq+0x2b/0x60
[ 192.677983] ? get_signal+0xa87/0x2740
[ 192.678227] ? __this_cpu_preempt_check+0x20/0x30
[ 192.678528] do_group_exit+0xe0/0x2b0
[ 192.678764] get_signal+0x26f7/0x2740
[ 192.679004] ? poll_select_set_timeout+0xf5/0x160
[ 192.679298] ? __pfx_lock_release+0x10/0x10
[ 192.679564] ? __sanitizer_cov_trace_cmp8+0x1c/0x30
[ 192.679869] ? __pfx_get_signal+0x10/0x10
[ 192.680130] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 192.680462] ? kvm_clock_read+0x3f/0x60
[ 192.680707] arch_do_signal_or_restart+0x87/0x930
[ 192.681007] ? __pfx_arch_do_signal_or_restart+0x10/0x10
[ 192.681336] ? __this_cpu_preempt_check+0x20/0x30
[ 192.681636] ? lockdep_hardirqs_on+0x8a/0x110
[ 192.681917] syscall_exit_to_user_mode+0x178/0x260
[ 192.682222] do_syscall_64+0x83/0x150
[ 192.682459] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 192.682764] RIP: 0033:0x7f34b134296f
[ 192.682990] Code: Unable to access opcode bytes at 0x7f34b1342945.
[ 192.683336] RSP: 002b:00007f34ae9ff9b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
[ 192.683766] RAX: fffffffffffffdfc RBX: 0000564d3236d8a0 RCX: 00007f34b134296f
[ 192.684183] RDX: 00000000000007d0 RSI: 0000000000000002 RDI: 00007f349c000b60
[ 192.684601] RBP: 0000564d3236da48 R08: 0000000000000000 R09: 00000000000007d0
[ 192.685017] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000007d0
[ 192.685423] R13: 0000000000000002 R14: 00007f34b45a6340 R15: 00007f349c000b60
[ 192.685846] </TASK>
[ 192.685989] Kernel panic - not syncing: softlockup: hung tasks
[ 192.686326] CPU: 0 PID: 601 Comm: GUsbEventThread Tainted: G L 6.8.0-rc5-kvm #15
[ 192.686818] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 192.687464] Call Trace:
[ 192.687613] <IRQ>
[ 192.687741] dump_stack_lvl+0xa9/0x110
[ 192.687979] dump_stack+0x19/0x20
[ 192.688186] panic+0x6b9/0x720
[ 192.688374] ? __pfx_panic+0x10/0x10
[ 192.688604] ? watchdog_timer_fn+0x571/0x6a0
[ 192.688867] ? watchdog_timer_fn+0x564/0x6a0
[ 192.689144] watchdog_timer_fn+0x582/0x6a0
[ 192.689390] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 192.689667] __hrtimer_run_queues+0x61a/0xba0
[ 192.689949] ? __pfx___hrtimer_run_queues+0x10/0x10
[ 192.690241] ? ktime_get_update_offsets_now+0x25b/0x340
[ 192.690567] hrtimer_interrupt+0x326/0x7b0
[ 192.690823] __sysvec_apic_timer_interrupt+0x101/0x390
[ 192.691131] sysvec_apic_timer_interrupt+0x9c/0xd0
[ 192.691415] </IRQ>
[ 192.691543] <TASK>
[ 192.691675] asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 192.691986] RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x70
[ 192.692301] Code: 48 8b 80 d8 1d 00 00 c3 cc cc cc cc 31 c0 5d c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 8
[ 192.693367] RSP: 0018:ffff888007967798 EFLAGS: 00000246
[ 192.693680] RAX: ffff88800fe12500 RBX: ffffed1000f67eee RCX: ffffffff84b2f34f
[ 192.694101] RDX: 0000000000000000 RSI: ffffffff85103edf RDI: 0000000000000005
[ 192.694505] RBP: ffff888007967918 R08: ffff88800fe12500 R09: ffffed10035c57d5
[ 192.694917] R10: ffffed10035c57d4 R11: ffff88801ae2bea7 R12: ffff888007b3f630
[ 192.695338] R13: ffff888007b3f770 R14: dffffc0000000000 R15: 0000000000000000
[ 192.695757] ? kfree_skb_reason+0x2af/0x2f0
[ 192.696020] ? unix_gc+0x98f/0x1120
[ 192.696237] ? unix_gc+0x9ae/0x1120
[ 192.696461] ? __pfx_unix_gc+0x10/0x10
[ 192.696701] ? mark_held_locks+0xb7/0x140
[ 192.696963] ? skb_dequeue+0x168/0x1e0
[ 192.697196] ? __this_cpu_preempt_check+0x20/0x30
[ 192.697493] ? fib_table_lookup+0xaf0/0x2590
[ 192.697757] ? skb_dequeue+0x168/0x1e0
[ 192.698000] unix_release_sock+0xb7a/0xfd0
[ 192.698264] ? __pfx_unix_release_sock+0x10/0x10
[ 192.698548] ? down_write+0x183/0x1e0
[ 192.698771] ? __pfx_down_write+0x10/0x10
[ 192.699026] unix_release+0x9c/0x100
[ 192.699254] __sock_release+0xba/0x280
[ 192.699490] ? __pfx_sock_close+0x10/0x10
[ 192.699742] sock_close+0x25/0x30
[ 192.699965] __fput+0x273/0xb70
[ 192.700170] ____fput+0x1e/0x30
[ 192.700365] task_work_run+0x1a4/0x2d0
[ 192.700603] ? __pfx_task_work_run+0x10/0x10
[ 192.700864] ? free_nsproxy+0x3c5/0x4f0
[ 192.701106] ? switch_task_namespaces+0xf7/0x130
[ 192.701387] do_exit+0xae1/0x3000
[ 192.701592] ? __this_cpu_preempt_check+0x20/0x30
[ 192.701879] ? lock_release+0x3f8/0x770
[ 192.702128] ? __this_cpu_preempt_check+0x20/0x30
[ 192.702420] ? get_signal+0xa87/0x2740
[ 192.702662] ? __pfx_do_exit+0x10/0x10
[ 192.702907] ? _raw_spin_unlock_irq+0x2b/0x60
[ 192.703185] ? get_signal+0xa87/0x2740
[ 192.703428] ? __this_cpu_preempt_check+0x20/0x30
[ 192.703723] do_group_exit+0xe0/0x2b0
[ 192.703966] get_signal+0x26f7/0x2740
[ 192.704203] ? poll_select_set_timeout+0xf5/0x160
[ 192.704490] ? __pfx_lock_release+0x10/0x10
[ 192.704763] ? __sanitizer_cov_trace_cmp8+0x1c/0x30
[ 192.705072] ? __pfx_get_signal+0x10/0x10
[ 192.705324] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 192.705649] ? kvm_clock_read+0x3f/0x60
[ 192.705900] arch_do_signal_or_restart+0x87/0x930
[ 192.706207] ? __pfx_arch_do_signal_or_restart+0x10/0x10
[ 192.706538] ? __this_cpu_preempt_check+0x20/0x30
[ 192.706833] ? lockdep_hardirqs_on+0x8a/0x110
[ 192.707120] syscall_exit_to_user_mode+0x178/0x260
[ 192.707418] do_syscall_64+0x83/0x150
[ 192.707652] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 192.707978] RIP: 0033:0x7f34b134296f
[ 192.708199] Code: Unable to access opcode bytes at 0x7f34b1342945.
[ 192.708569] RSP: 002b:00007f34ae9ff9b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
[ 192.709020] RAX: fffffffffffffdfc RBX: 0000564d3236d8a0 RCX: 00007f34b134296f
[ 192.709449] RDX: 00000000000007d0 RSI: 0000000000000002 RDI: 00007f349c000b60
[ 192.709873] RBP: 0000564d3236da48 R08: 0000000000000000 R09: 00000000000007d0
[ 192.710313] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000007d0
[ 192.710746] R13: 0000000000000002 R14: 00007f34b45a6340 R15: 00007f349c000b60
[ 192.711205] </TASK>
[ 192.711808] Kernel Offset: disabled
[ 192.712120] ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---
"
Best Regards,
Thanks!
>
> Thanks,
> Jason
>
> >
> > Thanks!
> >
> > ---
Powered by blists - more mailing lists