Message-ID: <a354fadd-268f-8119-d37a-102e5efa1437@quicinc.com>
Date: Mon, 18 Oct 2021 23:47:32 -0400
From: Qian Cai <quic_qiancai@...cinc.com>
To: Peter Zijlstra <peterz@...radead.org>, <gor@...ux.ibm.com>,
<jpoimboe@...hat.com>, <jikos@...nel.org>, <mbenes@...e.cz>,
<pmladek@...e.com>, <mingo@...nel.org>
CC: <linux-kernel@...r.kernel.org>, <joe.lawrence@...hat.com>,
<fweisbec@...il.com>, <tglx@...utronix.de>, <hca@...ux.ibm.com>,
<svens@...ux.ibm.com>, <sumanthk@...ux.ibm.com>,
<live-patching@...r.kernel.org>, <paulmck@...nel.org>,
<rostedt@...dmis.org>, <x86@...nel.org>
Subject: Re: [PATCH v2 04/11] sched: Simplify wake_up_*idle*()

Peter, any thoughts? I did confirm that reverting the commit fixed the issue.

On 10/13/2021 10:32 AM, Qian Cai wrote:
>
>
> On 9/29/2021 11:17 AM, Peter Zijlstra wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -1170,14 +1170,14 @@ void wake_up_all_idle_cpus(void)
>>  {
>>  	int cpu;
>>
>> -	preempt_disable();
>> +	cpus_read_lock();
>>  	for_each_online_cpu(cpu) {
>> -		if (cpu == smp_processor_id())
>> +		if (cpu == raw_smp_processor_id())
>>  			continue;
>>
>>  		wake_up_if_idle(cpu);
>>  	}
>> -	preempt_enable();
>> +	cpus_read_unlock();
>
> Peter, it looks like this thing introduced a deadlock during CPU online/offline.
>
> [ 630.145166][ T129] WARNING: possible recursive locking detected
> [ 630.151164][ T129] 5.15.0-rc5-next-20211013+ #145 Not tainted
> [ 630.156988][ T129] --------------------------------------------
> [ 630.162984][ T129] cpuhp/21/129 is trying to acquire lock:
> [ 630.168547][ T129] ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: wake_up_all_idle_cpus+0x40/0xe8
> wake_up_all_idle_cpus at /usr/src/linux-next/kernel/smp.c:1174
> [ 630.178040][ T129]
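> [ 630.178040][ T129] but task is already holding lock:
> [ ... ][ T129] ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
> [ ... ][ T129] other info that might help us debug this: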
> [ 630.202292][ T129] Possible unsafe locking scenario:
> [ 630.202292][ T129]
> [ 630.209590][ T129] CPU0
> [ 630.212720][ T129] ----
> [ 630.215851][ T129] lock(cpu_hotplug_lock);
> [ 630.220202][ T129] lock(cpu_hotplug_lock);
> [ 630.224553][ T129]
> [ 630.224553][ T129] *** DEADLOCK ***
> [ 630.224553][ T129]
> [ 630.232545][ T129] May be due to missing lock nesting notation
> [ 630.232545][ T129]
> [ 630.240711][ T129] 3 locks held by cpuhp/21/129:
> [ 630.245406][ T129] #0: ffff800011f466d0 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
> [ 630.254976][ T129] #1: ffff800011f46780 (cpuhp_state-up){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x588
> [ 630.264372][ T129] #2: ffff8000191fb9c8 (cpuidle_lock){+.+.}-{3:3}, at: cpuidle_pause_and_lock+0x24/0x38
> [ 630.274031][ T129]
> [ 630.274031][ T129] stack backtrace:
> [ 630.279767][ T129] CPU: 21 PID: 129 Comm: cpuhp/21 Not tainted 5.15.0-rc5-next-20211013+ #145
> [ 630.288371][ T129] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> [ 630.296886][ T129] Call trace:
> [ 630.300017][ T129] dump_backtrace+0x0/0x3b8
> [ 630.304369][ T129] show_stack+0x20/0x30
> [ 630.308371][ T129] dump_stack_lvl+0x8c/0xb8
> [ 630.312722][ T129] dump_stack+0x1c/0x38
> [ 630.316723][ T129] validate_chain+0x1d84/0x1da0
> [ 630.321421][ T129] __lock_acquire+0xab0/0x2040
> [ 630.326033][ T129] lock_acquire+0x32c/0xb08
> [ 630.330390][ T129] cpus_read_lock+0x94/0x308
> [ 630.334827][ T129] wake_up_all_idle_cpus+0x40/0xe8
> [ 630.339784][ T129] cpuidle_uninstall_idle_handler+0x3c/0x50
> [ 630.345524][ T129] cpuidle_pause_and_lock+0x28/0x38
> [ 630.350569][ T129] acpi_processor_hotplug+0xc0/0x170
> [ 630.355701][ T129] acpi_soft_cpu_online+0x124/0x250
> [ 630.360745][ T129] cpuhp_invoke_callback+0x51c/0x2ab8
> [ 630.365963][ T129] cpuhp_thread_fun+0x204/0x588
> [ 630.370659][ T129] smpboot_thread_fn+0x3f0/0xc40
> [ 630.375444][ T129] kthread+0x3d8/0x488
> [ 630.379360][ T129] ret_from_fork+0x10/0x20
> [ 863.525716][ T191] INFO: task cpuhp/21:129 blocked for more than 122 seconds.
> [ 863.532954][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
> [ 863.539361][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 863.547927][ T191] task:cpuhp/21 state:D stack:59104 pid: 129 ppid: 2 flags:0x00000008
> [ 863.557029][ T191] Call trace:
> [ 863.560171][ T191] __switch_to+0x184/0x400
> [ 863.564448][ T191] __schedule+0x74c/0x1940
> [ 863.568753][ T191] schedule+0x110/0x318
> [ 863.572764][ T191] percpu_rwsem_wait+0x1b8/0x348
> [ 863.577592][ T191] __percpu_down_read+0xb0/0x148
> [ 863.582386][ T191] cpus_read_lock+0x2b0/0x308
> [ 863.586961][ T191] wake_up_all_idle_cpus+0x40/0xe8
> [ 863.591931][ T191] cpuidle_uninstall_idle_handler+0x3c/0x50
> [ 863.597716][ T191] cpuidle_pause_and_lock+0x28/0x38
> [ 863.602771][ T191] acpi_processor_hotplug+0xc0/0x170
> [ 863.607946][ T191] acpi_soft_cpu_online+0x124/0x250
> [ 863.613001][ T191] cpuhp_invoke_callback+0x51c/0x2ab8
> [ 863.618261][ T191] cpuhp_thread_fun+0x204/0x588
> [ 863.622967][ T191] smpboot_thread_fn+0x3f0/0xc40
> [ 863.627787][ T191] kthread+0x3d8/0x488
> [ 863.631712][ T191] ret_from_fork+0x10/0x20
> [ 863.636020][ T191] INFO: task kworker/0:2:189 blocked for more than 122 seconds.
> [ 863.643500][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
> [ 863.649882][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 863.658425][ T191] task:kworker/0:2 state:D stack:58368 pid: 189 ppid: 2 flags:0x00000008
> [ 863.667516][ T191] Workqueue: events vmstat_shepherd
> [ 863.672573][ T191] Call trace:
> [ 863.675731][ T191] __switch_to+0x184/0x400
> [ 863.680001][ T191] __schedule+0x74c/0x1940
> [ 863.684268][ T191] schedule+0x110/0x318
> [ 863.688295][ T191] percpu_rwsem_wait+0x1b8/0x348
> [ 863.693085][ T191] __percpu_down_read+0xb0/0x148
> [ 863.697892][ T191] cpus_read_lock+0x2b0/0x308
> [ 863.702421][ T191] vmstat_shepherd+0x5c/0x1a8
> [ 863.706977][ T191] process_one_work+0x808/0x19d0
> [ 863.711767][ T191] worker_thread+0x334/0xae0
> [ 863.716227][ T191] kthread+0x3d8/0x488
> [ 863.720149][ T191] ret_from_fork+0x10/0x20
> [ 863.724487][ T191] INFO: task lsbug:4642 blocked for more than 123 seconds.
> [ 863.731565][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
> [ 863.737938][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 863.746490][ T191] task:lsbug state:D stack:55536 pid: 4642 ppid: 4638 flags:0x00000008
> [ 863.755549][ T191] Call trace:
> [ 863.758712][ T191] __switch_to+0x184/0x400
> [ 863.762984][ T191] __schedule+0x74c/0x1940
> [ 863.767286][ T191] schedule+0x110/0x318
> [ 863.771294][ T191] schedule_timeout+0x188/0x238
> [ 863.776016][ T191] wait_for_completion+0x174/0x290
> [ 863.780979][ T191] __cpuhp_kick_ap+0x158/0x1a8
> [ 863.785592][ T191] cpuhp_kick_ap+0x1f0/0x828
> [ 863.790053][ T191] bringup_cpu+0x180/0x1e0
> [ 863.794320][ T191] cpuhp_invoke_callback+0x51c/0x2ab8
> [ 863.799561][ T191] cpuhp_invoke_callback_range+0xa4/0x108
> [ 863.805130][ T191] cpu_up+0x528/0xd78
> [ 863.808982][ T191] cpu_device_up+0x4c/0x68
> [ 863.813249][ T191] cpu_subsys_online+0xc0/0x1f8
> [ 863.817972][ T191] device_online+0x10c/0x180
> [ 863.822413][ T191] online_store+0x10c/0x118
> [ 863.826791][ T191] dev_attr_store+0x44/0x78
> [ 863.831148][ T191] sysfs_kf_write+0xe8/0x138
> [ 863.835590][ T191] kernfs_fop_write_iter+0x26c/0x3d0
> [ 863.840745][ T191] new_sync_write+0x2bc/0x4f8
> [ 863.845275][ T191] vfs_write+0x714/0xcd8
> [ 863.849387][ T191] ksys_write+0xf8/0x1e0
> [ 863.853481][ T191] __arm64_sys_write+0x74/0xa8
> [ 863.858113][ T191] invoke_syscall.constprop.0+0xdc/0x1d8
> [ 863.863597][ T191] do_el0_svc+0xe4/0x298
> [ 863.867710][ T191] el0_svc+0x64/0x130
> [ 863.871545][ T191] el0t_64_sync_handler+0xb0/0xb8
> [ 863.876437][ T191] el0t_64_sync+0x180/0x184
> [ 863.880798][ T191] INFO: task mount:4682 blocked for more than 123 seconds.
> [ 863.887881][ T191] Not tainted 5.15.0-rc5-next-20211013+ #145
> [ 863.894232][ T191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 863.902776][ T191] task:mount state:D stack:55856 pid: 4682 ppid: 1101 flags:0x00000000
> [ 863.911865][ T191] Call trace:
> [ 863.915003][ T191] __switch_to+0x184/0x400
> [ 863.919296][ T191] __schedule+0x74c/0x1940
> [ 863.923564][ T191] schedule+0x110/0x318
> [ 863.927590][ T191] percpu_rwsem_wait+0x1b8/0x348
> [ 863.932380][ T191] __percpu_down_read+0xb0/0x148
> [ 863.937187][ T191] cpus_read_lock+0x2b0/0x308
> [ 863.941715][ T191] alloc_workqueue+0x730/0xd48
> [ 863.946357][ T191] loop_configure+0x2d4/0x1180 [loop]
> [ 863.951592][ T191] lo_ioctl+0x5dc/0x1228 [loop]
> [ 863.956321][ T191] blkdev_ioctl+0x258/0x820
> [ 863.960678][ T191] __arm64_sys_ioctl+0x114/0x180
> [ 863.965468][ T191] invoke_syscall.constprop.0+0xdc/0x1d8
> [ 863.970974][ T191] do_el0_svc+0xe4/0x298
> [ 863.975069][ T191] el0_svc+0x64/0x130
> [ 863.978922][ T191] el0t_64_sync_handler+0xb0/0xb8
> [ 863.983798][ T191] el0t_64_sync+0x180/0x184
> [ 863.988172][ T191] INFO: lockdep is turned off.
>
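
For anyone following the traces above, here is a rough userspace sketch of how I read the hang. It is a toy model, not kernel code. Every toy_* name is made up for illustration, and the "trylock" only reports where the real cpus_read_lock() would sleep instead of actually blocking. It assumes the onlining task (lsbug, in cpu_up()) holds cpu_hotplug_lock for writing while it waits for the cpuhp/21 thread to finish its callbacks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for cpu_hotplug_lock: one writer or any number of readers. */
struct toy_hotplug_lock {
	bool writer_held;
	int readers;
};

/*
 * Hypothetical helper: returns false at the point where the real
 * cpus_read_lock() would go to sleep waiting for the writer.
 */
static bool toy_cpus_read_trylock(struct toy_hotplug_lock *l)
{
	if (l->writer_held)
		return false;
	l->readers++;
	return true;
}

static void toy_cpus_read_unlock(struct toy_hotplug_lock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

/* Shape of wake_up_all_idle_cpus() after the quoted hunk. */
static bool toy_wake_up_all_idle_cpus(struct toy_hotplug_lock *l)
{
	if (!toy_cpus_read_trylock(l))
		return false;	/* the real function would block here */
	/* for_each_online_cpu(cpu) wake_up_if_idle(cpu); omitted */
	toy_cpus_read_unlock(l);
	return true;
}

/*
 * Models the online path: the controlling task takes the lock for
 * writing, then a hotplug callback on the cpuhp thread ends up in
 * wake_up_all_idle_cpus() (via cpuidle_pause_and_lock() in the trace).
 */
static bool toy_cpu_online(struct toy_hotplug_lock *l)
{
	bool callback_finished;

	l->writer_held = true;
	callback_finished = toy_wake_up_all_idle_cpus(l);
	l->writer_held = false;
	return callback_finished;
}
```

Outside of hotplug, toy_wake_up_all_idle_cpus() returns true. Called from toy_cpu_online() it returns false, which corresponds to cpuhp/21 sleeping in percpu_rwsem_wait() while lsbug sleeps in wait_for_completion(). The other blocked readers (vmstat_shepherd, alloc_workqueue) would then queue up behind the same writer.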