linux-kernel - Re: [PATCH] sched/core: fix affine_move

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xhsmh4jd0da97.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 20 Mar 2024 19:32:52 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Daniel Vacek <neelx@...hat.com>, Ingo Molnar <mingo@...hat.com>, Peter
 Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Daniel Bristot
 de Oliveira <bristot@...hat.com>
Cc: Daniel Vacek <neelx@...hat.com>, stable@...r.kernel.org, Bill Peters
 <wpeters@...co.net>, Ingo Molnar <mingo@...nel.org>,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/core: fix affine_move_task failure case

On 20/03/24 19:17, Valentin Schneider wrote:
> On 20/03/24 15:03, Daniel Vacek wrote:
>> Hi Valentin,
>>
>> On Mon, Mar 18, 2024 at 6:34 PM Valentin Schneider <vschneid@...hat.com> wrote:
>>> On 18/03/24 12:17, Daniel Vacek wrote:
>>> > Bill Peters reported CPU hangs while offlining/onlining CPUs on s390.
>>> >
>>> > Analyzing the vmcore data shows `stop_one_cpu_nowait()` in `affine_move_task()`
>>> > can fail when racing with off-/on-lining resulting in a deadlock waiting for
>>> > the pending migration stop work completion which is never done.
>>> >
>>> > Fix this by correctly handling such a condition.
>>> >
>>>
>>> IIUC the problem is that the dest_cpu and its stopper thread can be taken
>>> down by take_cpu_down(), and affine_move_task() currently isn't aware of
>>> that. I thought we had tested this vs hotplug, but oh well...
>>
>> I'm sorry, I should have provided more context in the first place. The machine
>> is an LPAR with 2 CPUs and CPU 0 was onlining (hotplugging?) CPU 1. The traces
>> show this scenario:
>>
>>             CPU 0                       |           CPU 1
>>                                         |
>>         cpuplugd task 1429              |
>>  holds the `cpu_hotplug_lock`           |
>>     for writing in _cpu_up+0x16a        |
>> blocked on `cpuhp_state:1.done_up`      |
>>      completion in __cpuhp_kick_ap+0x76 |
>>                                         |
>>                                         |       cpuhp/1 task 17
>>                                         |supposed to complete bringup of the CPU
>>                                         |     (`cpuhp_state:1.done_up`) in cpuhp_thread_fun+0x108
>>                                         |blocked on `wq_pool_attach_mutex`
>>                                         |    in workqueue_online_cpu+0x9e
>>                                         |
>>         xfs-conv/dm-0 task 745          |
>>  holds the `wq_pool_attach_mutex`       |
>>     in worker_attach_to_pool+0x66        \
>> blocked on `task->migration_pending->done`|
>>      completion in affine_move_task+0x10a/
>>
>> ~~~
>> crash> b 1429
>> PID: 1429     TASK: 99398000          CPU: 0    COMMAND: "cpuplugd"
>>  #0 [997df970] __schedule+0x34c at 3089c424
>>  #1 [997df9e0] schedule+0x7e at 3089cafe
>>  #2 [997dfa20] schedule_timeout+0x26e at 308a1d8e
>>      [inlined] do_wait_for_common
>>      [inlined] __wait_for_common
>>  #3 [997dfad8] wait_for_common+0x14a at 3089d902
>>     [ret call] wait_for_completion+0x1a at 3089d96a
>>
>>      [inlined] wait_for_ap_thread                      <<< blocked on `cpuhp_state:1.done_up` completion
>>     [ret call] __cpuhp_kick_ap+0x76 at 300c610e
>>  #4 [997dfb58] cpuhp_kick_ap+0xc4 at 300c61dc
>>      [inlined] bringup_wait_for_ap
>>     [ret call] bringup_cpu+0xea at 300c6402
>>  #5 [997dfba8] cpuhp_invoke_callback+0xcc at 300c4f14
>>  #6 [997dfc40] _cpu_up+0x16a at 300c798a               <<< holds the `cpu_hotplug_lock` for writing
>>  #7 [997dfc98] do_cpu_up+0xc6 at 300c7b66
>>  #8 [997dfcd8] cpu_subsys_online+0x58 at 305a0a00
>>  #9 [997dfd28] device_online+0x9e at 30598e7e
>> #10 [997dfd68] online_store+0x88 at 30598f28
>> #11 [997dfda8] kernfs_fop_write+0xdc at 3040276c
>> #12 [997dfdf8] vfs_write+0xa8 at 30354760
>> #13 [997dfe58] ksys_write+0x62 at 30354a32
>>
>> crash> cpuhp_cpu_state.state cpuhp_state:a | paste - -
>> [0]: 1aef424e0      state = CPUHP_ONLINE,               # (195)
>> [1]: 1aef654e0      state = CPUHP_AP_WORKQUEUE_ONLINE,  # (159)
>>
>> crash> cpuhp_cpu_state.bringup,thread,done_up.done cpuhp_state:1 -d | paste - - - -
>> [1]: 1aef654e0      bringup = true,      thread = 0x81134400,      done_up.done = 0,  <<<
>>
>> crash> b 17
>> PID: 17       TASK: 81134400          CPU: 1    COMMAND: "cpuhp/1"
>>  #0 [81143b68] __schedule+0x34c at 3089c424
>>  #1 [81143bd8] schedule+0x7e at 3089cafe
>>  #2 [81143c18] schedule_preempt_disabled+0x2a at 3089cfba
>>  #3 [81143c30] __mutex_lock+0x320 at 3089df60
>>
>>  #4 [81143cb0] workqueue_online_cpu+0x9e at 300e847e   <<< blocked on `wq_pool_attach_mutex`
>>  #5 [81143d20] cpuhp_invoke_callback+0xcc at 300c4f14
>>  #6 [81143db8] cpuhp_thread_fun+0x108 at 300c6848      <<< supposed to complete the bring-up of the CPU (`cpuhp_state:1.done_up`)
>>
>> crash> b 745
>> PID: 745      TASK: 82359100          CPU: 0    COMMAND: "xfs-conv/dm-0"
>>  #0 [8b4bfa20] __schedule+0x34c at 3089c424
>>  #1 [8b4bfa90] schedule+0x7e at 3089cafe
>>  #2 [8b4bfad0] schedule_timeout+0x26e at 308a1d8e
>>      [inlined] do_wait_for_common
>>      [inlined] __wait_for_common
>>  #3 [8b4bfb88] wait_for_common+0x14a at 3089d902
>>     [ret call] wait_for_completion+0x1a at 3089d96a
>>
>>  #4 [8b4bfc08] affine_move_task+0x10a at 300fb51a        <<< blocked on `task->migration_pending->done` completion
>>  #5 [8b4bfd08] __set_cpus_allowed_ptr+0x12e at 300fb926
>>     [ret call] set_cpus_allowed_ptr+0xa at 300fba32
>>  #6 [8b4bfd78] worker_attach_to_pool+0x66 at 300e1dae    <<< holds the `wq_pool_attach_mutex`
>>  #7 [8b4bfdc8] rescuer_thread+0x12c at 300e5bac
>>
>> crash> rx 8b4bfea0
>>         8b4bfea0:  [863373c0:kmalloc-192] 
>>
>> crash> worker.task,rescue_wq 863373c0
>>   task = 0x82359100,
>>   rescue_wq = 0x8aa44400,
>>
>> crash> list -s pool_workqueue.pool pool_workqueue.mayday_node -hO workqueue_struct.maydays 0x8aa44400 | paste - -
>> 1fffff7f751900    pool = 0x1aef56a00,
>>
>> crash> worker_pool.attrs 0x1aef56a00
>>   attrs = 0x80088180,
>>
>> crash> workqueue_attrs.cpumask[0].bits 0x80088180
>>   cpumask[0].bits = {0x1, 0x0, ...
>>
>> crash> cpumask.bits __cpu_active_mask
>>   bits = {0x1, 0x0, ...
>>
>> crash> cpumask.bits __cpu_online_mask
>>   bits = {0x3, 0x0, ...
>>
>> crash> task_struct.migration_pending,flags 0x82359100
>>     migration_pending = 0x8b4bfce8,
>>   flags = 0x4208060,
>>              ^ PF_KTHREAD
>>
>> crash> pd distribute_cpu_mask_prev:0
>> per_cpu(distribute_cpu_mask_prev, 0) = 0
>>
>> crash> set_affinity_pending.refs.refs.counter,arg,stop_pending,done.done 0x8b4bfce8 -d
>>   refs.refs.counter = 1
>>   arg = {
>>     task = 0x82359100,
>>     dest_cpu = 0,
>>     pending = 0x8b4bfce8
>>   }
>>   stop_pending = 1,
>>   done.done = 0,
>> ~~~
>>
>> In other words the `set_cpus_allowed_ptr()` is called from a worker thread which
>> tries to migrate. The worker pool is only allowed on CPU 0 and that was supposed
>> to be the destination as per the stack structure. In this case I thought it's OK
>> to leave the task on the old CPU
>
> AFAICT if a call to set_cpus_allowed_ptr() ends up in affine_move_task()
> and down to the stopper call, that means the task isn't allowed on its
> current CPU and needs to be moved.
>
>> and the Bill's testing scenario was successful
>> with the proposed patch. IIUC, it's exercising the hotplug due to load-balancing.
>>
>> This was on RHEL 8.8.z kernel. I see upstream changed a bit so I'm not sure it's
>> still reproducible. Also, I'm not sure why this only happens on s390 and not on
>> x86. I imagine the CPU hotplug slightly differs? Anyways this seems to be timing
>> sensitive and the timing will differ greatly for sure.
>>
>
> Thanks for the extra context!
>
> Double checking what I wrote before, I forgot RCU considers preempt-off
> sections as read-side critical sections. __set_cpus_allowed_ptr() already
> has preemption disabled all the way from reading the cpu_active_mask to the
> stop_one_cpu_nowait() call via task_rq_lock() + preempt_disable().
>
> IOW we have:
>
> __set_cpus_allowed_ptr()
>   task_rq_lock() <-- PREEMPT OFF
>   __set_cpus_allowed_ptr_locked()
>     cpu_valid_mask = cpu_active_mask;
>     dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
>     affine_move_task()
>       preempt_disable();
>       task_rq_unlock();
>       stop_one_cpu_nowait(); <-- preemption still OFF
>
> And, considering:
>
> sched_cpu_deactivate()
>   set_cpu_active(cpu, false);
>   synchronize_rcu();
>
> Then, if __set_cpus_allowed_ptr() observes a CPU as being in the
> cpu_active_mask and uses that one as a destination CPU, said CPU cannot
> reach CPUHP_TEARDOWN_CPU:take_cpu_down() and park the stopper thread
> because its hotplug machinery will wait on the synchronize_rcu() in
> CPUHP_AP_ACTIVE:sched_cpu_deactivate().
>
> So "in theory", this shouldn't happen upstream.

Eh nevermind, in your stacktrace the relevant task is a rescuer thread
which is a kthread, so the cpu_valid_mask in use there is
cpu_online_mask, not cpu_valid_mask... Back to reading code :-)