lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 07 Jul 2017 00:47:29 -0700
From:   Vikram Mulukutla <markivx@...eaurora.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Rusty Russell <rusty@...tcorp.com.au>, Tejun Heo <tj@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Sebastian Sewior <bigeasy@...utronix.de>,
        linux-kernel-owner@...r.kernel.org
Subject: Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the
 control CPU

On 2017-07-04 13:20, Thomas Gleixner wrote:
> Vikram reported the following backtrace:
> 
>    BUG: scheduling while atomic: swapper/7/0/0x00000002
>    CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
>    schedule
>    schedule_hrtimeout_range_clock
>    schedule_hrtimeout
>    wait_task_inactive
>    __kthread_bind_mask
>    __kthread_bind
>    __kthread_unpark
>    kthread_unpark
>    cpuhp_online_idle
>    cpu_startup_entry
>    secondary_start_kernel
> 
> He analyzed correctly that a parked cpu hotplug thread of an offlined 
> CPU
> was still on the runqueue when the CPU came back online and tried to 
> unpark
> it. This causes the thread which invoked kthread_unpark() to call
> wait_task_inactive() and subsequently schedule() with preemption 
> disabled.
> His proposed workaround was to "make sure" that a parked thread has
> scheduled out when the CPU goes offline, so the situation cannot 
> happen.
> 
> But that's still wrong because the root cause is not the fact that the
> percpu thread is still on the runqueue and neither that preemption is
> disabled, which could be simply solved by enabling preemption before
> calling kthread_unpark().
> 
> The real issue is that the calling thread is the idle task of the 
> upcoming
> CPU, which is not supposed to call anything which might sleep.  The 
> moron,
> who wrote that code, missed completely that kthread_unpark() might end 
> up
> in schedule().
> 

Agreed. Presumably we are still OK with the cpu hotplug thread being
migrated off to random CPUs and its unfinished kthread_parkme racing 
with
a subsequent unpark? The cpuhp/X thread ends up on a random rq if it 
can't
do schedule() in time because migration/X yanks it off of the dying CPU 
X.
Apart from slightly disconcerting traces showing cpuhp/X momentarily 
executing
on CPU Y, there's no problem that I can see of course.

Probably worth mentioning that the following patch from Junaid Shahid 
seems
to indicate that such a race doesn't always work out as desired:
https://lkml.org/lkml/2017/4/28/755

> The solution is simpler than expected. The thread which controls the
> hotplug operation is waiting for the CPU to call complete() on the 
> hotplug
> state completion. So the idle task of the upcoming CPU can set its 
> state to
> CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the 
> control
> task on a different CPU, which then can safely do the unpark and kick 
> the
> now unparked hotplug thread of the upcoming CPU to complete the bringup 
> to
> the final target state.
> 
> Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully 
> up")
> Reported-by: Vikram Mulukutla <markivx@...eaurora.org>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Sebastian Siewior <bigeasy@...utronix.de>
> Cc: rusty@...tcorp.com.au
> Cc: tj@...nel.org
> Cc: akpm@...ux-foundation.org
> Cc: stable@...r.kernel.org
> 
> ---
>  kernel/cpu.c |   30 ++++++++++++++++--------------
>  1 file changed, 16 insertions(+), 14 deletions(-)
> 
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -271,11 +271,25 @@ void cpu_hotplug_enable(void)
>  EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
>  #endif	/* CONFIG_HOTPLUG_CPU */
> 
> +static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st);
> +
>  static int bringup_wait_for_ap(unsigned int cpu)
>  {
>  	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> 
> +	/* Wait for the CPU to reach IDLE_ONLINE */
>  	wait_for_completion(&st->done);
> +	BUG_ON(!cpu_online(cpu));
> +
> +	/* Unpark the stopper thread and the hotplug thread of the target cpu 
> */
> +	stop_machine_unpark(cpu);
> +	kthread_unpark(st->thread);
> +
> +	/* Should we go further up ? */
> +	if (st->target > CPUHP_AP_ONLINE_IDLE) {
> +		__cpuhp_kick_ap_work(st);
> +		wait_for_completion(&st->done);
> +	}
>  	return st->result;
>  }
> 
> @@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu)
>  	irq_unlock_sparse();
>  	if (ret)
>  		return ret;
> -	ret = bringup_wait_for_ap(cpu);
> -	BUG_ON(!cpu_online(cpu));
> -	return ret;
> +	return bringup_wait_for_ap(cpu);
>  }
> 
>  /*
> @@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cp
>  void cpuhp_online_idle(enum cpuhp_state state)
>  {
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> -	unsigned int cpu = smp_processor_id();
> 
>  	/* Happens for the boot cpu */
>  	if (state != CPUHP_AP_ONLINE_IDLE)
>  		return;
> 
>  	st->state = CPUHP_AP_ONLINE_IDLE;
> -
> -	/* Unpark the stopper thread and the hotplug thread of this cpu */
> -	stop_machine_unpark(cpu);
> -	kthread_unpark(st->thread);
> -
> -	/* Should we go further up ? */
> -	if (st->target > CPUHP_AP_ONLINE_IDLE)
> -		__cpuhp_kick_ap_work(st);
> -	else
> -		complete(&st->done);
> +	complete(&st->done);
>  }
> 
>  /* Requires cpu_add_remove_lock to be held */

Nice and clean otherwise. Channagoud was instrumental in collecting
data, theorizing with me and testing your fix, so if the concern I've
raised above doesn't matter, please add:

Tested-by: Channagoud Kadabi <ckadabi@...eaurora.org>

Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ