Date:   Fri, 07 Jul 2017 00:53:16 -0700
From:   Vikram Mulukutla <markivx@...eaurora.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Rusty Russell <rusty@...tcorp.com.au>, Tejun Heo <tj@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Sebastian Sewior <bigeasy@...utronix.de>,
        linux-kernel-owner@...r.kernel.org
Subject: Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the
 control CPU

On 2017-07-07 00:47, Vikram Mulukutla wrote:
> On 2017-07-04 13:20, Thomas Gleixner wrote:
>> Vikram reported the following backtrace:
>> 
>>    BUG: scheduling while atomic: swapper/7/0/0x00000002
>>    CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
>>    schedule
>>    schedule_hrtimeout_range_clock
>>    schedule_hrtimeout
>>    wait_task_inactive
>>    __kthread_bind_mask
>>    __kthread_bind
>>    __kthread_unpark
>>    kthread_unpark
>>    cpuhp_online_idle
>>    cpu_startup_entry
>>    secondary_start_kernel
>> 
>> He analyzed correctly that a parked cpu hotplug thread of an offlined CPU
>> was still on the runqueue when the CPU came back online and tried to unpark
>> it. This causes the thread which invoked kthread_unpark() to call
>> wait_task_inactive() and subsequently schedule() with preemption disabled.
>> His proposed workaround was to "make sure" that a parked thread has
>> scheduled out when the CPU goes offline, so the situation cannot happen.
>> 
>> But that's still wrong because the root cause is not the fact that the
>> percpu thread is still on the runqueue and neither that preemption is
>> disabled, which could be simply solved by enabling preemption before
>> calling kthread_unpark().
>> 
>> The real issue is that the calling thread is the idle task of the upcoming
>> CPU, which is not supposed to call anything which might sleep. The moron,
>> who wrote that code, missed completely that kthread_unpark() might end up
>> in schedule().
>> 
> 
> Agreed. Presumably we are still OK with the cpu hotplug thread being
> migrated off to random CPUs and its unfinished kthread_parkme racing with
> a subsequent unpark? The cpuhp/X thread ends up on a random rq if it can't
> do schedule() in time because migration/X yanks it off of the dying CPU X.
> Apart from slightly disconcerting traces showing cpuhp/X momentarily
> executing on CPU Y, there's no problem that I can see, of course.
> 
> Probably worth mentioning that the following patch from Junaid Shahid seems
> to indicate that such a race doesn't always work out as desired:
> https://lkml.org/lkml/2017/4/28/755

Ah, Junaid's problem/patch wouldn't be relevant in the hotplug case
because of the completion, I think.

> 
>> The solution is simpler than expected. The thread which controls the
>> hotplug operation is waiting for the CPU to call complete() on the hotplug
>> state completion. So the idle task of the upcoming CPU can set its state to
>> CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the control
>> task on a different CPU, which then can safely do the unpark and kick the
>> now unparked hotplug thread of the upcoming CPU to complete the bringup to
>> the final target state.
>> 
>> Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully up")
>> Reported-by: Vikram Mulukutla <markivx@...eaurora.org>
>> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Sebastian Siewior <bigeasy@...utronix.de>
>> Cc: rusty@...tcorp.com.au
>> Cc: tj@...nel.org
>> Cc: akpm@...ux-foundation.org
>> Cc: stable@...r.kernel.org
>> 
>> ---
>>  kernel/cpu.c |   30 ++++++++++++++++--------------
>>  1 file changed, 16 insertions(+), 14 deletions(-)
>> 
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -271,11 +271,25 @@ void cpu_hotplug_enable(void)
>>  EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
>>  #endif	/* CONFIG_HOTPLUG_CPU */
>> 
>> +static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st);
>> +
>>  static int bringup_wait_for_ap(unsigned int cpu)
>>  {
>>  	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>> 
>> +	/* Wait for the CPU to reach CPUHP_AP_ONLINE_IDLE */
>>  	wait_for_completion(&st->done);
>> +	BUG_ON(!cpu_online(cpu));
>> +
>> +	/* Unpark the stopper thread and the hotplug thread of the target cpu */
>> +	stop_machine_unpark(cpu);
>> +	kthread_unpark(st->thread);
>> +
>> +	/* Should we go further up ? */
>> +	if (st->target > CPUHP_AP_ONLINE_IDLE) {
>> +		__cpuhp_kick_ap_work(st);
>> +		wait_for_completion(&st->done);
>> +	}
>>  	return st->result;
>>  }
>> 
>> @@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu)
>>  	irq_unlock_sparse();
>>  	if (ret)
>>  		return ret;
>> -	ret = bringup_wait_for_ap(cpu);
>> -	BUG_ON(!cpu_online(cpu));
>> -	return ret;
>> +	return bringup_wait_for_ap(cpu);
>>  }
>> 
>>  /*
>> @@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cp
>>  void cpuhp_online_idle(enum cpuhp_state state)
>>  {
>>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>> -	unsigned int cpu = smp_processor_id();
>> 
>>  	/* Happens for the boot cpu */
>>  	if (state != CPUHP_AP_ONLINE_IDLE)
>>  		return;
>> 
>>  	st->state = CPUHP_AP_ONLINE_IDLE;
>> -
>> -	/* Unpark the stopper thread and the hotplug thread of this cpu */
>> -	stop_machine_unpark(cpu);
>> -	kthread_unpark(st->thread);
>> -
>> -	/* Should we go further up ? */
>> -	if (st->target > CPUHP_AP_ONLINE_IDLE)
>> -		__cpuhp_kick_ap_work(st);
>> -	else
>> -		complete(&st->done);
>> +	complete(&st->done);
>>  }
>> 
>>  /* Requires cpu_add_remove_lock to be held */
> 
> Nice and clean otherwise. Channagoud was instrumental in collecting
> data, theorizing with me and testing your fix, so if the concern I've
> raised above doesn't matter, please add:
> 
> Tested-by: Channagoud Kadabi <ckadabi@...eaurora.org>
> 
> Thanks,
> Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
