linux-kernel - Re: + generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <4AAA0001.2060703@cn.fujitsu.com>
Date:	Fri, 11 Sep 2009 15:45:05 +0800
From:	Xiao Guangrong <xiaoguangrong@...fujitsu.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	akpm@...ux-foundation.org, mm-commits@...r.kernel.org,
	jens.axboe@...cle.com, mingo@...e.hu, nickpiggin@...oo.com.au,
	rusty@...tcorp.com.au, suresh.b.siddha@...el.com,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: + generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd.patch
 added to -mm tree



Peter Zijlstra wrote:
> On Thu, 2009-07-30 at 17:30 -0700, akpm@...ux-foundation.org wrote:
> 
>> ------------------------------------------------------
>> Subject: generic-ipi: fix the race between generic_smp_call_function_*() and hotplug_cfd()
>> From: Xiao Guangrong <xiaoguangrong@...fujitsu.com>
>>
>> There is a race between generic_smp_call_function_*() and hotplug_cfd() in
>> many cases, see below examples:
>>
>> 1: hotplug_cfd() can free cfd->cpumask, the system will crash if the
>>    cpu's cfd still in the call_function list:
>>
>>
>>       CPU A:                         CPU B
>>
>>  smp_call_function_many()	    ......
>>    cpu_down()                      ......
>>   hotplug_cfd() ->                 ......
>>  free_cpumask_var(cfd->cpumask)  (receive function IPI interrupte)
>>                                 /* read cfd->cpumask */
>>                           generic_smp_call_function_interrupt() ->
>>                          cpumask_test_and_clear_cpu(cpu, data->cpumask)
>>
>>                          	CRASH!!!
>>
>> 2: It's not handle call_function list when cpu down, It's will lead to
>>    dead-wait if other path is waiting this cpu to execute function
>>
>>     CPU A:                           CPU B
>>
>>  smp_call_function_many(wait=0)
>>         ......			    CPU B down
>>    smp_call_function_many() -->  (cpu down before recevie function
>>     csd_lock(&data->csd);         IPI interrupte)
>>
>>     DEAD-WAIT!!!!
>>
>>   So, CPU A will dead-wait in csd_lock(), the same as
>>   smp_call_function_single()
> 
> On re-reading this, I'm wondering if 2 is a real case.
> 
> I'm thinking it should never happen since you're supposed to do things
> like get_online_cpus() around stuff like this, but then again, I doubt
> we actually do.
> 

I think that get_online_cpus() and stop_machine() can't avoid this
race,

1: For the first example in my patch's changlog:

      CPU A:                         CPU B

 smp_call_function_many(wait=0)	    ......
   cpu_down()                      ......
  hotplug_cfd() ->                 ......
 free_cpumask_var(cfd->cpumask)  (receive function IPI interrupte)
                                /* read cfd->cpumask */
                          generic_smp_call_function_interrupt() ->
                         cpumask_test_and_clear_cpu(cpu, data->cpumask)

                         	CRASH!!!

CPU A call smp_call_function_many(wait=0) that want CPU B to call
a specific function, after smp_call_function_many() return, we let
CPU A offline immediately. Unfortunately, if CPU B receives this
IPI interrupt after CPU A down, it will crash like above description.

2: For the second example in my patch's changlog:

If CPU B is dying, like below:

_cpu_down()
{
	......
	
	/* We suppose that have below sequences:
	 * before call __stop_machine(), CPU B is online (in cpu_online_mask),
	 * in this time, CPU A call smp_call_function_many(wait=0) and want
	 * CPU B to call a specific function, after CPU A finish it, CPU B
	 * go to __stop_machine() and disable it's interrupt
	 * (suppose CPU B not receive IPI interrupt in this time now)
	 */
	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
	......
}

Now, CPU B is down, but it's not handle CPU A's request, it cause that
can't clean the CSD_FLAG_LOCK flag of CPU A's cfd_data, if CPU A
call smp_call_function_many() next time. it will block in 
csd_lock() -> csd_lock_wait(data) forever.

>> Signed-off-by: Xiao Guangrong <xiaoguangrong@...fujitsu.com>
>> Cc: Ingo Molnar <mingo@...e.hu>
>> Cc: Jens Axboe <jens.axboe@...cle.com>
>> Cc: Nick Piggin <nickpiggin@...oo.com.au>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Rusty Russell <rusty@...tcorp.com.au>
>> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
>> ---
>>
>>  kernel/smp.c |   38 ++++++++++++++++++++++++++++----------
>>  1 file changed, 28 insertions(+), 10 deletions(-)
>>
>> diff -puN kernel/smp.c~generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd kernel/smp.c
>> --- a/kernel/smp.c~generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd
>> +++ a/kernel/smp.c
>> @@ -113,14 +113,10 @@ void generic_exec_single(int cpu, struct
>>  		csd_lock_wait(data);
>>  }
>>  
>> -/*
>> - * Invoked by arch to handle an IPI for call function. Must be called with
>> - * interrupts disabled.
>> - */
>> -void generic_smp_call_function_interrupt(void)
>> +static void
>> +__generic_smp_call_function_interrupt(int cpu, int run_callbacks)
>>  {
>>  	struct call_function_data *data;
>> -	int cpu = smp_processor_id();
>>  
>>  	/*
>>  	 * Ensure entry is visible on call_function_queue after we have
> 
> Also, if this is the last version, we're still not using run_callbacks
> for anything..
> 

It's not the last version and fixed in another patch, see below URL please:
http://marc.info/?l=linux-mm-commits&m=124900028228350&w=2

Thanks,
Xiao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/