lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 1 May 2015 14:03:48 +0800
From:	Ethan Zhao <ethan.kernel@...il.com>
To:	Borislav Petkov <bp@...en8.de>
Cc:	Ethan Zhao <ethan.zhao@...cle.com>, tony.luck@...el.com,
	Thomas Gleixner <tglx@...utronix.de>, mingo@...hat.com,
	hpa@...or.com, x86@...nel.org, linux-edac@...r.kernel.org,
	LKML <linux-kernel@...r.kernel.org>, tim.uglow@...cle.com
Subject: Re: [PATCH] x86/mce: fix mce_restart() race with CPU hotplug operation

On Fri, May 1, 2015 at 12:29 AM, Borislav Petkov <bp@...en8.de> wrote:
> On Thu, Apr 30, 2015 at 12:04:53AM +0900, Ethan Zhao wrote:
>> while testing CPU hotplug and MCE with following two scripts,
>>
>> script 1:
>>
>>  for i in {1..30}; do while :; do  ((a=$RANDOM%160)); echo 0  >>
>>  /sys/devices/system/cpu/cpu${i}/online; echo 1 >>
>>  /sys/devices/system/cpu/cpu${i}/online; done & done
>>
>> script 2:
>>
>>  while :; do for i in $(ls
>>  /sys/devices/system/machinecheck/machinecheck*/check_interval); do echo 1  >>
>>  $i; done; done
>
> For the record, it is a public secret that CPU hotplug is broken. IOW,
> you're wasting your time with those senseless pounder tests but ok.

  :<,  Someone else is stressing the CPU hotplug, seems it is fragile.
My job is holding the system, not panic to the ground.

 Thanks,
 Ethan

>
> ...
>
>> ---
>>  arch/x86/kernel/cpu/mcheck/mce.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
>> index 3c036cb..fcc2794 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
>> @@ -1338,8 +1338,10 @@ static void mce_timer_delete_all(void)
>>  {
>>       int cpu;
>>
>> +     get_online_cpus();
>>       for_each_online_cpu(cpu)
>>               del_timer_sync(&per_cpu(mce_timer, cpu));
>> +     put_online_cpus();
>>  }
>>
>>  static void mce_do_trigger(struct work_struct *work)
>> @@ -2085,7 +2087,9 @@ static void mce_cpu_restart(void *data)
>>  static void mce_restart(void)
>>  {
>>       mce_timer_delete_all();
>> +     get_online_cpus();
>>       on_each_cpu(mce_cpu_restart, NULL, 1);
>> +     put_online_cpus();
>
> With your patch applied I get on 4.1-rc1+:
>
> ---
> [   41.364909] kvm: disabling virtualization on CPU1
> [   41.371083] smpboot: CPU 1 is now offline
> [   41.381190] x86: Booting SMP configuration:
> [   41.385405] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   41.402901] kvm: enabling virtualization on CPU1
> [   41.440944] kvm: disabling virtualization on CPU1
> [   41.447010] smpboot: CPU 1 is now offline
> [   41.486082] kvm: disabling virtualization on CPU6
> [   41.491827] smpboot: CPU 6 is now offline
> [   41.497521] smpboot: Booting Node 0 Processor 6 APIC 0x5
> [   41.514983] kvm: enabling virtualization on CPU6
> [   41.561643] kvm: disabling virtualization on CPU6
> [   41.566848] smpboot: CPU 6 is now offline
> [   41.572606] smpboot: Booting Node 0 Processor 6 APIC 0x5
> [   41.590049] kvm: enabling virtualization on CPU6
> [   41.636817] kvm: disabling virtualization on CPU6
> [   41.642575] smpboot: CPU 6 is now offline
> [   41.676812] kvm: disabling virtualization on CPU7
> [   41.682429] smpboot: CPU 7 is now offline
> [   41.687974] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   41.705416] kvm: enabling virtualization on CPU7
> [   41.752739] kvm: disabling virtualization on CPU7
> [   41.758455] smpboot: CPU 7 is now offline
> [   41.764089] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   41.781561] kvm: enabling virtualization on CPU7
> [   41.831610] kvm: disabling virtualization on CPU7
> [   41.837280] smpboot: CPU 7 is now offline
>
> [   41.843341] ======================================================
> [   41.849561] [ INFO: possible circular locking dependency detected ]
> [   41.855883] 4.1.0-rc1+ #2 Not tainted
> [   41.859564] -------------------------------------------------------
> [   41.865871] script2.sh/2071 is trying to acquire lock:
> [   41.871044]  (cpu_hotplug.lock){++++++}, at: [<ffffffff81066c02>] get_online_cpus+0x32/0x80
> [   41.879521]
> but task is already holding lock:
> [   41.885392]  (s_active#121){++++.+}, at: [<ffffffff812944de>] kernfs_fop_write+0x6e/0x1a0
> [   41.893695]
> which lock already depends on the new lock.
>
> [   41.901925]
> the existing dependency chain (in reverse order) is:
> [   41.909465]
> -> #2 (s_active#121){++++.+}:
> [   41.913739]        [<ffffffff810c6e11>] lock_acquire+0xd1/0x2b0
> [   41.919718]        [<ffffffff81292088>] __kernfs_remove+0x228/0x300
> [   41.926046]        [<ffffffff81293269>] kernfs_remove_by_name_ns+0x49/0xb0
> [   41.932976]        [<ffffffff81294de5>] sysfs_remove_file_ns+0x15/0x20
> [   41.939552]        [<ffffffff81600939>] device_remove_file+0x19/0x20
> [   41.945968]        [<ffffffff810308b4>] mce_device_remove+0x54/0xd0
> [   41.952284]        [<ffffffff81030b89>] mce_cpu_callback+0x69/0x120
> [   41.958608]        [<ffffffff8108fb46>] notifier_call_chain+0x66/0x90
> [   41.965124]        [<ffffffff8108fb7e>] __raw_notifier_call_chain+0xe/0x10
> [   41.972053]        [<ffffffff81066d33>] cpu_notify+0x23/0x50
> [   41.977761]        [<ffffffff81066dfe>] cpu_notify_nofail+0xe/0x20
> [   41.983986]        [<ffffffff818961d6>] _cpu_down+0x1b6/0x2d0
> [   41.989787]        [<ffffffff81896326>] cpu_down+0x36/0x50
> [   41.995324]        [<ffffffff816087d4>] cpu_subsys_offline+0x14/0x20
> [   42.001734]        [<ffffffff816030c5>] device_offline+0x95/0xc0
> [   42.007797]        [<ffffffff816031cd>] online_store+0x3d/0x90
> [   42.013673]        [<ffffffff81600608>] dev_attr_store+0x18/0x30
> [   42.019735]        [<ffffffff81294ca9>] sysfs_kf_write+0x49/0x60
> [   42.025796]        [<ffffffff812945b0>] kernfs_fop_write+0x140/0x1a0
> [   42.032211]        [<ffffffff8120d878>] __vfs_write+0x28/0xf0
> [   42.038013]        [<ffffffff8120df89>] vfs_write+0xa9/0x1b0
> [   42.043715]        [<ffffffff8120ed79>] SyS_write+0x49/0xb0
> [   42.049334]        [<ffffffff818aacf2>] system_call_fastpath+0x16/0x7a
> [   42.055926]
> -> #1 (cpu_hotplug.lock#2){+.+.+.}:
> [   42.060705]        [<ffffffff810c6e11>] lock_acquire+0xd1/0x2b0
> [   42.066674]        [<ffffffff818a5771>] mutex_lock_nested+0x61/0x460
> [   42.073087]        [<ffffffff81066e8b>] cpu_hotplug_begin+0x7b/0xd0
> [   42.079400]        [<ffffffff81066f4b>] _cpu_up+0x2b/0x160
> [   42.084927]        [<ffffffff8106710a>] cpu_up+0x8a/0xb0
> [   42.090285]        [<ffffffff81fac845>] smp_init+0x86/0x88
> [   42.095818]        [<ffffffff81f8e113>] kernel_init_freeable+0x160/0x27d
> [   42.102577]        [<ffffffff8189540e>] kernel_init+0xe/0xf0
> [   42.108455]        [<ffffffff818ab162>] ret_from_fork+0x42/0x70
> [   42.114586]
> -> #0 (cpu_hotplug.lock){++++++}:
> [   42.119497]        [<ffffffff810c60c6>] __lock_acquire+0x1db6/0x2040
> [   42.126062]        [<ffffffff810c6e11>] lock_acquire+0xd1/0x2b0
> [   42.132182]        [<ffffffff81066c28>] get_online_cpus+0x58/0x80
> [   42.138476]        [<ffffffff8102fc23>] mce_timer_delete_all+0x23/0x70
> [   42.145215]        [<ffffffff8102fd0e>] mce_restart+0xe/0x30
> [   42.151071]        [<ffffffff8102fddb>] store_int_with_restart+0x1b/0x30
> [   42.157975]        [<ffffffff81600608>] dev_attr_store+0x18/0x30
> [   42.164179]        [<ffffffff81294ca9>] sysfs_kf_write+0x49/0x60
> [   42.170383]        [<ffffffff812945b0>] kernfs_fop_write+0x140/0x1a0
> [   42.176938]        [<ffffffff8120d878>] __vfs_write+0x28/0xf0
> [   42.182879]        [<ffffffff8120df89>] vfs_write+0xa9/0x1b0
> [   42.188733]        [<ffffffff8120ed79>] SyS_write+0x49/0xb0
> [   42.194498]        [<ffffffff818aacf2>] system_call_fastpath+0x16/0x7a
> [   42.201230]
> other info that might help us debug this:
>
> [   42.210268] Chain exists of:
>   cpu_hotplug.lock --> cpu_hotplug.lock#2 --> s_active#121
>
> [   42.219214]  Possible unsafe locking scenario:
>
> [   42.225875]        CPU0                    CPU1
> [   42.230578]        ----                    ----
> [   42.235283]   lock(s_active#121);
> [   42.238791]                                lock(cpu_hotplug.lock#2);
> [   42.245362]                                lock(s_active#121);
> [   42.251401]   lock(cpu_hotplug.lock);
> [   42.255250]
>  *** DEADLOCK ***
>
> [   42.262008] 3 locks held by script2.sh/2071:
> [   42.266446]  #0:  (sb_writers#6){.+.+.+}, at: [<ffffffff8120e043>] vfs_write+0x163/0x1b0
> [   42.274792]  #1:  (&of->mutex){+.+.+.}, at: [<ffffffff812944d6>] kernfs_fop_write+0x66/0x1a0
> [   42.283485]  #2:  (s_active#121){++++.+}, at: [<ffffffff812944de>] kernfs_fop_write+0x6e/0x1a0
> [   42.292353]
> stack backtrace:
> [   42.297084] CPU: 3 PID: 2071 Comm: script2.sh Not tainted 4.1.0-rc1+ #2
> [   42.303880] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
> [   42.311474]  ffffffff82847b60 ffff88043934fb68 ffffffff818a08e0 0000000000000011
> [   42.319092]  ffffffff828479b0 ffff88043934fbb8 ffffffff810c2a9d 000000000000aae0
> [   42.326712]  ffff88043934fc28 0000000000000002 ffff8800a82cc0b8 0000000000000002
> [   42.334332] Call Trace:
> [   42.336939]  [<ffffffff818a08e0>] dump_stack+0x4f/0x7b
> [   42.342251]  [<ffffffff810c2a9d>] print_circular_bug+0x1cd/0x230
> [   42.348436]  [<ffffffff810c60c6>] __lock_acquire+0x1db6/0x2040
> [   42.354455]  [<ffffffff810c34d9>] ? check_irq_usage+0x89/0xc0
> [   42.360385]  [<ffffffff810c6e11>] lock_acquire+0xd1/0x2b0
> [   42.365958]  [<ffffffff81066c02>] ? get_online_cpus+0x32/0x80
> [   42.371880]  [<ffffffff81066c28>] get_online_cpus+0x58/0x80
> [   42.377626]  [<ffffffff81066c02>] ? get_online_cpus+0x32/0x80
> [   42.383556]  [<ffffffff813f94ee>] ? simple_strtol+0xe/0x20
> [   42.389215]  [<ffffffff8102fc23>] mce_timer_delete_all+0x23/0x70
> [   42.395401]  [<ffffffff8102fd0e>] mce_restart+0xe/0x30
> [   42.400721]  [<ffffffff8102fddb>] store_int_with_restart+0x1b/0x30
> [   42.407080]  [<ffffffff81600608>] dev_attr_store+0x18/0x30
> [   42.412739]  [<ffffffff81294ca9>] sysfs_kf_write+0x49/0x60
> [   42.418396]  [<ffffffff812945b0>] kernfs_fop_write+0x140/0x1a0
> [   42.424405]  [<ffffffff8120d878>] __vfs_write+0x28/0xf0
> [   42.429803]  [<ffffffff8120e043>] ? vfs_write+0x163/0x1b0
> [   42.435385]  [<ffffffff81188b94>] ? context_tracking_exit+0x64/0x1c0
> [   42.441922]  [<ffffffff8120df89>] vfs_write+0xa9/0x1b0
> [   42.447234]  [<ffffffff8120ed79>] SyS_write+0x49/0xb0
> [   42.452459]  [<ffffffff818aacf2>] system_call_fastpath+0x16/0x7a
> [   42.459043] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   42.476980] kvm: enabling virtualization on CPU7
> [   42.525076] kvm: disabling virtualization on CPU7
> [   42.530297] smpboot: CPU 7 is now offline
> [   42.535071] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   42.552746] kvm: enabling virtualization on CPU7
> [   42.600174] kvm: disabling virtualization on CPU7
> [   42.605374] smpboot: CPU 7 is now offline
> [   42.610194] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   42.627827] kvm: enabling virtualization on CPU7
> [   42.681186] kvm: disabling virtualization on CPU7
> [   42.687093] smpboot: CPU 7 is now offline
> [   42.692417] smpboot: Booting Node 0 Processor 7 APIC 0x7
> [   42.710046] kvm: enabling virtualization on CPU7
> [   42.751031] kvm: disabling virtualization on CPU7
> [   42.756130] smpboot: CPU 7 is now offline
>
> --
> Regards/Gruss,
>     Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ