linux-kernel - Re: [BUG] Kernel splat when taking CPUs offline

Message-ID: <1625417.XZkzNdoaJA@vostro.rjw.lan>
Date:	Thu, 09 Jul 2015 02:13:45 +0200
From:	"Rafael J. Wysocki" <rjw@...ysocki.net>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Saravana Kannan <skannan@...eaurora.org>,
	Linux PM list <linux-pm@...r.kernel.org>,
	ACPI Devel Maling List <linux-acpi@...r.kernel.org>
Subject: Re: [BUG] Kernel splat when taking CPUs offline

On Wednesday, July 08, 2015 03:24:56 PM Steven Rostedt wrote:
> 
> My tests for ftrace includes testing the mmiotracer, which to run
> requires taking all CPUs offline but one of them. This test crashed
> every so often, and I was able to bisect down to this commit:
> 
> commit 87549141d516 ("cpufreq: Stop migrating sysfs files on hotplug")

Thanks for the report, adding linux-pm and linux-acpi to the CC.


> Just to make sure this wasn't just the mmiotracer causing the issue, I
> was able to trigger this same bug by simply doing the following:
> 
> 
> (on a 4 cpu machine)
> 
> 
>  # echo 0 > /sys/devices/system/cpu/cpu1/online 
>  # echo 0 > /sys/devices/system/cpu/cpu2/online 
>  # echo 0 > /sys/devices/system/cpu/cpu3/online 
>  # echo 1 > /sys/devices/system/cpu/cpu1/online 
>  # echo 1 > /sys/devices/system/cpu/cpu2/online 
>  # echo 1 > /sys/devices/system/cpu/cpu3/online 
>  # echo 0 > /sys/devices/system/cpu/cpu1/online 
>  # echo 0 > /sys/devices/system/cpu/cpu2/online 
>  # echo 0 > /sys/devices/system/cpu/cpu2/online 
>  # echo 0 > /sys/devices/system/cpu/cpu3/online 
>  # echo 1 > /sys/devices/system/cpu/cpu1/online 
>  # echo 1 > /sys/devices/system/cpu/cpu2/online 
>  # echo 1 > /sys/devices/system/cpu/cpu3/online 
> 
> It usually takes two or three tries (shutting down all but one CPU, and
> starting them again) before it triggers.
> 
> Here's the splat:
> 
> Initializing CPU#1
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1609 at /home/rostedt/work/git/linux-trace.git/drivers/cpufreq/cpufreq.c:2350 cpufreq_update_policy+0xc8/0x139()

So the cpufreq driver's ->get() callback returns 0 for the given CPU, and
that's what triggers the WARN_ON().  It most likely returns 0 because
its internal data structure for that CPU is not present.

I *guess* that before the above commit, policy was NULL in cpufreq_update_policy(),
so we didn't get to the point where ->get() was called.
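
To make the suspected sequence concrete, here is a small user-space model of
it.  It is only a sketch under stated assumptions, not the kernel code: the
names model_get(), model_update_policy(), drv_data[] and policies[] are
hypothetical, the error codes are chosen for illustration, and the warning is
reduced to a counter plus a message on stderr.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define NR_CPUS 4

/* Hypothetical stand-in for a driver's per-CPU data; NULL means "not present". */
struct drv_cpu_data {
	unsigned int cur_khz;
};

/* Hypothetical, heavily reduced policy object. */
struct policy {
	unsigned int cpu;
	unsigned int cur;
};

static struct drv_cpu_data *drv_data[NR_CPUS];
static struct policy *policies[NR_CPUS];
static int warn_count;	/* how many times the modeled warning fired */

/* Models a ->get() callback that reports 0 when the driver has no data for the CPU. */
static unsigned int model_get(unsigned int cpu)
{
	if (cpu >= NR_CPUS || !drv_data[cpu])
		return 0;
	return drv_data[cpu]->cur_khz;
}

/* Models the update path: bail out early without a policy, warn on a zero frequency. */
static int model_update_policy(unsigned int cpu)
{
	struct policy *policy = cpu < NR_CPUS ? policies[cpu] : NULL;
	unsigned int cur;

	if (!policy)
		return -ENODEV;	/* assumed old behavior: ->get() is never reached */

	cur = model_get(cpu);
	if (!cur) {
		/* The policy exists but the driver data does not: the warning case. */
		warn_count++;
		fprintf(stderr, "WARNING: ->get() returned 0 for CPU %u\n", cpu);
		return -EIO;
	}

	policy->cur = cur;
	return 0;
}
```

In this model, a CPU with no policy returns early and never warns, a CPU with
a policy but no driver data trips the warning, and a CPU with both updates
policy->cur normally.  If the guess above is right, the commit moved the
failing case from the first category into the second.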

There seem to be a couple of ways to address that, but I'd like Viresh to have
a look at this too.


> Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 ppdev parport_pc r8169 parport microcode
> CPU: 0 PID: 1609 Comm: bash Tainted: G        W       4.2.0-rc1-test #26
> Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
>  00000000 00000000 ee47db9c c0cd04e6 c10d4463 ee47dbcc c0440fbe c1010460
>  00000000 00000649 c10d4463 0000092e c0a6dd28 c0a6dd28 f13fd600 00000000
>  ee47dda8 ee47dbdc c0440ff7 00000009 00000000 ee47ddb8 c0a6dd28 efb01bc0
> Call Trace:
>  [<c0cd04e6>] dump_stack+0x41/0x52
>  [<c0440fbe>] warn_slowpath_common+0x9d/0xb4
>  [<c0a6dd28>] ? cpufreq_update_policy+0xc8/0x139
>  [<c0a6dd28>] ? cpufreq_update_policy+0xc8/0x139
>  [<c0440ff7>] warn_slowpath_null+0x22/0x24
>  [<c0a6dd28>] cpufreq_update_policy+0xc8/0x139
>  [<c0a6dd99>] ? cpufreq_update_policy+0x139/0x139
>  [<c0a6dc9b>] ? cpufreq_update_policy+0x3b/0x139
>  [<c0a6bef7>] ? cpufreq_freq_transition_begin+0x97/0xd9
>  [<c046ea90>] ? __wake_up+0x1a/0x47
>  [<c0772682>] acpi_processor_ppc_has_changed+0x54/0x5d
>  [<c076f6b9>] acpi_cpu_soft_notify+0xb0/0xf1
>  [<c06d2859>] ? compute_batch_value+0xd/0x22
>  [<c06d2a38>] ? percpu_counter_hotcpu_callback+0x11/0x80
>  [<c0458c35>] notifier_call_chain+0x68/0x91
>  [<c047007b>] ? sched_debug_header+0x15c/0x58e
>  [<c0458c7c>] __raw_notifier_call_chain+0x1e/0x23
>  [<c04410c2>] __cpu_notify+0x24/0x39
>  [<c04414d9>] _cpu_up+0xef/0x105
>  [<c044153d>] cpu_up+0x4e/0x5f
>  [<c0ccb642>] cpu_subsys_online+0x13/0x15
>  [<c09134b4>] device_online+0x45/0x6e
>  [<c091350f>] online_store+0x32/0x4f
>  [<c09134dd>] ? device_online+0x6e/0x6e
>  [<c0911570>] dev_attr_store+0x24/0x29
>  [<c0587f31>] sysfs_kf_write+0x3a/0x41
>  [<c0587ef7>] ? sysfs_file_ops+0x48/0x48
>  [<c0587244>] kernfs_fop_write+0xe2/0x11f
>  [<c0587162>] ? kernfs_vma_page_mkwrite+0x6c/0x6c
>  [<c0532e3a>] __vfs_write+0x24/0x9b
>  [<c0532d25>] ? file_start_write+0x27/0x29
>  [<c0533355>] ? rw_verify_area+0xce/0xef
>  [<c0533843>] vfs_write+0x7a/0xc4
>  [<c0533a09>] SyS_write+0x54/0x7f
>  [<c0cdae58>] sysenter_do_call+0x12/0x12
> ---[ end trace e2c32eead4f4e541 ]---
> 
> I'll dig more into it, but wanted to give people a heads up.

