Date:   Tue, 31 Oct 2017 14:48:50 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Guenter Roeck <linux@...ck-us.net>
Cc:     Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
        Don Zickus <dzickus@...hat.com>, Ingo Molnar <mingo@...nel.org>
Subject: Re: Crashes in perf_event_ctx_lock_nested

On Mon, Oct 30, 2017 at 03:45:12PM -0700, Guenter Roeck wrote:
> I added some logging and a long msleep() in hardlockup_detector_perf_cleanup().
> Here is the result:
> 
> [    0.274361] NMI watchdog: ############ hardlockup_detector_perf_init
> [    0.274915] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    0.277049] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [    0.277593] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [    0.278027] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    1.312044] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
> [    1.385122] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [    1.386028] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [    1.466102] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [    1.475536] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [    1.535099] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [    1.535101] NMI watchdog: ############ hardlockup_detector_event_create(3)

> [    7.222816] NMI watchdog: ############ hardlockup_detector_perf_disable(0)
> [    7.230567] NMI watchdog: ############ hardlockup_detector_perf_disable(1)
> [    7.243138] NMI watchdog: ############ hardlockup_detector_perf_disable(2)
> [    7.250966] NMI watchdog: ############ hardlockup_detector_perf_disable(3)
> [    7.258826] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [    7.258827] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [    7.258831] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [    7.258833] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [    7.258834] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [    7.258835] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [    7.260169] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [    7.260170] NMI watchdog: ############ hardlockup_detector_event_create(3)
> [    7.494251] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [    8.287135] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
> 
> Looks like there are a number of problems: hardlockup_detector_event_create()
> creates the event data structure even if it has already been created, 

Right, that does look dodgy. And on its own it should be fairly
straightforward to cure. But I'd like to understand the rest of it first.
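
For the double create, something along these lines might do (untested
sketch; I'm going from memory of kernel/watchdog_hld.c, so treat the
names as assumptions rather than a patch):

static int hardlockup_detector_event_create(void)
{
        unsigned int cpu = smp_processor_id();
        struct perf_event_attr *wd_attr;
        struct perf_event *evt;

        /* Sketch: bail out if this CPU already has its watchdog event. */
        if (this_cpu_read(watchdog_ev))
                return 0;

        wd_attr = &wd_hw_attr;
        wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);

        evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
                                               watchdog_overflow_callback,
                                               NULL);
        if (IS_ERR(evt))
                return PTR_ERR(evt);

        this_cpu_write(watchdog_ev, evt);
        return 0;
}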

> and hardlockup_detector_perf_cleanup() runs unprotected and in
> parallel to the enable/create functions.

Well, looking at the code, cpu_maps_update_begin(), a.k.a.
cpu_add_remove_lock, serializes cpu_up() and cpu_down(), and _should_
thereby also serialize cleanup vs the smp_hotplug_thread operations.
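
Roughly, the ordering I'd expect is (locking sketch only, not code
copied from any particular tree):

  cpu_up() / cpu_down()
    cpu_maps_update_begin();     /* mutex_lock(&cpu_add_remove_lock) */
    ...
    /* hotplug callbacks run from in here, including the watchdog
     * enable/disable and the smp_hotplug_thread park/unpark */
    ...
    cpu_maps_update_done();      /* mutex_unlock(&cpu_add_remove_lock) */

so anything that only ever runs inside that section should not be able
to overlap with the enable/create callbacks in your trace.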

Your trace does indeed indicate this is not the case, but I cannot, from
the code, see how this could happen.

Could you use trace_printk() instead and boot with
"trace_options=stacktrace"?

> Also, the following message is seen twice.
> 
> [    0.278758] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [    7.258838] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> 
> I don't offer a proposed patch since I have no idea how to best solve the
> problem.
> 
> Also, is the repeated enable/disable/cleanup as part of the normal boot
> really necessary?

That's weird; I don't see that on my machines. We very much only bring
up the CPUs _once_. Also note they're 7 seconds apart. Did you do
something funny like resume-from-disk or so?
