Message-ID: <b71515e4-484e-d80a-37db-2e51abe69928@linux.intel.com>
Date: Tue, 3 Mar 2020 20:40:10 -0500
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, linux-kernel@...r.kernel.org, irogers@...gle.com,
eranian@...gle.com, ak@...ux.intel.com
Subject: Re: [PATCH] perf/core: Fix endless multiplex timer
On 3/3/2020 4:08 PM, Peter Zijlstra wrote:
> On Tue, Mar 03, 2020 at 12:28:19PM -0800, kan.liang@...ux.intel.com wrote:
>> From: Kan Liang <kan.liang@...ux.intel.com>
>>
>> A lot of time is spent writing uncore MSRs even though no perf is
>> running.
>>
>> 4.66% swapper [kernel.kallsyms] [k] native_write_msr
>> |
>> --4.56%--native_write_msr
>> |
>> |--1.68%--snbep_uncore_msr_enable_box
>> | perf_mux_hrtimer_handler
>> | __hrtimer_run_queues
>> | hrtimer_interrupt
>> | smp_apic_timer_interrupt
>> | apic_timer_interrupt
>> | cpuidle_enter_state
>> | cpuidle_enter
>> | do_idle
>> | cpu_startup_entry
>> | start_kernel
>> | secondary_startup_64
>>
>> The root cause is that the multiplex timer is not stopped when perf stat
>> finishes.
>> Perf relies on rotate_necessary to determine whether the multiplex
>> timer should be stopped. The variable is only reset in ctx_sched_out(),
>> which is not enough for system-wide events.
>> Perf stat invokes PERF_EVENT_IOC_DISABLE to stop a system-wide event
>> before closing it:
>>   perf_ioctl()
>>     perf_event_disable()
>>       event_sched_out()
>> Along this path, rotate_necessary is never reset.
>>
>> This is a generic issue; it does not only impact the uncore PMU.
>>
>> Check whether we have been multiplexing. If so, reset rotate_necessary
>> in __perf_event_disable() when the last active event is disabled.
>>
>> Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily")
>> Reported-by: Andi Kleen <ak@...ux.intel.com>
>> Reviewed-by: Andi Kleen <ak@...ux.intel.com>
>> Signed-off-by: Kan Liang <kan.liang@...ux.intel.com>
>> ---
>> kernel/events/core.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 3f1f77de7247..50688de56181 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2242,6 +2242,16 @@ static void __perf_event_disable(struct perf_event *event,
>> update_cgrp_time_from_event(event);
>> }
>>
>> + /*
>> + * If we had been multiplexing,
>> + * stop the rotations for the last active event.
>> + * Only need to check system wide events.
>> + * For task events, it will be checked in ctx_sched_out().
>> + */
>> + if ((cpuctx->ctx.nr_events != cpuctx->ctx.nr_active) &&
>> + (cpuctx->ctx.nr_active == 1))
>> + cpuctx->ctx.rotate_necessary = 0;
>> +
>> if (event == event->group_leader)
>> group_sched_out(event, cpuctx, ctx);
>> else
>
>
> I'm thinking this is wrong.
>
> That is, yes, this fixes the observed problem, but it also misses at
> least one other site. Which seems to suggest we ought to take a
> different approach.
>
> But even with that, I wonder if the actual condition isn't wrong.
> Suppose the event was exclusive, and other events weren't scheduled
> because of that. Then you disable the one exclusive event _and_ kill
> rotation, so then nothing else will ever get on.
>
> So what I think was supposed to happen is rotation killing itself;
> rotation will schedule out the context -- which will clear the flag, and
> then schedule the thing back in -- which will set the flag again when
> needed.
>
> Now, that isn't happening, and I think I see why, because when we drop
> to !nr_active, we terminate ctx_sched_out() before we get to clearing
> the flag, oops!
>
> So how about something like this?
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e453589da97c..7947bd3271a9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2182,6 +2182,7 @@ __perf_remove_from_context(struct perf_event *event,
>
> if (!ctx->nr_events && ctx->is_active) {
> ctx->is_active = 0;
> + ctx->rotate_necessary = 0;
> if (ctx->task) {
> WARN_ON_ONCE(cpuctx->task_ctx != ctx);
> cpuctx->task_ctx = NULL;
The patch fixes the observed problem with the uncore PMU, but it cannot
fix all the cases with the core PMU, especially when the NMI watchdog is
enabled, because ctx->nr_events never drops to 0 while the NMI watchdog
is running, so the multiplex timer keeps re-arming (see the simplified
sketch below the perf report output).
For example,
$echo 1 > /proc/sys/kernel/nmi_watchdog
$sudo perf stat -e ref-cycles,ref-cycles -a sleep 1
 Performance counter stats for 'system wide':

       549,432,900      ref-cycles            (50.01%)
       545,358,967      ref-cycles            (49.99%)

       1.002367894 seconds time elapsed
$perf probe --add perf_rotate_context
$perf record -e probe:perf_rotate_context -aR -g sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.896 MB perf.data (1750 samples) ]
$perf report --stdio
100.00% 100.00% (ffffffffa5604e84)
|
--99.83%--0xffffffffa54000d4
start_secondary
cpu_startup_entry
do_idle
call_cpuidle
cpuidle_enter
cpuidle_enter_state
apic_timer_interrupt
smp_apic_timer_interrupt
hrtimer_interrupt
__hrtimer_run_queues
perf_mux_hrtimer_handler
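
For reference, here is roughly why the timer never stops once
rotate_necessary is left set. This is a simplified sketch based on my
reading of the rotation path in kernel/events/core.c, not the verbatim
code, so details may differ:

/* Simplified sketch, not verbatim kernel code. */
static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
{
	struct perf_event_context *task_ctx = cpuctx->task_ctx;
	int cpu_rotate, task_rotate;

	cpu_rotate  = cpuctx->ctx.rotate_necessary;
	task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;

	/*
	 * For a disabled system-wide session neither flag is ever
	 * cleared, so we never take this early return ...
	 */
	if (!(cpu_rotate || task_rotate))
		return false;

	/* ... rotate the flexible events ... */
	return true;
}

static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
{
	struct perf_cpu_context *cpuctx =
		container_of(hr, struct perf_cpu_context, hrtimer);

	/* ... and the handler keeps re-arming the timer forever. */
	if (perf_rotate_context(cpuctx)) {
		hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
		return HRTIMER_RESTART;
	}

	return HRTIMER_NORESTART;
}

So whichever site ends up clearing rotate_necessary has to be reachable
from the IOC_DISABLE path as well, otherwise the rotation check above
keeps returning true.
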
Thanks,
Kan
> @@ -3074,15 +3075,15 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>
> is_active ^= ctx->is_active; /* changed bits */
>
> - if (!ctx->nr_active || !(is_active & EVENT_ALL))
> - return;
> -
> /*
> * If we had been multiplexing, no rotations are necessary, now no events
> * are active.
> */
> ctx->rotate_necessary = 0;
>
> + if (!ctx->nr_active || !(is_active & EVENT_ALL))
> + return;
> +
> perf_pmu_disable(ctx->pmu);
> if (is_active & EVENT_PINNED) {
> list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
>