linux-kernel - Re: [RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when multiple PMCs are active


Message-ID: <20190315120311.GX5996@hirez.programming.kicks-ass.net>
Date:   Fri, 15 Mar 2019 13:03:11 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     "Lendacky, Thomas" <Thomas.Lendacky@....com>
Cc:     "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Namhyung Kim <namhyung@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Jiri Olsa <jolsa@...hat.com>
Subject: Re: [RFC PATCH 2/2] x86/perf/amd: Resolve NMI latency issues when
 multiple PMCs are active

On Mon, Mar 11, 2019 at 04:48:51PM +0000, Lendacky, Thomas wrote:
> @@ -467,6 +470,45 @@ static void amd_pmu_wait_on_overflow(int idx, u64 config)
>  	}
>  }
>  
> +/*
> + * Because of NMI latency, if multiple PMC counters are active we need to take
> + * into account that multiple PMC overflows can generate multiple NMIs but be
> + * handled by a single invocation of the NMI handler (think PMC overflow while
> + * in the NMI handler). This could result in subsequent unknown NMI messages
> + * being issued.
> + *
> + * Attempt to mitigate this by using the number of active PMCs to determine
> + * whether to return NMI_HANDLED if the perf NMI handler did not handle/reset
> + * any PMCs. The per-CPU perf_nmi_counter variable is set to a minimum of one
> + * less than the number of active PMCs or 2. The value of 2 is used in case the
> + * NMI does not arrive at the APIC in time to be collapsed into an already
> + * pending NMI.

LAPIC I really do hope?!

> + */
> +static int amd_pmu_mitigate_nmi_latency(unsigned int active, int handled)
> +{
> +	/* If multiple counters are not active return original handled count */
> +	if (active <= 1)
> +		return handled;

Should we not reset perf_nmi_counter in this case?
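
To make the question concrete, here is a minimal user-space sketch of the quoted function with such a reset added. It is only a model under stated assumptions, not the submitted patch: the plain global stands in for the per-CPU perf_nmi_counter, the name mitigate_nmi_latency_model() and the min_uint() helper are hypothetical, and the enum values are local stand-ins for the kernel's NMI return codes.

```c
#include <assert.h>

/* Local stand-ins for the kernel's NMI return codes (assumption). */
enum { NMI_DONE = 0, NMI_HANDLED = 1 };

/* Plain global standing in for the per-CPU perf_nmi_counter. */
unsigned int perf_nmi_counter;

/* Hypothetical helper replacing min_t(unsigned int, ...). */
unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Model of the quoted function, plus a reset in the single-counter path. */
int mitigate_nmi_latency_model(unsigned int active, int handled)
{
	if (active <= 1) {
		/*
		 * Hypothetical reset: drop any credit left over from a
		 * period when several counters were active, so that it
		 * cannot swallow an unrelated NMI later on.
		 */
		perf_nmi_counter = 0;
		return handled;
	}

	/* A counter was handled: record how many more NMIs may follow. */
	if (handled) {
		perf_nmi_counter = min_uint(2, active - 1);
		return handled;
	}

	/* Nothing handled and no credit left: not ours. */
	if (!perf_nmi_counter)
		return NMI_DONE;

	/* Nothing handled, but a late NMI was expected: claim it. */
	perf_nmi_counter--;
	return NMI_HANDLED;
}
```

Without the reset, a credit of 2 recorded while three counters were active would survive a drop to a single active counter and could still claim an NMI once several counters become active again.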

> +
> +	/*
> +	 * If a counter was handled, record the number of possible remaining
> +	 * NMIs that can occur.
> +	 */
> +	if (handled) {
> +		this_cpu_write(perf_nmi_counter,
> +			       min_t(unsigned int, 2, active - 1));
> +
> +		return handled;
> +	}
> +
> +	if (!this_cpu_read(perf_nmi_counter))
> +		return NMI_DONE;
> +
> +	this_cpu_dec(perf_nmi_counter);
> +
> +	return NMI_HANDLED;
> +}
> +
>  static struct event_constraint *
>  amd_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>  			  struct perf_event *event)
> @@ -689,6 +731,7 @@ static __initconst const struct x86_pmu amd_pmu = {
>  
>  	.amd_nb_constraints	= 1,
>  	.wait_on_overflow	= amd_pmu_wait_on_overflow,
> +	.mitigate_nmi_latency	= amd_pmu_mitigate_nmi_latency,
>  };

Again, you could just do amd_pmu_handle_irq() and avoid an extra
callback.

Anyway, we already had code to deal with spurious NMIs from AMD; see
commit:

  63e6be6d98e1 ("perf, x86: Catch spurious interrupts after disabling counters")

That looks to be doing very much the same thing. Why then do you still
need this on top?
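
For comparison, the idea behind that commit can be modelled roughly as follows. This is a simplified user-space sketch, not the kernel code: struct cpu_model and the model_*() functions are hypothetical names, and the two bitmasks only loosely mirror the active_mask and running fields of struct cpu_hw_events. The assumption is that a counter which has been stopped may still have one interrupt in flight, and the handler claims that interrupt exactly once.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-CPU PMU state. */
struct cpu_model {
	unsigned long active_mask;	/* counters currently enabled */
	unsigned long running;		/* counters started, NMI may be in flight */
};

/* Starting a counter marks it both active and running. */
void model_start(struct cpu_model *c, int idx)
{
	c->active_mask |= 1UL << idx;
	c->running |= 1UL << idx;
}

/* Stopping a counter clears only the active bit. */
void model_stop(struct cpu_model *c, int idx)
{
	c->active_mask &= ~(1UL << idx);
}

/*
 * Model of the interrupt handler loop. 'overflowed' is a bitmask of
 * counters that really overflowed; the return value is the number of
 * counters the handler accounts for.
 */
int model_handle_irq(struct cpu_model *c, int num_counters,
		     unsigned long overflowed)
{
	int handled = 0;
	int idx;

	for (idx = 0; idx < num_counters; idx++) {
		unsigned long bit = 1UL << idx;

		if (!(c->active_mask & bit)) {
			/* Inactive counter: claim one late interrupt. */
			if (c->running & bit) {
				c->running &= ~bit;
				handled++;
			}
			continue;
		}

		if (overflowed & bit)
			handled++;
	}

	return handled;
}
```

In this model the extra credit only exists for a counter that has been stopped, whereas the counter in the quoted patch is armed whenever more than one PMC is active and an overflow has just been handled.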