linux-kernel - Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only single active event"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YD/cnnuh/AHOL8hV@hirez.programming.kicks-ass.net>
Date:   Wed, 3 Mar 2021 19:59:42 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     kan.liang@...ux.intel.com
Cc:     mingo@...hat.com, linux-kernel@...r.kernel.org, acme@...nel.org,
        mark.rutland@....com, alexander.shishkin@...ux.intel.com,
        jolsa@...hat.com, namhyung@...nel.org, eranian@...gle.com,
        ak@...ux.intel.com, stable@...r.kernel.org
Subject: Re: [PATCH] Revert "perf/x86: Allow zero PEBS status with only
 single active event"

On Wed, Mar 03, 2021 at 05:42:18AM -0800, kan.liang@...ux.intel.com wrote:

> For some old CPUs (HSW and earlier), the PEBS status in a PEBS record
> may be mistakenly set to 0. To minimize the impact of the defect, the
> commit was introduced to try to avoid dropping the PEBS record for some
> cases. It adds a check in the intel_pmu_drain_pebs_nhm(), and updates
> the local pebs_status accordingly. However, it doesn't correct the PEBS
> status in the PEBS record, which may trigger the crash, especially for
> the large PEBS.
> 
> It's possible that all the PEBS records in a large PEBS have the PEBS
> status 0. If so, the first get_next_pebs_record_by_bit() in the
> __intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large
> PEBS, the 'count' parameter must > 1. The second
> get_next_pebs_record_by_bit() will crash.
> 
> Two solutions were considered to fix the crash.
> - Keep the SW workaround and add extra checks in the
>   get_next_pebs_record_by_bit() to workaround the issue. The
>   get_next_pebs_record_by_bit() is a critical path. The extra checks
>   will bring extra overhead for the latest CPUs which don't have the
>   defect. Also, the defect can only be observed on some old CPUs
>   (For example, the issue can be reproduced on an HSW client, but I
>   didn't observe the issue on my Haswell server machine.). The impact
>   of the defect should be limit.
>   This solution is dropped.
> - Drop the SW workaround and revert the commit.
>   It seems that the commit never works, because the PEBS status in the
>   PEBS record never be changed. The get_next_pebs_record_by_bit() only
>   checks the PEBS status in the PEBS record. The record is dropped
>   eventually. Reverting the commit should not change the current
>   behavior.

> +++ b/arch/x86/events/intel/ds.c
> @@ -2000,18 +2000,6 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>  			continue;
>  		}
>  
> -		/*
> -		 * On some CPUs the PEBS status can be zero when PEBS is
> -		 * racing with clearing of GLOBAL_STATUS.
> -		 *
> -		 * Normally we would drop that record, but in the
> -		 * case when there is only a single active PEBS event
> -		 * we can assume it's for that event.
> -		 */
> -		if (!pebs_status && cpuc->pebs_enabled &&
> -			!(cpuc->pebs_enabled & (cpuc->pebs_enabled-1)))
> -			pebs_status = cpuc->pebs_enabled;

Wouldn't something like:

			pebs_status = p->status = cpus->pebs_enabled;

actually fix things without adding overhead?