linux-kernel - Re: [RESEND PATCH V2 1/4] perf/x86/intel: fix event update for auto-reload

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cba05d13-10dc-19a8-733d-a2b520dcfa5d@linux.intel.com>
Date:   Wed, 24 Jan 2018 10:45:03 -0500
From:   "Liang, Kan" <kan.liang@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>, kan.liang@...el.com
Cc:     mingo@...hat.com, acme@...nel.org, linux-kernel@...r.kernel.org,
        tglx@...utronix.de, jolsa@...hat.com, eranian@...gle.com,
        ak@...ux.intel.com
Subject: Re: [RESEND PATCH V2 1/4] perf/x86/intel: fix event update for
 auto-reload



On 1/24/2018 7:26 AM, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 07:15:13AM -0800, kan.liang@...el.com wrote:
> 
>> The formula to calculate the event->count is as below:
> 
>>    event->count = period left from last time +
>>                   (reload_times - 1) * reload_val +
>>                   latency of PMI handler
>>
>> prev_count is the last observed hardware counter value. Just the same as
>> non-auto-reload, its absolute value is the period of the first record.
>> It should not update with each reload. Because it doesn't 'observe' the
>> hardware counter for each auto-reload.
>>
>> For the second and later records, the period is exactly the reload
>> value. Just need to simply add (reload_times - 1) * reload_val to
>> event->count.
>>
>> The calculation of the latency of PMI handler is a little bit different
>> as non-auto-reload. Because the start point is -reload_value. It needs
>> to be adjusted by adding reload_value.
>> The period_left needs to do the same adjustment.
> 
> What's this about the PMI latency, we don't care about that in any other
> situation, right? Sure the PMI takes a bit of time, but we're not
> correcting for that anywhere, so why start now?

The latency is the gap between PEBS hardware is armed and the NMI is
handled.

Sorry for the misleading description.
I will rewrite the description in V3.


> 
>> There is nothing need to do in x86_perf_event_set_period(). Because it
>> is fixed period. The period_left is already adjusted.
> 
> Fixes tag is missing.

Will add it in V3.

> 
>> Signed-off-by: Kan Liang <kan.liang@...ux.intel.com>
>> ---
>>   arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 67 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 3674a4b..cc1f373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
>>   	return NULL;
>>   }
>>   
>> +/*
>> + * Specific intel_pmu_save_and_restart() for auto-reload.
>> + */
>> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
>> +					     u64 reload_val,
>> +					     int reload_times)
>> +{
>> +	struct hw_perf_event *hwc = &event->hw;
>> +	int shift = 64 - x86_pmu.cntval_bits;
>> +	u64 prev_raw_count, new_raw_count;
>> +	u64 delta;
>> +
>> +	if ((reload_times == 0) || (reload_val == 0))
>> +		return intel_pmu_save_and_restart(event);
> 
> Like Jiri, I find this confusing at best. If we need to call that one,
> you shouldn't have called this function to begin with.
> 
> At best, have a WARN here or something.
> 

Will add a WARN in V3.

>> +
>> +	/*
>> +	 * Careful: an NMI might modify the previous event value.
>> +	 *
>> +	 * Our tactic to handle this is to first atomically read and
>> +	 * exchange a new raw count - then add that new-prev delta
>> +	 * count to the generic event atomically:
>> +	 */
> 
> For now this seems to only get called from *drain_pebs* which afaict
> only happens when we've disabled the PMU (either from sched_task or
> PMI).
> 
> Now, you want to put this in the pmu::read() path, and that does not
> disable the PMU, but I don't think we can drain the PEBS buffer while
> its active, that's too full of races, so even there you'll have to
> disable stuff.
> 
> So I don't think this is correct/desired for this case.
>

Could we do something as below in the pmu::read() path? (not test yet)

if (pebs_needs_sched_cb(cpuc)) {
	perf_pmu_disable();
	intel_pmu_drain_pebs_buffer();
	x86_perf_event_update();
	perf_pmu_enable();
	return;
}

x86_perf_event_update() is to handle the case !reload_times which you 
mentioned as below. Because the PMU is disabled, nothing will be changed 
for other cases.


>> +again:
>> +	prev_raw_count = local64_read(&hwc->prev_count);
>> +	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>> +
>> +	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
>> +					new_raw_count) != prev_raw_count)
>> +		goto again;
>> +
>> +	/*
>> +	 * Now we have the new raw value and have updated the prev
>> +	 * timestamp already. We can now calculate the elapsed delta
>> +	 * (event-)time and add that to the generic event.
>> +	 *
>> +	 * Careful, not all hw sign-extends above the physical width
>> +	 * of the count.
>> +	 *
>> +	 *   event->count = period left from last time +
>> +	 *                  (reload_times - 1) * reload_val +
>> +	 *                  latency of PMI handler
>           *
>> +	 * The period left from last time can be got from -prev_count.
>> +	 * The start points of counting is always -reload_val.
>> +	 * So the real latency of PMI handler is reload_val + new_raw_count.
>> +	 */
> 
> That is very confused, the PMI latency is utterly unrelated to anything
> you do here.

Will fix the confusing comments in V3.

> 
>> +	delta = (reload_val << shift) + (new_raw_count << shift) -
>> +		(prev_raw_count << shift);
>> +	delta >>= shift;
>> +
>> +	local64_add(reload_val * (reload_times - 1), &event->count);
>> +	local64_add(delta, &event->count);
> 
> And this is still wrong I think. Consider the case where !reload_times.
> 
> We can easily call pmu::read() twice in one period. In that case we
> should increment count with (new - prev).
> 
> Only once we get a new sample and are known to have wrapped, do we need
> to consider that wrap.

The code as above should fix this issue.

> 
>> +	local64_sub(delta, &hwc->period_left);
>> +
>> +	return x86_perf_event_set_period(event);
>> +}
>> +
>>   static void __intel_pmu_pebs_event(struct perf_event *event,
>>   				   struct pt_regs *iregs,
>>   				   void *base, void *top,
>>   				   int bit, int count)
>>   {
>> +	struct hw_perf_event *hwc = &event->hw;
>>   	struct perf_sample_data data;
>>   	struct pt_regs regs;
>>   	void *at = get_next_pebs_record_by_bit(base, top, bit);
>>   
>> -	if (!intel_pmu_save_and_restart(event) &&
>> -	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
>> +	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
>> +		/*
>> +		 * Now, auto-reload is only enabled in fixed period mode.
>> +		 * The reload value is always hwc->sample_period.
>> +		 * May need to change it, if auto-reload is enabled in
>> +		 * freq mode later.
>> +		 */
>> +		intel_pmu_save_and_restart_reload(event, hwc->sample_period,
>> +						  count);
> 
> Since you pass in @event, hwc->sample_period is already available to it,
> no need to pass that in as well.

OK. I will change it.

Thanks,
Kan


> 
>> +	} else if (!intel_pmu_save_and_restart(event))
>>   		return;
>>   
>>   	while (count > 1) {