linux-kernel - Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <32888c33-c286-c600-66cb-8b1b03beeb8b@linux.intel.com>
Date:   Mon, 1 Mar 2021 08:20:48 -0500
From:   "Liang, Kan" <kan.liang@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Vince Weaver <vincent.weaver@...ne.edu>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        Stephane Eranian <eranian@...gle.com>
Subject: Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()

On 2/11/2021 9:53 AM, Peter Zijlstra wrote:
> 
> Kan, do you have time to look at this?
> 
> On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote:
>> On Thu, 28 Jan 2021, Vince Weaver wrote:
>>
>>> the perf_fuzzer has turned up a repeatable crash on my haswell system.
>>>
>>> addr2line is not being very helpful, it points to DECLARE_PER_CPU_FIRST.
>>> I'll investigate more when I have the chance.
>>
>> so I poked around some more.
>>
>> This seems to be caused in
>>
>>     __intel_pmu_pebs_event()
>> 	get_next_pebs_record_by_bit()		ds.c line 1639
>> 		get_pebs_status(at)		ds.c line 1317
>> 			return ((struct pebs_record_nhm *)n)->status;
>>
>> where "n" has the value of 0xc0 rather than a proper pointer.
>>

I think I find the suspicious patch.
The commt id 01330d7288e00 ("perf/x86: Allow zero PEBS status with only 
single active event")
https://lore.kernel.org/lkml/tip-01330d7288e0050c5aaabc558059ff91589e67cd@git.kernel.org/
The patch is an SW workaround for some old CPUs (HSW and earlier), which 
may set 0 to the PEBS status. It adds a check in the 
intel_pmu_drain_pebs_nhm(). It tries to minimize the impact of the 
defect by avoiding dropping the PEBS records which have PEBS status 0.
But, it doesn't correct the PEBS status, which may bring problems,
especially for the large PEBS.
It's possible that all the PEBS records in a large PEBS have the PEBS 
status 0. If so, the first get_next_pebs_record_by_bit() in the 
__intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large 
PEBS, the 'count' parameter must > 1. The second 
get_next_pebs_record_by_bit() will crash.

Could you please revert the patch and check whether it fixes your issue?

Thanks,
Kan