Message-ID: <1deec4c5-f963-5772-2a0d-826016dc0170@arm.com>
Date: Thu, 22 Oct 2020 12:14:43 +0100
From: Suzuki Poulose <suzuki.poulose@....com>
To: Sai Prakash Ranjan <saiprakash.ranjan@...eaurora.org>
Cc: Mathieu Poirier <mathieu.poirier@...aro.org>,
mike.leach@...aro.org, coresight@...ts.linaro.org,
swboyd@...omium.org, linux-arm-msm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
denik@...gle.com, leo.yan@...aro.org, peterz@...radead.org
Subject: Re: [PATCH 1/2] coresight: tmc-etf: Fix NULL ptr dereference in
tmc_enable_etf_sink_perf()
On 10/22/20 12:07 PM, Sai Prakash Ranjan wrote:
> On 2020-10-22 14:57, Suzuki Poulose wrote:
>> On 10/22/20 9:02 AM, Sai Prakash Ranjan wrote:
>>> On 2020-10-21 15:38, Suzuki Poulose wrote:
>>>> On 10/21/20 8:29 AM, Sai Prakash Ranjan wrote:
>>>>> On 2020-10-20 21:40, Sai Prakash Ranjan wrote:
>>>>>> On 2020-10-14 21:29, Sai Prakash Ranjan wrote:
>>>>>>> On 2020-10-14 18:46, Suzuki K Poulose wrote:
>>>>>>>> On 10/14/2020 10:36 AM, Sai Prakash Ranjan wrote:
>>>>>>>>> On 2020-10-13 22:05, Suzuki K Poulose wrote:
>>>>>>>>>> On 10/07/2020 02:00 PM, Sai Prakash Ranjan wrote:
>>>>>>>>>>> There was a report of NULL pointer dereference in ETF enable
>>>>>>>>>>> path for perf CS mode with PID monitoring. It is almost 100%
>>>>>>>>>>> reproducible when the process to monitor is something very
>>>>>>>>>>> active such as chrome and with ETF as the sink and not ETR.
>>>>>>>>>>> Currently in a bid to find the pid, the owner is dereferenced
>>>>>>>>>>> via task_pid_nr() call in tmc_enable_etf_sink_perf() and with
>>>>>>>>>>> owner being NULL, we get a NULL pointer dereference.
>>>>>>>>>>>
>>>>>>>>>>> Looking at the ETR and other places in the kernel, ETF and the
>>>>>>>>>>> ETB are the only places trying to dereference the task(owner)
>>>>>>>>>>> in tmc_enable_etf_sink_perf() which is also called from the
>>>>>>>>>>> sched_in path as in the call trace. Owner(task) is NULL even
>>>>>>>>>>> in the case of ETR in tmc_enable_etr_sink_perf(), but since we
>>>>>>>>>>> cache the PID in alloc_buffer() callback and it is done as part
>>>>>>>>>>> of etm_setup_aux() when allocating buffer for ETR sink, we never
>>>>>>>>>>> dereference this NULL pointer and we are safe. So let's do the
>>>>>>>>>>
>>>>>>>>>> The patch is necessary to fix some of the issues. But I feel it is
>>>>>>>>>> not complete. Why is it safe earlier and not later? I believe we are
>>>>>>>>>> simply reducing the chances of hitting the issue, by doing this
>>>>>>>>>> earlier than later. I would say we better fix all instances to make
>>>>>>>>>> sure that the event->owner is valid. (e.g., I can see that for
>>>>>>>>>> kernel events event->owner == -1 ?)
>>>>>>>>>>
>>>>>>>>>> struct task_struct *tsk = READ_ONCE(event->owner);
>>>>>>>>>>
>>>>>>>>>> if (!tsk || is_kernel_event(event))
>>>>>>>>>>         /* skip ? */
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking at it some more, is_kernel_event() is not exposed
>>>>>>>>> outside events core and probably for good reason. Why do
>>>>>>>>> we need to check for this and not just tsk?
>>>>>>>>
>>>>>>>> Because the event->owner could be :
>>>>>>>>
>>>>>>>> = NULL
>>>>>>>> = -1UL // kernel event
>>>>>>>> = valid.
>>>>>>>>
>>>>>>>
>>>>>>> Yes I understood that part, but here we were trying to
>>>>>>> fix the NULL pointer dereference right and hence the
>>>>>>> question as to why we need to check for kernel events?
>>>>>>> I am no expert in perf but I don't see anywhere in the
>>>>>>> kernel checking for is_kernel_event(), so I am a bit
>>>>>>> skeptical if exporting that is actually right or not.
>>>>>>>
>>>>>>
>>>>>> I have stress tested with the original patch many times
>>>>>> now, i.e., without a check for event->owner and is_kernel_event()
>>>>>> and didn't observe any crash. Plus on ETR where this was already
>>>>>> done, no crashes were reported till date and with ETF, the issue
>>>>>> was quickly reproducible, so I am fairly confident that this
>>>>>> doesn't just delay the original issue but actually fixes
>>>>>> it. I will run an overnight test again to confirm this.
>>>>>>
>>>>>
>>>>> I ran the overnight test which collected around 4G of data (see below),
>>>>> with the following small change to see if the two cases
>>>>> (event->owner=NULL and is_kernel_event()) are triggered
>>>>> with suggested changes and it didn't trigger at all.
>>>>> Do we still need those additional checks?
>>>>>
>>>>
>>>> Yes. Please see perf_event_create_kernel_counter(), which is
>>>> an exported function allowing any kernel code (including modules)
>>>> to use the PMU (just like the userspace perf tool would do).
>>>> Just because your use case doesn't trigger this (because
>>>> you don't run something that can trigger this) doesn't mean
>>>> this can't be triggered.
>>>>
>>>
>>> Thanks for that pointer, I will add them in the next version.
>>>
>>
>> And instead of redefining TASK_TOMBSTONE in the driver, you
>> may simply use IS_ERR_OR_NULL(tsk) to cover both NULL case
>> and kernel event.
>>
>
> Ugh sorry, sent out v2 exporting is_kernel_event() before seeing
> this comment, I will resend.

Saw that. I would say wait until someone complains about it. If people
are OK with exporting is_kernel_event(), that is fine; I guess it will be
useful. You could fall back to the IS_ERR_OR_NULL() approach if there is
resistance.
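
For reference, a minimal sketch of that IS_ERR_OR_NULL() fallback, written
as a standalone userspace mock rather than driver code. The two structs,
is_err_or_null() and owner_pid_or_invalid() are hypothetical stand-ins and
not the actual tmc-etf change; MAX_ERRNO and TASK_TOMBSTONE are assumed to
mirror the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for kernel definitions (assumptions, illustration only). */
#define MAX_ERRNO	4095
#define TASK_TOMBSTONE	((void *)-1L)	/* owner value used for kernel events */

struct task_struct { int pid; };			/* mock, not the real struct */
struct perf_event { struct task_struct *owner; };	/* mock, not the real struct */

/* Same idea as the kernel's IS_ERR_OR_NULL(): NULL or an error-range pointer. */
static bool is_err_or_null(const void *ptr)
{
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical helper: return the owner's pid, or -1 when the event has
 * no usable owner (owner is NULL, or it is a kernel event).
 */
static int owner_pid_or_invalid(const struct perf_event *event)
{
	/* Real driver code would read this with READ_ONCE(event->owner). */
	struct task_struct *tsk = event->owner;

	if (is_err_or_null(tsk))
		return -1;	/* skip: nothing safe to dereference */

	return tsk->pid;
}
```

Since -1 falls inside the error-pointer range, a single check covers both
the NULL owner and the kernel-event case, without the driver redefining
TASK_TOMBSTONE or needing is_kernel_event() exported.
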
Cheers
Suzuki