Message-ID: <bf8ed70f-e250-4319-86f9-9a7bc9aadb05@amd.com>
Date: Fri, 3 Jan 2025 09:54:09 +0530
From: Ravi Bangoria <ravi.bangoria@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: "mingo@...nel.org" <mingo@...nel.org>,
"lucas.demarchi@...el.com" <lucas.demarchi@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"willy@...radead.org" <willy@...radead.org>,
"acme@...nel.org" <acme@...nel.org>,
"namhyung@...nel.org" <namhyung@...nel.org>,
"mark.rutland@....com" <mark.rutland@....com>,
"alexander.shishkin@...ux.intel.com" <alexander.shishkin@...ux.intel.com>,
"jolsa@...nel.org" <jolsa@...nel.org>,
"irogers@...gle.com" <irogers@...gle.com>,
"adrian.hunter@...el.com" <adrian.hunter@...el.com>,
"kan.liang@...ux.intel.com" <kan.liang@...ux.intel.com>,
Ravi Bangoria <ravi.bangoria@....com>
Subject: Re: [PATCH 19/19] perf: Make perf_pmu_unregister() useable
Hi Peter,
Sorry for the delay. Was on vacation.
>>>> In any case, below sequence of operations triggers a splat when
>>>> perf_mmap_close() tries to access event->rb, event->pmu etc. which
>>>> were already freed by __pmu_detach_event().
>>>>
>>>> Sequence:
>>>>
>>>>   Kernel                        Userspace
>>>>   ------                        ---------
>>>>   perf_pmu_register()
>>>>                                 fd = perf_event_open()
>>>>                                 p = mmap(fd)
>>>>   perf_pmu_unregister()
>>>>                                 munmap(p)
>>>>                                 close(fd)
>>>
>>> Right, let me go have a look. Thanks!
>>
>> Bah, that's a right mess indeed, however did I miss all that.
>>
>> The easiest solution is probably to leave the RB around on detach, but
>> now I need to remember why I did that in the first place :/
>>
>> Oh.. I think I mostly did that to serialize against perf_mmap(), which
>> should reject creating further maps. But we can do that without actually
>> detaching the RB -- we only need to acquire and release mmap_mutex.
>>
>> Ah, there's that perf_event_stop() inside of ring_buffer_attach(), that
>> must not happen after detach, obviously. So that must be dealt with.
>>
>> Hmm, also if we leave ->rb around, then we need to deal with
>> perf_event_set_output(), someone could try and redirect their things
>> into our buffer -- which isn't technically broken, but still weird.
>>
>> Something like the below.
>>
>> How did you test; perf-fuzzer or something?
>
> Prepared a simple test that does pmu register(), unregister() and
> "perf record" in parallel. It's quite dirty, I'll clean it up and
> share it here.
>
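(In the meantime, the userspace half boils down to the splat sequence
above in a tight loop. A rough sketch below -- not the actual test: the
module half that register()s/unregister()s the test PMU is omitted, and
TEST_PMU_TYPE is a placeholder for whatever
/sys/bus/event_source/devices/<test-pmu>/type reports.)

/* Userspace half of the stress test (sketch): open/mmap/munmap/close
 * against a test PMU in a loop, racing with a module that repeatedly
 * register()s/unregister()s that PMU (module not shown). */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define TEST_PMU_TYPE	10		/* placeholder; read from sysfs */
#define RB_SIZE		(2 * 4096)	/* header page + one data page */

int main(void)
{
	struct perf_event_attr attr;

	for (;;) {
		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = TEST_PMU_TYPE;

		int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
		if (fd < 0)
			continue;	/* PMU currently unregistered */

		void *p = mmap(NULL, RB_SIZE, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		/* perf_pmu_unregister() may land anywhere around here */
		if (p != MAP_FAILED)
			munmap(p, RB_SIZE);
		close(fd);
	}
}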
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1742,7 +1742,7 @@ static inline bool needs_branch_stack(st
>>
>>  static inline bool has_aux(struct perf_event *event)
>>  {
>> -	return event->pmu->setup_aux;
>> +	return event->pmu && event->pmu->setup_aux;
>>  }
>>
>>  static inline bool has_aux_action(struct perf_event *event)
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -5409,7 +5409,6 @@ static void _free_event(struct perf_even
>>  	security_perf_event_free(event);
>>
>>  	if (event->rb) {
>> -		WARN_ON_ONCE(!event->pmu);
>>  		/*
>>  		 * Can happen when we close an event with re-directed output.
>>  		 *
>> @@ -12023,7 +12022,10 @@ static void __pmu_detach_event(struct pm
>>  	 */
>>  	scoped_guard (mutex, &event->mmap_mutex) {
>>  		WARN_ON_ONCE(pmu->event_unmapped);
>> -		ring_buffer_attach(event, NULL);
>> +		/*
>> +		 * Mostly an empty lock sequence, such that perf_mmap(), which
>> +		 * relies on mmap_mutex, is sure to observe the state change.
>> +		 */
>>  	}
>>
>>  	perf_event_free_bpf_prog(event);
>> @@ -12823,6 +12825,9 @@ perf_event_set_output(struct perf_event
>>  		goto unlock;
>>
>>  	if (output_event) {
>> +		if (output_event->state <= PERF_EVENT_STATE_REVOKED)
>> +			goto unlock;
>> +
>>  		/* get the rb we want to redirect to */
>>  		rb = ring_buffer_get(output_event);
>>  		if (!rb)
>
> I needed this additional diff on top of your change. With it, my test
> survives. The perf_mmap_close() change seems correct; I'm not sure about
> the perf_mmap() one, so I'll inspect the code further.
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6540,7 +6540,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  	bool detach_rest = false;
>
>  	/* FIXIES vs perf_pmu_unregister() */
> -	if (event->pmu->event_unmapped)
> +	if (event->pmu && event->pmu->event_unmapped)
>  		event->pmu->event_unmapped(event, vma->vm_mm);
>
>  	/*
> @@ -6873,7 +6873,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
>  	vma->vm_ops = &perf_mmap_vmops;
>
> -	if (!ret && event->pmu->event_mapped)
> +	if (!ret && event->pmu && event->pmu->event_mapped)
>  		event->pmu->event_mapped(event, vma->vm_mm);
>
>  	return ret;
Both of these are incorrect. They just reduce the race window; they don't
actually close it, since event->pmu can still be cleared between the NULL
check and the dereference. Anyway, I could spot a few other races:
1) A race between event creation and perf_pmu_unregister(). Every event
   creation path (perf_event_open(), perf_event_create_kernel_counter()
   and inherit_event()) allocates the event with perf_event_alloc(), which
   adds the event to the pmu->events list. However, the event is still
   immature at that point; for example, event->ctx is still NULL. In the
   meantime, perf_pmu_unregister() finds this event and tries to detach it:
  perf_event_open()                               perf_pmu_unregister()
    event = perf_event_alloc()                      pmu_detach_event(event)
      list_add(&event->pmu_list, &pmu->events);       perf_event_ctx_lock(event)
    /*                                                  perf_event_ctx_lock_nested(ctx)
     * event->ctx is NULL.                                ctx = READ_ONCE(event->ctx); /* event->ctx is NULL */
     */                                                   if (!refcount_inc_not_zero(&ctx->refcount)) { /* Crash */
    perf_install_in_context(ctx, event);
2) A race with perf_event_release_kernel(). perf_event_release_kernel()
   prepares a separate "free_list" of all child events under ctx->mutex
   and event->child_mutex. However, the "free_list" reuses the same
   "event->child_list" member for its entries. OTOH, perf_pmu_unregister()
   ultimately calls __perf_remove_from_context() with DETACH_CHILD, which
   checks whether the event being removed is a child event, and if so,
   detaches the child from its parent using list_del_init(&event->child_list);
   i.e. two code paths doing list_del() on the same list entry (a minimal
   model of that failure mode follows the diagram below):
  perf_event_release_kernel()                                          perf_pmu_unregister()
    /* Move child events to free_list */                                 ...
    list_for_each_entry_safe(child, tmp, &free_list, child_list) {      perf_remove_from_context() /* with DETACH_CHILD */
      ...                                                                  __perf_remove_from_context()
      list_del(&child->child_list);                                          perf_child_detach()
                                                                               list_del_init(&event->child_list);
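To see why that's fatal, here is a minimal userspace model of a double
list_del() on one entry (not kernel code; the list ops and poison values
are simplified from include/linux/list.h). The first delete unlinks the
entry and poisons its pointers; the second one dereferences the poison
and crashes, which is the corruption this race can produce:

/* Minimal model: two paths doing list_del() on the same entry. */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)0x100)	/* simplified */
#define LIST_POISON2 ((struct list_head *)0x122)	/* simplified */

static void list_del(struct list_head *entry)
{
	/* On a second call, entry->next/prev already hold poison
	 * values, so these stores dereference poison and crash. */
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

int main(void)
{
	struct list_head head, child;

	/* circular list: head <-> child */
	head.next = head.prev = &child;
	child.next = child.prev = &head;

	list_del(&child);	/* perf_event_release_kernel() side: ok */
	printf("first list_del() fine, entry now poisoned\n");
	list_del(&child);	/* perf_child_detach() side: crashes */
	return 0;
}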
3) A WARN(), not a race. perf_pmu_unregister() increments event->refcount
   before detaching the event. If perf_pmu_unregister() picks up a child
   event, perf_event_exit_event(), called via perf_pmu_unregister(), will
   try to free it. Since event->refcount would be 2 at that point,
   free_event() will trigger a WARN():
  perf_pmu_unregister()
    event = pmu_get_event()        /* event->refcount => 2 */
    ...
    perf_event_exit_event()
      if (parent_event) {          /* true, because `event` is a child */
        free_event(event);
          if (WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1,
                   "unexpected event refcount: %ld; ptr=%p\n",
                   atomic_long_read(&event->refcount), event))
4) A race with perf_event_set_bpf_prog(). perf_event_set_bpf_prog() might
   be in the process of setting event->prog, whereas perf_pmu_unregister(),
   which internally calls perf_event_free_bpf_prog(), will clear the
   event->prog pointer:
  perf_pmu_unregister()                  perf_event_set_bpf_prog()
    ...                                    perf_event_set_bpf_handler()
    perf_event_free_bpf_prog()               event->prog = prog;
      event->prog = NULL;
I've yet to inspect other code paths, so there might be more races.

Thinking out loud, a plausible brute-force solution is to introduce an
event-specific lock, acquire it right at the beginning of all these code
paths, and release it at the end (a rough model of the idea is sketched
below). Such an event->lock shouldn't create any real contention, since
an event mostly goes through only one code path at any point in time.
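As a standalone illustration of the idea (plain pthreads, every name
invented; obviously not how it would look in the kernel), the model
below closes the race-1 window: the event is "published" while the lock
is already held, so the detach side can never observe it
half-initialized:

/* Userspace model of the proposed per-event lock. The creation path
 * publishes the event and installs its ctx under the lock; the detach
 * path first finds the published event, then takes the same lock, so
 * it always sees a fully initialized event. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct event {
	pthread_mutex_t lock;	/* models the proposed event->lock */
	int ctx_installed;	/* models event->ctx != NULL */
};

static struct event ev = { .lock = PTHREAD_MUTEX_INITIALIZER };
static int published;		/* models the event being on pmu->events */

static void *creation_path(void *arg)	/* models perf_event_open() */
{
	(void)arg;
	pthread_mutex_lock(&ev.lock);
	__atomic_store_n(&published, 1, __ATOMIC_RELEASE); /* list_add() */
	ev.ctx_installed = 1;	/* ctx installed before the lock drops */
	pthread_mutex_unlock(&ev.lock);
	return NULL;
}

static void *detach_path(void *arg)	/* models perf_pmu_unregister() */
{
	(void)arg;
	while (!__atomic_load_n(&published, __ATOMIC_ACQUIRE))
		;	/* models pmu_get_event() scanning pmu->events */
	pthread_mutex_lock(&ev.lock);
	assert(ev.ctx_installed);	/* the race-1 window is gone */
	pthread_mutex_unlock(&ev.lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, creation_path, NULL);
	pthread_create(&b, NULL, detach_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("detach always saw a mature event");
	return 0;
}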
Thanks,
Ravi