Message-ID: <bf8ed70f-e250-4319-86f9-9a7bc9aadb05@amd.com>
Date: Fri, 3 Jan 2025 09:54:09 +0530
From: Ravi Bangoria <ravi.bangoria@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: "mingo@...nel.org" <mingo@...nel.org>,
"lucas.demarchi@...el.com" <lucas.demarchi@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"willy@...radead.org" <willy@...radead.org>,
"acme@...nel.org" <acme@...nel.org>,
"namhyung@...nel.org" <namhyung@...nel.org>,
"mark.rutland@....com" <mark.rutland@....com>,
"alexander.shishkin@...ux.intel.com" <alexander.shishkin@...ux.intel.com>,
"jolsa@...nel.org" <jolsa@...nel.org>,
"irogers@...gle.com" <irogers@...gle.com>,
"adrian.hunter@...el.com" <adrian.hunter@...el.com>,
"kan.liang@...ux.intel.com" <kan.liang@...ux.intel.com>,
Ravi Bangoria <ravi.bangoria@....com>
Subject: Re: [PATCH 19/19] perf: Make perf_pmu_unregister() useable
Hi Peter,
Sorry for the delay. Was on vacation.
>>>> In any case, below sequence of operations triggers a splat when
>>>> perf_mmap_close() tries to access event->rb, event->pmu etc. which
>>>> were already freed by __pmu_detach_event().
>>>>
>>>> Sequence:
>>>>
>>>>   Kernel                        Userspace
>>>>   ------                        ---------
>>>>   perf_pmu_register()
>>>>                                 fd = perf_event_open()
>>>>                                 p = mmap(fd)
>>>>   perf_pmu_unregister()
>>>>                                 munmap(p)
>>>>                                 close(fd)
>>>
>>> Right, let me go have a look. Thanks!
>>
>> Bah, that's a right mess indeed, however did I miss all that.
>>
>> The easiest solution is probably to leave the RB around on detach, but
>> now I need to remember why I did that in the first place :/
>>
>> Oh.. I think I mostly did that to serialize against perf_mmap(), which
>> should reject creating further maps. But we can do that without actually
>> detaching the RB -- we only need to acquire and release mmap_mutex.
>>
>> Ah, there's that perf_event_stop() inside of ring_buffer_attach(), that
>> must not happen after detach, obviously. So that must be dealt with.
>>
>> Hmm, also if we leave ->rb around, then we need to deal with
>> perf_event_set_output(), someone could try and redirect their things
>> into our buffer -- which isn't technically broken, but still weird.
>>
>> Something like the below.
>>
>> How did you test; perf-fuzzer or something?
>
> Prepared a simple test that does pmu register(), unregister() and
> "perf record" in parallel. It's quite dirty, I'll clean it up and
> share it here.
>
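(In the meantime, the userspace half boils down to the splat sequence
above in a tight loop. A rough sketch below -- not the actual test: the
module half that register()s/unregister()s the test PMU is omitted, and
TEST_PMU_TYPE is a placeholder for whatever
/sys/bus/event_source/devices/<test-pmu>/type reports.)

/* Userspace half of the stress test (sketch): open/mmap/munmap/close
 * against a test PMU in a loop, racing with a module that repeatedly
 * register()s/unregister()s that PMU (module not shown). */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define TEST_PMU_TYPE	10		/* placeholder; read from sysfs */
#define RB_SIZE		(2 * 4096)	/* header page + one data page */

int main(void)
{
	struct perf_event_attr attr;

	for (;;) {
		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = TEST_PMU_TYPE;

		int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
		if (fd < 0)
			continue;	/* PMU currently unregistered */

		void *p = mmap(NULL, RB_SIZE, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		/* perf_pmu_unregister() may land anywhere around here */
		if (p != MAP_FAILED)
			munmap(p, RB_SIZE);
		close(fd);
	}
}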
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1742,7 +1742,7 @@ static inline bool needs_branch_stack(st
>>
>>  static inline bool has_aux(struct perf_event *event)
>>  {
>> -	return event->pmu->setup_aux;
>> +	return event->pmu && event->pmu->setup_aux;
>>  }
>>
>>  static inline bool has_aux_action(struct perf_event *event)
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -5409,7 +5409,6 @@ static void _free_event(struct perf_even
>>  	security_perf_event_free(event);
>>
>>  	if (event->rb) {
>> -		WARN_ON_ONCE(!event->pmu);
>>  		/*
>>  		 * Can happen when we close an event with re-directed output.
>>  		 *
>> @@ -12023,7 +12022,10 @@ static void __pmu_detach_event(struct pm
>>  	 */
>>  	scoped_guard (mutex, &event->mmap_mutex) {
>>  		WARN_ON_ONCE(pmu->event_unmapped);
>> -		ring_buffer_attach(event, NULL);
>> +		/*
>> +		 * Mostly an empty lock sequence, such that perf_mmap(), which
>> +		 * relies on mmap_mutex, is sure to observe the state change.
>> +		 */
>>  	}
>>
>>  	perf_event_free_bpf_prog(event);
>> @@ -12823,6 +12825,9 @@ perf_event_set_output(struct perf_event
>>  		goto unlock;
>>
>>  	if (output_event) {
>> +		if (output_event->state <= PERF_EVENT_STATE_REVOKED)
>> +			goto unlock;
>> +
>>  		/* get the rb we want to redirect to */
>>  		rb = ring_buffer_get(output_event);
>>  		if (!rb)
>
> I needed this additional diff on top of your change. With it, my test
> survives. The perf_mmap_close() change seems correct; I'm not sure about
> the perf_mmap() one, so I'll inspect the code further.
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6540,7 +6540,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
>  	bool detach_rest = false;
>
>  	/* FIXIES vs perf_pmu_unregister() */
> -	if (event->pmu->event_unmapped)
> +	if (event->pmu && event->pmu->event_unmapped)
>  		event->pmu->event_unmapped(event, vma->vm_mm);
>
>  	/*
> @@ -6873,7 +6873,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
>  	vma->vm_ops = &perf_mmap_vmops;
>
> -	if (!ret && event->pmu->event_mapped)
> +	if (!ret && event->pmu && event->pmu->event_mapped)
>  		event->pmu->event_mapped(event, vma->vm_mm);
>
>  	return ret;
Both of these are incorrect. They just reduce the race window; they don't
actually close it, since event->pmu can still be cleared between the NULL
check and the dereference. Anyway, I could spot a few other races:
1) A race between event creation and perf_pmu_unregister(). Every event
   creation path (perf_event_open(), perf_event_create_kernel_counter()
   and inherit_event()) allocates the event with perf_event_alloc(), which
   adds the event to the pmu->events list. However, the event is still
   immature at that point; for example, event->ctx is still NULL. In the
   meantime, perf_pmu_unregister() finds this event and tries to detach it:
  perf_event_open()                               perf_pmu_unregister()
    event = perf_event_alloc()                      pmu_detach_event(event)
      list_add(&event->pmu_list, &pmu->events);       perf_event_ctx_lock(event)
    /*                                                  perf_event_ctx_lock_nested(ctx)
     * event->ctx is NULL.                                ctx = READ_ONCE(event->ctx); /* event->ctx is NULL */
     */                                                   if (!refcount_inc_not_zero(&ctx->refcount)) { /* Crash */
    perf_install_in_context(ctx, event);
2) A race with perf_event_release_kernel(). perf_event_release_kernel()
   prepares a separate "free_list" of all child events under ctx->mutex
   and event->child_mutex. However, the "free_list" reuses the same
   "event->child_list" member for its entries. OTOH, perf_pmu_unregister()
   ultimately calls __perf_remove_from_context() with DETACH_CHILD, which
   checks whether the event being removed is a child event, and if so,
   detaches the child from its parent using list_del_init(&event->child_list);
   i.e. two code paths doing list_del() on the same list entry (a minimal
   model of that failure mode follows the diagram below):
  perf_event_release_kernel()                                          perf_pmu_unregister()
    /* Move child events to free_list */                                 ...
    list_for_each_entry_safe(child, tmp, &free_list, child_list) {      perf_remove_from_context() /* with DETACH_CHILD */
      ...                                                                  __perf_remove_from_context()
      list_del(&child->child_list);                                          perf_child_detach()
                                                                               list_del_init(&event->child_list);
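To see why that's fatal, here is a minimal userspace model of a double
list_del() on one entry (not kernel code; the list ops and poison values
are simplified from include/linux/list.h). The first delete unlinks the
entry and poisons its pointers; the second one dereferences the poison
and crashes, which is the corruption this race can produce:

/* Minimal model: two paths doing list_del() on the same entry. */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)0x100)	/* simplified */
#define LIST_POISON2 ((struct list_head *)0x122)	/* simplified */

static void list_del(struct list_head *entry)
{
	/* On a second call, entry->next/prev already hold poison
	 * values, so these stores dereference poison and crash. */
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

int main(void)
{
	struct list_head head, child;

	/* circular list: head <-> child */
	head.next = head.prev = &child;
	child.next = child.prev = &head;

	list_del(&child);	/* perf_event_release_kernel() side: ok */
	printf("first list_del() fine, entry now poisoned\n");
	list_del(&child);	/* perf_child_detach() side: crashes */
	return 0;
}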
3) A WARN(), not a race. perf_pmu_unregister() increments event->refcount
   before detaching the event. If perf_pmu_unregister() picks up a child
   event, perf_event_exit_event(), called via perf_pmu_unregister(), will
   try to free it. Since event->refcount would be 2 at that point,
   free_event() will trigger a WARN():
  perf_pmu_unregister()
    event = pmu_get_event()        /* event->refcount => 2 */
    ...
    perf_event_exit_event()
      if (parent_event) {          /* true, because `event` is a child */
        free_event(event);
          if (WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1,
                   "unexpected event refcount: %ld; ptr=%p\n",
                   atomic_long_read(&event->refcount), event))
4) A race with perf_event_set_bpf_prog(). perf_event_set_bpf_prog() might
   be in the process of setting event->prog, whereas perf_pmu_unregister(),
   which internally calls perf_event_free_bpf_prog(), will clear the
   event->prog pointer:
  perf_pmu_unregister()                  perf_event_set_bpf_prog()
    ...                                    perf_event_set_bpf_handler()
    perf_event_free_bpf_prog()               event->prog = prog;
      event->prog = NULL;
I've yet to inspect other code paths, so there might be more races.

Thinking out loud, a plausible brute-force solution is to introduce an
event-specific lock, acquire it right at the beginning of all these code
paths, and release it at the end (a rough model of the idea is sketched
below). Such an event->lock shouldn't create any real contention, since
an event mostly goes through only one code path at any point in time.
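As a standalone illustration of the idea (plain pthreads, every name
invented; obviously not how it would look in the kernel), the model
below closes the race-1 window: the event is "published" while the lock
is already held, so the detach side can never observe it
half-initialized:

/* Userspace model of the proposed per-event lock. The creation path
 * publishes the event and installs its ctx under the lock; the detach
 * path first finds the published event, then takes the same lock, so
 * it always sees a fully initialized event. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct event {
	pthread_mutex_t lock;	/* models the proposed event->lock */
	int ctx_installed;	/* models event->ctx != NULL */
};

static struct event ev = { .lock = PTHREAD_MUTEX_INITIALIZER };
static int published;		/* models the event being on pmu->events */

static void *creation_path(void *arg)	/* models perf_event_open() */
{
	(void)arg;
	pthread_mutex_lock(&ev.lock);
	__atomic_store_n(&published, 1, __ATOMIC_RELEASE); /* list_add() */
	ev.ctx_installed = 1;	/* ctx installed before the lock drops */
	pthread_mutex_unlock(&ev.lock);
	return NULL;
}

static void *detach_path(void *arg)	/* models perf_pmu_unregister() */
{
	(void)arg;
	while (!__atomic_load_n(&published, __ATOMIC_ACQUIRE))
		;	/* models pmu_get_event() scanning pmu->events */
	pthread_mutex_lock(&ev.lock);
	assert(ev.ctx_installed);	/* the race-1 window is gone */
	pthread_mutex_unlock(&ev.lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, creation_path, NULL);
	pthread_create(&b, NULL, detach_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("detach always saw a mature event");
	return 0;
}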
Thanks,
Ravi