linux-kernel - Re: [PATCH 1/2] tracing: Simplify and fix "buffered event" synchronization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bb6c9771-1a7c-4367-bb14-63524d9c560a@suse.com>
Date:   Wed, 29 Nov 2023 10:22:23 +0100
From:   Petr Pavlu <petr.pavlu@...e.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     mhiramat@...nel.org, mathieu.desnoyers@...icios.com,
        zhengyejian1@...wei.com, linux-trace-kernel@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] tracing: Simplify and fix "buffered event"
 synchronization

On 11/28/23 16:27, Steven Rostedt wrote:
> On Tue, 28 Nov 2023 16:05:00 +0100
> Petr Pavlu <petr.pavlu@...e.com> wrote:
> 
>> On 11/27/23 18:41, Steven Rostedt wrote:
>>> On Mon, 27 Nov 2023 16:12:47 +0100
>>> Petr Pavlu <petr.pavlu@...e.com> wrote:
>>>   
>>>> The following warning appears when using buffered events:
>>>> [  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420  
>>>
>>> Hmm, I don't have a waring on line 3912, do you have extra code (debugging)
>>> in your version?  
>>
>> The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
>> code and RB_WARN_ON():
> 
> Interesting.
> 
>>   3895  void ring_buffer_discard_commit(struct trace_buffer *buffer,
>>   3896                                  struct ring_buffer_event *event)
>>   3897  {
>> [...]
>>   3908           * This must only be called if the event has not been
>>   3909           * committed yet. Thus we can assume that preemption
>>   3910           * is still disabled.
>>   3911           */
>>   3912          RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
>>   3913  
>>   3914          rb_decrement_entry(cpu_buffer, event);
>>   3915          if (rb_try_to_discard(cpu_buffer, event))
>>   3916                  goto out;
>>
>> https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912
>>
>>>> [...]
>>>> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
> 
> I was looking at the above 6.7.0-rc2-default, which made me somewhat confused.
> 
> 
>>>> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>>>> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>>>> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 
> 
>>>> * Function trace_buffered_event_enable() initializes memory of each
>>>>   allocated trace_buffered_event to zero but no memory barrier is
>>>>   present to ensure this operation is completed before publishing its
>>>>   pointer for use in trace_event_buffer_lock_reserve().  
>>>
>>> Yes, a wmb() may be helpful, but as the reader will see either the
>>> allocated page or NULL, and can handle both cases. I decided not to add it
>>> because it really doesn't matter if we do.  
>>
>> My concern is about the earlier call "memset(event, 0, sizeof(*event));"
>> in trace_buffered_event_enable() which writes the obtained event page
>> prior to publishing it via "per_cpu(trace_buffered_event, cpu) = event;".
>> I think another CPU running trace_event_buffer_lock_reserve() could see
>> already the valid event pointer, write to the event page and only then
>> the memory could see the memset() write coming from
>> trace_buffered_event_enable(). The event data would then become
>> corrupted.
> 
> OK, I'll buy that. Feel free to send an patch that adds that wmb with a
> comment explaining the above.

Sure, I can do that. A nice thing about my proposed patch is that it
requires only one smp_wmb() in this case, while the current code will
require adding this barrier in the allocation loop prior to writing
trace_buffered_event.

>>>   
>>>> * Calling function trace_buffered_event_enable() normally requires
>>>>   pairing it with trace_buffered_event_disable(). However, the function
>>>>   has no error return code and in case of a failure decrements back
>>>>   trace_buffered_event_ref. This results in underflow of the counter
>>>>   when trace_buffered_event_disable() gets called later.  
>>>
>>> Not an issue.
>>>
>>> Even on failure the ref count will be greater than zero, where 
>>> trace_buffered_event_disable() handles it properly, and the freeing can
>>> handle no pages being allocated.
>>>
>>> That is:
>>>
>>> 	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>>>
>>> would be the same as: free_page((unsigned long)NULL);
>>>
>>> Which is perfectly fine to do.  
>>
>> The situation that I have on my mind is the following:
>> * The counter trace_buffered_event_ref is at 0.
>> * The soft mode gets enabled for some event and
>>   trace_buffered_event_enable() is called. The function increments
>>   trace_buffered_event_ref to 1 and starts allocating event pages.
>> * The allocation fails for some page and trace_buffered_event_disable()
>>   is called for cleanup.
>> * Function trace_buffered_event_disable() decrements
>>   trace_buffered_event_ref back to 0, recognizes that it was the last
>>   use of buffered events and frees all allocated pages.
>> * The control goes back to trace_buffered_event_enable() which returns.
>>   The caller of trace_buffered_event_enable() has no information that
>>   the function actually failed.
>> * Some time later, the soft mode is disabled for the same event.
>>   Function trace_buffered_event_disable() is called. It warns on
>>   "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
>>
>> The end result is only an ugly warning but I think it is still good to
>> fix.
> 
> Ah, I missed the call to trace_buffered_event_disable() in the failure
> path. Actually, since the trace_buffered_event is only an optimization, I
> think the best thing to do is to just return without the clean up,
> expecting that the other caller will do the clean up. We could even not
> even stop the loop!
> 
> That is, just have:
> 
> void trace_buffered_event_enable(void)
> {
> 	struct ring_buffer_event *event;
> 	struct page *page;
> 	int cpu;
> 
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
> 	if (trace_buffered_event_ref++)
> 		return;
> 
> 	for_each_tracing_cpu(cpu) {
> 		page = alloc_pages_node(cpu_to_node(cpu),
> 					GFP_KERNEL | __GFP_NORETRY, 0);
> 
> 		/* This is just an optimization and can handle failures */
> 		if (!page) {
> 			pr_err("Failed to create event buffer\n");
> 			break;
> 		}
> 
> 		event = page_address(page);
> 		memset(event, 0, sizeof(*event));
> 
> 		per_cpu(trace_buffered_event, cpu) = event;
> 
> 		preempt_disable();
> 		if (cpu == smp_processor_id() &&
> 		    __this_cpu_read(trace_buffered_event) !=
> 		    per_cpu(trace_buffered_event, cpu))
> 			WARN_ON_ONCE(1);
> 		preempt_enable();
> 	}
> }

The current behavior is that if the allocation in one call to
trace_buffered_event_enable() fails then another call to the same
function results in a new attempt to allocate the event pages.

The code above changes it. As long trace_buffered_event_ref doesn't go
back to 0 and then 1 again, no new allocation attempt is made. It means
that trace_buffered_event can be left in this "half-initialized" state
for a while.

This is possibly ok, it is an unlikely corner case anyway. My patch
nonetheless preserves the current behavior of retrying the allocation.

> 
>>>
>>> Now, the reason you found the crash was from the first issue you mentioned.
>>> That is, we didn't disable the current CPU and if we migrated, things would
>>> get really messed up. The only fix here is to make sure all CPUs have their
>>> trace_buffered_event_cnt get incremented.  
>>
>> The potential race that I see is the following:
>> * Function trace_buffered_event_disable() is called on CPU 0. It
>>   increments trace_buffered_event_cnt on each CPU and waits via
>>   synchronize_rcu() for each user of trace_buffered_event to complete.
> 
>> * After synchronize_rcu() is already finished, the situation is that all
>>   counters trace_buffered_event_cnt are at 1 and all pointers
>>   trace_buffered_event are still valid.
> 
>> * At this point, on a different CPU 1, the execution reaches
>>   trace_event_buffer_lock_reserve(). The function calls
>>   preempt_disable_notrace() and enters an RCU read-side critical section
>>   but that is not important because all RCU stuff is already performed
>>   and completed in trace_buffered_event_disable(). The function proceeds
>>   and reads a still valid pointer from trace_buffered_event[CPU1] into
>>   the local variable "entry". However, it doesn't yet read
>>   trace_buffered_event_cnt[CPU1] which happens later.
> 
>> * Function trace_buffered_event_disable() continues. It frees
>>   trace_buffered_event[CPU1] and decrements
>>   trace_buffered_event_cnt[CPU1] back to 0.
> 
>> * Function trace_event_buffer_lock_reserve() continues. It reads and
>>   increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
>>   believe that it can use the "entry" that it already obtained but the
>>   pointer is now invalid and any access results in a use-after-free.
> 
> OK, so you are saying we need another synchronize_rcu() after we set the
> trace_buffered_event to NULL?
> 
> 
> void trace_buffered_event_disable(void)
> {
> 	int cpu;
> 
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
> 	if (WARN_ON_ONCE(!trace_buffered_event_ref))
> 		return;
> 
> 	if (--trace_buffered_event_ref)
> 		return;
> 
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> 
> 	/* Wait for all current users to finish */
> 	synchronize_rcu();
> 
> 	for_each_tracing_cpu(cpu) {
> 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 		per_cpu(trace_buffered_event, cpu) = NULL;
> 	}
> 	/*
> 	 * Make sure trace_buffered_event is NULL before clearing
> 	 * trace_buffered_event_cnt.
> 	 */
> -	smp_wmb();
> +	synchronize_rcu();
> 
> 	preempt_disable();
> 	/* Do the work on each cpu */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       enable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> }
> 
> I agree with that.

Yes, I believe this should address this potential race condition.

An alternative would be instead to update
trace_event_buffer_lock_reserve() as follows:

	if (this_cpu_inc_return(trace_buffered_event_cnt) == 1) {
		smp_rmb();
		if ((entry = __this_cpu_read(trace_buffered_event))) {
			[...]

That saves the synchronize_rcu() call but additionally modifies
trace_buffered_event_cnt even if trace_buffered_event is currently NULL.

Another alternative is the approach taken by my patch which avoids more
RCU work and unnecessary memory modifications.

I'd be interested if you could have a look again at what I'm proposing
in my patch. I think it simplifies the code while addressing these
problems as well. However, if you have reservations about that approach
then it is ok, I can fix the found problems individually as discussed.

Thanks,
Petr