[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2b920bab-23a2-4a8d-95c2-b69472d38373@suse.com>
Date: Mon, 27 May 2024 11:36:55 +0200
From: Petr Pavlu <petr.pavlu@...e.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-rt-users <linux-rt-users@...r.kernel.org>
Subject: Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize
checks
On 5/20/24 15:50, Steven Rostedt wrote:
> On Fri, 17 May 2024 15:40:08 +0200
> Petr Pavlu <petr.pavlu@...e.com> wrote:
>
>> The reader code in rb_get_reader_page() swaps a new reader page into the
>> ring buffer by doing cmpxchg on old->list.prev->next to point it to the
>> new page. Following that, if the operation is successful,
>> old->list.next->prev gets updated too. This means the underlying
>> doubly-linked list is temporarily inconsistent, page->prev->next or
>> page->next->prev might not be equal back to page for some page in the
>> ring buffer.
>>
>> The resize operation in ring_buffer_resize() can be invoked in parallel.
>> It calls rb_check_pages() which can detect the described inconsistency
>> and stop further tracing:
>>
>> [ 190.271762] ------------[ cut here ]------------
>> [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0
>> [ 190.271789] Modules linked in: [...]
>> [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1
>> [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
>> [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
>> [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
>> [ 190.272023] Code: [...]
>> [ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206
>> [ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80
>> [ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700
>> [ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000
>> [ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720
>> [ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000
>> [ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000
>> [ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0
>> [ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 190.272077] Call Trace:
>> [ 190.272098] <TASK>
>> [ 190.272189] ring_buffer_resize+0x2ab/0x460
>> [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0
>> [ 190.272206] tracing_resize_ring_buffer+0x65/0x90
>> [ 190.272216] tracing_entries_write+0x74/0xc0
>> [ 190.272225] vfs_write+0xf5/0x420
>> [ 190.272248] ksys_write+0x67/0xe0
>> [ 190.272256] do_syscall_64+0x82/0x170
>> [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [ 190.272373] RIP: 0033:0x7f1bd657d263
>> [ 190.272381] Code: [...]
>> [ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263
>> [ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001
>> [ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000
>> [ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500
>> [ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002
>> [ 190.272412] </TASK>
>> [ 190.272414] ---[ end trace 0000000000000000 ]---
>>
>> Note that ring_buffer_resize() calls rb_check_pages() only if the parent
>> trace_buffer has recording disabled. Recent commit d78ab792705c
>> ("tracing: Stop current tracer when resizing buffer") causes that it is
>> now always the case which makes it more likely to experience this issue.
>>
>> The window to hit this race is nonetheless very small. To help
>> reproducing it, one can add a delay loop in rb_get_reader_page():
>>
>> ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
>> if (!ret)
>> goto spin;
>> for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */
>> __asm__ __volatile__ ("" : : : "memory");
>> rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
>>
>> .. and then run the following commands on the target system:
>>
>> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>> while true; do
>> echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>> echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>> done &
>> while true; do
>> for i in /sys/kernel/tracing/per_cpu/*; do
>> timeout 0.1 cat $i/trace_pipe; sleep 0.2
>> done
>> done
>>
>> To fix the problem, make sure ring_buffer_resize() doesn't invoke
>> rb_check_pages() concurrently with a reader operating on the same
>> ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.
>
> Definitely a bug. Thanks for catching it. But...
>
>>
>> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read")
>> Signed-off-by: Petr Pavlu <petr.pavlu@...e.com>
>> ---
>> kernel/trace/ring_buffer.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index 0ae569eae55a..967655591719 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -1449,6 +1449,11 @@ static void rb_check_bpage(struct ring_buffer_per_cpu *cpu_buffer,
>> *
>> * As a safety measure we check to make sure the data pages have not
>> * been corrupted.
>> + *
>> + * Callers of this function need to guarantee that the list of pages doesn't get
>> + * modified during the check. In particular, if it's possible that the function
>> + * is invoked with concurrent readers which can swap in a new reader page then
>> + * the caller should take cpu_buffer->reader_lock.
>> */
>> static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
>> {
>> @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
>> */
>> synchronize_rcu();
>> for_each_buffer_cpu(buffer, cpu) {
>> + unsigned long flags;
>> +
>> cpu_buffer = buffer->buffers[cpu];
>> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
>> rb_check_pages(cpu_buffer);
>> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock,
>> + flags);
>
> Putting my RT hat on, I really don't like the above fix. The
> rb_check_pages() iterates all subbuffers which makes the time interrupts
> are disabled non-deterministic.
I see, this applies also to the same rb_check_pages() validation invoked
from ring_buffer_read_finish().
>
> Instead, I would rather have something where we disable readers while we do
> the check, and re-enable them.
>
> raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> cpu_buffer->read_disabled++;
> raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>
> // Also, don't put flags on a new line. We are allow to go 100 characters now.
Noted.
>
>
> rb_check_pages(cpu_buffer);
> raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> cpu_buffer->read_disabled--;
> raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>
> Or something like that. Yes, that also requires creating a new
> "read_disabled" field in the ring_buffer_per_cpu code.
I think this would work but I'm personally not immediately sold on this
approach. If I understand the idea correctly, readers should then check
whether cpu_buffer->read_disabled is set and bail out with some error if
that is the case. The rb_check_pages() function is only a self-check
code and as such, I feel it doesn't justify disrupting readers with
a new error condition and adding more complex locking.
I've been considering how to approach this RT issue differently. One
obvious approach would be to drop the rb_check_pages() validation but
that is probably not desirable.
Another option could be to make the check less thorough and walk only
a part of the list which is bounded by some constant, typically one
would want to check the part where some change was just made. In case of
a smaller list, it should be still possible to traverse it completely.
This is an idea that I'm currently looking at.
>
> That said, I'm going to accept these patches as is (moving flags onto the
> same line). But would like the above code for the next merge window as it
> would then not affect RT.
>
> I'll accept these patches because it does fix the bug now.
Thanks,
Petr
Powered by blists - more mailing lists