Message-ID: <bbb834f13cb1480b2517497481a1b3e0137d089a.camel@redhat.com>
Date: Wed, 14 Jan 2026 13:49:17 -0600
From: Crystal Wood <crwood@...hat.com>
To: Tomas Glozar <tglozar@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, John Kacur
<jkacur@...hat.com>, Luis Goncalves <lgoncalv@...hat.com>, LKML
<linux-kernel@...r.kernel.org>, Linux Trace Kernel
<linux-trace-kernel@...r.kernel.org>
Subject: Re: [PATCH] tracing/osnoise: Fix OSN_WORKLOAD-related crash

On Wed, 2026-01-14 at 13:35 +0100, Tomas Glozar wrote:
> A kernel panic was observed in the timerlat tracer with the following
> reproducer:
>
> #!/bin/bash
> while true; do
>     rtla timerlat hist -u -d 5s & PID=$!
>     sleep 2
>     echo OSNOISE_WORKLOAD > /sys/kernel/tracing/osnoise/options
>     rtla timerlat hist -k -d 1s
> done
>
> The kernel first displays several WARN traces with the following pattern:
>
> WARNING: CPU: 1 PID: 1822 at kernel/trace/trace_osnoise.c:1959 stop_kthread+0xb7/0xc0

The line number doesn't match up for me; is this the first or second
WARN_ON in that function?

> and finally a null pointer reference BUG:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000030
> ...
> CPU: 1 UID: 0 PID: 2155 Comm: timerlatu/1
> ...
> Call Trace:
> ...
> ? timerlat_fd_read+0xf2/0x370
> ? timerlat_fd_read+0xee/0x370
> vfs_read+0xe8/0x370
> ksys_read+0x6d/0xf0
> do_syscall_64+0x7d/0x160
> ...
> entry_SYSCALL_64_after_hwframe+0x76/0x7e

What's the actual fault location? And those ? lines in the call trace
are "considered to be additional clues" rather than actual unwound
frames; what was in the ... above them?

>  static int osnoise_options_open(struct inode *inode, struct file *file)
>  {
>  	return seq_open(file, &osnoise_options_seq_ops);
> @@ -2229,6 +2254,10 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
>  	if (option < 0)
>  		return -EINVAL;
>
> +	retval = osnoise_validate_option(option, enable);
> +	if (retval != 0)
> +		return retval;
> +
>  	/*
>  	 * trace_types_lock is taken to avoid concurrency on start/stop.
>  	 */

Shouldn't this be done under interface_lock to avoid concurrent
timerlat_fd_open()? FWIW, your test script doesn't appear to cover the
case of option setting racing with timerlat starting (due to the 2
second delay).

Of course, this is complicated by stop_per_cpu_kthreads() happening
before interface_lock is acquired. Do we know why that happens outside
the lock? That might even be the actual cause of this bug.

Though even in the non-race case, we might still want to return -EBUSY
rather than just killing the thread (which might still have races since
we don't wait for the user thread to die).
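
To make the -EBUSY idea concrete, here is a rough user-space model of it,
not kernel code and not the patch under review. Everything in it is
hypothetical (struct osn_model, model_fd_open(), model_fd_release(),
model_set_workload()); a pthread mutex stands in for interface_lock. The
only point it illustrates is that the option check and the fd-open check
take the same lock, so neither can slip in between the other's test and
update:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct osn_model {
	pthread_mutex_t interface_lock;	/* stands in for the kernel's interface_lock */
	bool workload_enabled;		/* models the OSNOISE_WORKLOAD option bit */
	int user_threads;		/* user-space timerlat threads currently attached */
};

/* Assumed starting state: workload option set, no user threads attached. */
#define OSN_MODEL_INIT { PTHREAD_MUTEX_INITIALIZER, true, 0 }

/* Models a user thread opening the timerlat fd; refused while the workload option is set. */
int model_fd_open(struct osn_model *m)
{
	int ret = 0;

	assert(m != NULL);
	pthread_mutex_lock(&m->interface_lock);
	if (m->workload_enabled)
		ret = -EINVAL;
	else
		m->user_threads++;
	pthread_mutex_unlock(&m->interface_lock);
	return ret;
}

/* Models the user thread closing the fd. */
void model_fd_release(struct osn_model *m)
{
	assert(m != NULL);
	pthread_mutex_lock(&m->interface_lock);
	if (m->user_threads > 0)
		m->user_threads--;
	pthread_mutex_unlock(&m->interface_lock);
}

/*
 * Models writing OSNOISE_WORKLOAD / NO_OSNOISE_WORKLOAD to the options file.
 * Enabling the workload while a user thread is attached is rejected with
 * -EBUSY instead of tearing the thread down.
 */
int model_set_workload(struct osn_model *m, bool enable)
{
	int ret = 0;

	assert(m != NULL);
	pthread_mutex_lock(&m->interface_lock);
	if (enable && m->user_threads > 0)
		ret = -EBUSY;
	else
		m->workload_enabled = enable;
	pthread_mutex_unlock(&m->interface_lock);
	return ret;
}
```

In this model the writer has to retry after the user thread closes its fd.
It says nothing about where stop_per_cpu_kthreads() should sit relative to
the lock, which is still the open question above.
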
-Crystal