linux-kernel - Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd


Message-ID: <20240823145211.34ccda61@gandalf.local.home>
Date: Fri, 23 Aug 2024 14:52:11 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: tglozar@...hat.com
Cc: linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
 jkacur@...hat.com, "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in
 timerlat_fd_release

On Fri, 23 Aug 2024 12:54:26 -0400
Steven Rostedt <rostedt@...dmis.org> wrote:

> > $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
> >  kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done  
> 
> The "kill -INT $PID" caused the write to osnoise_workload_start(), and then
> after 1ms you do the "kill -TERM $PID" that kills the process, which closes
> the file descriptor right after the reset.
> 
> The real fix here looks to be:
> 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 66a871553d4a..400a72cd6ab5 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -265,6 +265,8 @@ static inline void tlat_var_reset(void)
>  	 */
>  	for_each_cpu(cpu, cpu_online_mask) {
>  		tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
> +		if (tlat_var->kthread)
> +			hrtimer_cancel(&tlat_var->timer);
>  		memset(tlat_var, 0, sizeof(*tlat_var));
>  	}
>  }
> @@ -2579,7 +2581,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
>  	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
>  	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
>  
> -	hrtimer_cancel(&tlat_var->timer);
> +	if (tlat_var->kthread)
> +		hrtimer_cancel(&tlat_var->timer);
>  	memset(tlat_var, 0, sizeof(*tlat_var));
>  
>  	osn_var->sampling = 0;
> 
> I'll make this into a real patch and send it out.

Egad, I don't think this is even good enough. I noticed this in the trace
(adding kthread to the memset trace_printk):

           <...>-916     [003] .....   134.227044: osnoise_workload_start: memset ffff88823c435b28 for 0000000000000000
           <...>-916     [003] .....   134.227046: osnoise_workload_start: memset ffff88823c4b5b28 for 0000000000000000
           <...>-916     [003] .....   134.227048: osnoise_workload_start: memset ffff88823c535b28 for 0000000000000000
           <...>-916     [003] .....   134.227049: osnoise_workload_start: memset ffff88823c5b5b28 for 0000000000000000
           <...>-916     [003] .....   134.227051: osnoise_workload_start: memset ffff88823c635b28 for 0000000000000000
           <...>-916     [003] .....   134.227052: osnoise_workload_start: memset ffff88823c6b5b28 for 0000000000000000
           <...>-916     [003] .....   134.227054: osnoise_workload_start: memset ffff88823c735b28 for ffff888108205640
           <...>-916     [003] .....   134.227055: osnoise_workload_start: memset ffff88823c7b5b28 for 0000000000000000

Before the reset, all but one of the tlat->kthread pointers are NULL. Then it
dawned on me that this is a global per CPU variable. It gets initialized when
the tracer starts. If another program has the timerlat fd open when the
tracer ends, the tracer starts again, and you close the fd, it will cancel
the hrtimer for the new task.
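
The stale-fd scenario above can be sketched in plain userspace C. This is
only a toy model, not kernel code: the names model_tlat_var, model_slot and
the model_* functions are hypothetical, a bool stands in for the hrtimer, an
int pid stands in for tlat_var->kthread, and opening the fd is assumed to arm
the timer for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for one CPU's timerlat state. */
struct model_tlat_var {
	int  owner_pid;    /* stands in for tlat_var->kthread (0 == NULL) */
	bool timer_armed;  /* stands in for a queued hrtimer */
};

/* One global slot, like a per CPU variable: every fd on this CPU shares it. */
static struct model_tlat_var model_slot;

/* Tracer (re)start: wipe the slot, as the reset in the quoted diff does. */
static void model_tracer_start(void)
{
	memset(&model_slot, 0, sizeof(model_slot));
}

/* A workload task takes over the slot (assumption: this also arms the timer). */
static void model_fd_open(int pid)
{
	assert(pid > 0);   /* 0 is reserved for "no owner" */
	model_slot.owner_pid = pid;
	model_slot.timer_armed = true;
}

/* Release with the kthread check: it cannot tell whose timer it cancels. */
static void model_fd_release(void)
{
	if (model_slot.owner_pid)
		model_slot.timer_armed = false;   /* hrtimer_cancel() stand-in */
	memset(&model_slot, 0, sizeof(model_slot));
}

/* Returns true only if the new task's timer survives the stale close. */
static bool model_stale_close_scenario(void)
{
	model_tracer_start();
	model_fd_open(100);     /* old program opens the fd */
	model_tracer_start();   /* tracer ends and starts again, fd still open */
	model_fd_open(200);     /* new task takes the slot */
	model_fd_release();     /* old program closes its stale fd */
	return model_slot.timer_armed && model_slot.owner_pid == 200;
}
```

In this model model_stale_close_scenario() returns false: the close of the
old fd wipes the state that now belongs to the new task, and the non-NULL
kthread check does not help because the slot is occupied, just by someone
else.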

I think there needs to be some ref counting here that keeps the tracer
from starting again while there are still files open.
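
A minimal sketch of that ref counting idea, again as a userspace toy and not
a proposed patch. All names here (refmodel_open_fds, refmodel_fd_get,
refmodel_fd_put, refmodel_tracer_try_start) are hypothetical, and the sketch
assumes the caller serializes these calls; real kernel code would need the
check and the tracer start to happen under the same lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical count of timerlat fds that are still open. */
static int refmodel_open_fds;

/* Called from the fd open path. */
static void refmodel_fd_get(void)
{
	refmodel_open_fds++;
}

/* Called from the fd release path. */
static void refmodel_fd_put(void)
{
	assert(refmodel_open_fds > 0);   /* a put without a get is a bug */
	refmodel_open_fds--;
}

/*
 * Called before the tracer starts: refuse while any fd from an earlier
 * run is still open, so a late close cannot touch the new run's state.
 */
static bool refmodel_tracer_try_start(void)
{
	return refmodel_open_fds == 0;
}
```

With this shape, a tracer restart attempted between refmodel_fd_get() and
the matching refmodel_fd_put() is rejected, which is the property the
paragraph above asks for. Whether the start should fail or wait in that
case is left open here.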

This looks to be a bigger problem than I have time to work on for now.
I'll just apply the mutex patch for the kthreads, but this bug is going to
take a bit more work to solve.

-- Steve