[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP4=nvQnW5vS5CQBZtKp-BdjYxNFbq26P36uRy3RhCenHEG_YA@mail.gmail.com>
Date: Mon, 26 Aug 2024 15:01:24 +0200
From: Tomas Glozar <tglozar@...hat.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
jkacur@...hat.com, "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release
pá 23. 8. 2024 v 20:51 odesílatel Steven Rostedt <rostedt@...dmis.org> napsal:
>
> Egad, I don't think this is even good enough. I noticed this in the trace
> (adding kthread to the memset trace_printk):
>
> <...>-916 [003] ..... 134.227044: osnoise_workload_start: memset ffff88823c435b28 for 0000000000000000
> <...>-916 [003] ..... 134.227046: osnoise_workload_start: memset ffff88823c4b5b28 for 0000000000000000
> <...>-916 [003] ..... 134.227048: osnoise_workload_start: memset ffff88823c535b28 for 0000000000000000
> <...>-916 [003] ..... 134.227049: osnoise_workload_start: memset ffff88823c5b5b28 for 0000000000000000
> <...>-916 [003] ..... 134.227051: osnoise_workload_start: memset ffff88823c635b28 for 0000000000000000
> <...>-916 [003] ..... 134.227052: osnoise_workload_start: memset ffff88823c6b5b28 for 0000000000000000
> <...>-916 [003] ..... 134.227054: osnoise_workload_start: memset ffff88823c735b28 for ffff888108205640
> <...>-916 [003] ..... 134.227055: osnoise_workload_start: memset ffff88823c7b5b28 for 0000000000000000
>
> Before the reset, all but one of the tlat->kthread is NULL. Then it dawned
> on me that this is a global per CPU variable. It gets initialized when the
> tracer starts. If another program is has the timerlat fd open when the
> tracer ends, the tracer starts again, and you close the fd, it will cancel
> the hrtimer for the new task.
>
> I think there needs to be some ref counting here, that keeps the tracer
> from starting again if there's still files opened.
>
The timerlat fd is not supposed to be open when the tracer ends/starts
again, since osnoise_workload_stop() calls stop_kthread(), which in
turn calls kill_pid() to SIGKILL the user workload, which will also
close the file descriptor. Only one PID per CPU should have the
timerlat fd open at one moment, since timerlat_fd_open() will refuse
to open if tlat->pid is set. It appears that this is somehow bypassed
and osnoise_workload_start() happens before closing the fd, not sure
why.
> This looks to be a bigger problem than I have time to work on it for now.
> I'll just apply the mutex patch for the kthreads, but this bug is going to
> take a bit more work in solving.
>
Yeah, unfortunately the issue looks more complicated now after looking
at the traces you posted, I will probably have to do more tracing to
see what is actually happening here. Thank you again for helping us
with this and also for the patch for the mutex.
Tomas
Powered by blists - more mailing lists