[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240826132620.1618d201@gandalf.local.home>
Date: Mon, 26 Aug 2024 13:26:20 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Tomas Glozar <tglozar@...hat.com>
Cc: linux-trace-kernel@...r.kernel.org, linux-kernel@...r.kernel.org,
jkacur@...hat.com, "Luis Claudio R. Goncalves" <lgoncalv@...hat.com>
Subject: Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in
timerlat_fd_release
On Mon, 26 Aug 2024 15:01:24 +0200
Tomas Glozar <tglozar@...hat.com> wrote:
> > Before the reset, all but one of the tlat->kthread is NULL. Then it dawned
> > on me that this is a global per CPU variable. It gets initialized when the
> > tracer starts. If another program is has the timerlat fd open when the
> > tracer ends, the tracer starts again, and you close the fd, it will cancel
> > the hrtimer for the new task.
> >
> > I think there needs to be some ref counting here, that keeps the tracer
> > from starting again if there's still files opened.
> >
>
> The timerlat fd is not supposed to be open when the tracer ends/starts
> again, since osnoise_workload_stop() calls stop_kthread(), which in
> turn calls kill_pid() to SIGKILL the user workload, which will also
> close the file descriptor. Only one PID per CPU should have the
> timerlat fd open at one moment, since timerlat_fd_open() will refuse
> to open if tlat->pid is set. It appears that this is somehow bypassed
> and osnoise_workload_start() happens before closing the fd, not sure
> why.
>
> > This looks to be a bigger problem than I have time to work on it for now.
> > I'll just apply the mutex patch for the kthreads, but this bug is going to
> > take a bit more work in solving.
> >
>
> Yeah, unfortunately the issue looks more complicated now after looking
> at the traces you posted, I will probably have to do more tracing to
> see what is actually happening here. Thank you again for helping us
> with this and also for the patch for the mutex.
Yeah, I think I finally found the real issue. I don't think we need the ref
counting. The problem is the creating and killing of the threads via the
start and stop callbacks. That's not their purpose. The purpose of stop
and start callbacks is when tracing_on is set to off and back on again. I
think this is what is racing with the close.
Anyway, the start and stop should probably just pause the threads and not
kill them an start them again. That is, the osnoise_workload_start() should
be called by the init callbacks and the osnoise_workload_stop should be
called by reset callback.
The start and stop callbacks should just pause and restart the the threads.
-- Steve
Powered by blists - more mailing lists