[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251113151729.4Zky6d-t@linutronix.de>
Date: Thu, 13 Nov 2025 16:17:29 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Yongliang Gao <leonylgao@...il.com>, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, frankjpliu@...cent.com,
Yongliang Gao <leonylgao@...cent.com>,
Huang Cun <cunhuang@...cent.com>
Subject: Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention
On 2025-11-13 10:05:24 [-0500], Steven Rostedt wrote:
> This means that the chunks are not being freed and we can't be doing
> synchronize_rcu() in every exit.
You don't have to, you can do call_rcu().
> > Additionally it would guarantee that the buffer is not released in
> > trace_pid_list_free(). I don't see how the seqcount ensures that the
> > buffer is not gone. I mean you could have a reader and the retry would
> > force you to do another loop but before that happens you dereference the
> > upper_chunk pointer which could be reused.
>
> This protection has nothing to do with trace_pid_list_free(). In fact,
> you'll notice that function doesn't even have any locking. That's because
> the pid_list itself is removed from view and RCU synchronization happens
> before that function is called.
>
> The protection in trace_pid_list_is_set() is only to synchronize with the
> adding and removing of the bits in the updates in exit and fork as well as
> with the user manually writing into the set_*_pid files.
So if the kfree() is not an issue, it is just the use of the block from
the freelist which must not point to a wrong item? And therefore the
seqcount?
> > So I *think* the RCU approach should be doable and cover this.
>
> Where would you put the synchronize_rcu()? In do_exit()?
simply call_rcu() and let it move to the freelist.
> Also understanding what this is used for helps in understanding the scope
> of protection needed.
>
> The pid_list is created when you add anything into one of the pid files in
> tracefs. Let's use /sys/kernel/tracing/set_ftrace_pid:
>
> # cd /sys/kernel/tracing
> # echo $$ > set_ftrace_pid
> # echo 1 > options/function-fork
> # cat set_ftrace_pid
> 2716
> 2936
> # cat set_ftrace_pid
> 2716
> 2945
>
> What the above did was to create a pid_list for the function tracer. I
> added the bash process pid using $$ (2716). Then when I cat the file, it
> showed the pid for the bash process as well as the pid for the cat process,
> as the cat process is a child of the bash process. The function-fork option
> means to add any child process to the set_ftrace_pid if the parent is
> already in the list. It also means to remove the pid if a process in the
> list exits.
This adding/ add-on-fork, removing and remove-on-exit is the only write
side?
> When I enable function tracing, it will only trace the bash process and any
> of its children:
>
> # echo 0 > tracing_on
> # echo function > current_tracer
> # cat set_ftrace_pid ; echo 0 > tracing_on
> 2716
> 2989
> # cat trace
> [..]
> bash-2716 [003] ..... 36854.662833: rcu_read_lock_held <-mtree_range_walk
> bash-2716 [003] ..... 36854.662834: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
> bash-2716 [003] ..... 36854.662834: rcu_read_lock_held <-vma_start_read
> ##### CPU 6 buffer started ####
> cat-2989 [006] d..2. 36854.662834: ret_from_fork <-ret_from_fork_asm
> bash-2716 [003] ..... 36854.662835: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
> cat-2989 [006] d..2. 36854.662836: schedule_tail <-ret_from_fork
> bash-2716 [003] ..... 36854.662836: __rcu_read_unlock <-lock_vma_under_rcu
> cat-2989 [006] d..2. 36854.662836: finish_task_switch.isra.0 <-schedule_tail
> bash-2716 [003] ..... 36854.662836: handle_mm_fault <-do_user_addr_fault
> [..]
>
> It would be way too expensive to check the pid_list at *every* function
> call. But luckily we don't have to. Instead, we set a per-cpu flag in the
> instance trace_array on sched_switch if the next pid is in the pid_list and
> clear it if it is not. (See ftrace_filter_pid_sched_switch_probe()).
>
> This means, the bit being checked in the pid_list is always for a task that
> is about to run.
>
> The bit being cleared, is always for that task that is exiting (except for
> the case of manual updates).
>
> What we are protecting against is when one chunk is freed, but then
> allocated again for a different set of PIDs. Where the reader has the chunk,
> it was freed and re-allocated and the bit that is about to be checked
> doesn't represent the bit it is checking for.
This I assumed.
And the kfree() at the end can not happen while there is still a reader?
…
> And if the "lower" bit matches the set_bit from CPU2, we have a false
> positive. Although, this race is highly unlikely, we should still protect
> against it (it could happen on a VM vCPU that was preempted in
> trace_pid_list_is_set()).
>
> -- Steve
Sebastian
Powered by blists - more mailing lists