Message-ID: <20251113100524.5c5f6bdc@gandalf.local.home>
Date: Thu, 13 Nov 2025 10:05:24 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Yongliang Gao <leonylgao@...il.com>, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, frankjpliu@...cent.com, Yongliang Gao
<leonylgao@...cent.com>, Huang Cun <cunhuang@...cent.com>
Subject: Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention
On Thu, 13 Nov 2025 15:15:15 +0100
Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:
> I don't see why RCU work shouldn't work here.
> If a pid is removed, it might result in a chunk being cleared/ removed,
> and then upper_chunk/ lower_chunk can become NULL. The buffer itself can
> be reused and point to something else. It could lead to a false outcome
> in test_bit(). This is handled by read_seqcount_retry().
>
> You could assign upper1, upper2, to NULL/ buffer as now and the removal
> (in put_lower_chunk(), put_upper_chunk()) delay to RCU so it happens
> after the grace period. That would allow you to iterate over it in
> trace_pid_list_is_set() lockless since the buffer will not disappear and
> will not be reused for another task until after all reader left.
The chunks are reused. When they are cleared, they go to a free-list, and
when a new chunk is needed, it is taken from the free-list so that we do
not need to allocate.
When all the bits in a chunk are cleared, it moves to the free-list. This
can happen in exit, when a task exits and clears its pid bit from the
pid_list.
Then a chunk can be allocated in fork, when a task whose pid is in the
pid_list adds its child to the list as well.
This means that the chunks are not being freed and we can't be doing
synchronize_rcu() in every exit.
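Roughly, the recycling looks like this (a simplified sketch modeled on
kernel/trace/pid_list.c; the free-chunk counters and the irq_work that
refills the free lists are left out):

static union lower_chunk *get_lower_chunk(struct trace_pid_list *pid_list)
{
        union lower_chunk *chunk;

        lockdep_assert_held(&pid_list->lock);

        if (!pid_list->lower_list)
                return NULL;

        chunk = pid_list->lower_list;
        pid_list->lower_list = chunk->next;
        /*
         * Clearing ->next also zeroes data[0] (it is a union), and the
         * other words were already zero when the chunk was recycled,
         * so a reused chunk starts out with no bits set.
         */
        chunk->next = NULL;
        return chunk;
}

static void put_lower_chunk(struct trace_pid_list *pid_list,
                            union lower_chunk *chunk)
{
        lockdep_assert_held(&pid_list->lock);

        /* Called only when the chunk has no bits set: push it for reuse. */
        chunk->next = pid_list->lower_list;
        pid_list->lower_list = chunk;
}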
>
> Additionally it would guarantee that the buffer is not released in
> trace_pid_list_free(). I don't see how the seqcount ensures that the
> buffer is not gone. I mean you could have a reader and the retry would
> force you to do another loop but before that happens you dereference the
> upper_chunk pointer which could be reused.
This protection has nothing to do with trace_pid_list_free(). In fact,
you'll notice that function doesn't even have any locking. That's because
the pid_list itself is removed from view and RCU synchronization happens
before that function is called.
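The teardown of the pid_list itself is handled by the caller with the
usual publish/synchronize pattern. A simplified sketch (modeled on
clear_ftrace_pids() in kernel/trace/ftrace.c; the function name here is
illustrative, and the "no pid" list and the hook teardown are omitted):

static void clear_pids(struct trace_array *tr)
{
        struct trace_pid_list *pid_list;

        pid_list = rcu_dereference_protected(tr->function_pids,
                                             lockdep_is_held(&ftrace_lock));
        if (!pid_list)
                return;

        /* Remove the list from view: new readers now see NULL. */
        rcu_assign_pointer(tr->function_pids, NULL);

        /* Wait for every reader that might still be using the old list. */
        synchronize_rcu();

        /* Nothing can reference it anymore, so no locking is needed. */
        trace_pid_list_free(pid_list);
}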
The protection in trace_pid_list_is_set() is only to synchronize with the
adding and removing of bits by the updates in exit and fork, as well as
with the user manually writing into the set_*_pid files.
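So the lockless read side being discussed looks roughly like this (a
sketch, not the actual patch; the chunk walk and pid_split() follow the
existing kernel/trace/pid_list.c, while the seqcount member is an
assumption about what the patch adds):

bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
{
        union upper_chunk *upper_chunk;
        union lower_chunk *lower_chunk;
        unsigned int upper1, upper2, lower;
        unsigned int seq;
        bool ret = false;

        if (!pid_list)
                return false;

        if (pid_split(pid, &upper1, &upper2, &lower) < 0)
                return false;

        do {
                seq = read_seqcount_begin(&pid_list->seqcount);
                ret = false;
                upper_chunk = READ_ONCE(pid_list->upper[upper1]);
                if (upper_chunk) {
                        lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
                        if (lower_chunk)
                                ret = test_bit(lower, lower_chunk->data);
                }
                /* A concurrent update may have moved the chunk: retry. */
        } while (read_seqcount_retry(&pid_list->seqcount, seq));

        return ret;
}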
>
> So I *think* the RCU approach should be doable and cover this.
Where would you put the synchronize_rcu()? In do_exit()?
Also understanding what this is used for helps in understanding the scope
of protection needed.
The pid_list is created when you add anything into one of the pid files in
tracefs. Let's use /sys/kernel/tracing/set_ftrace_pid:
# cd /sys/kernel/tracing
# echo $$ > set_ftrace_pid
# echo 1 > options/function-fork
# cat set_ftrace_pid
2716
2936
# cat set_ftrace_pid
2716
2945
What the above did was to create a pid_list for the function tracer. I
added the bash process pid using $$ (2716). Then when I cat the file, it
shows the pid of the bash process as well as the pid of the cat process,
as the cat process is a child of the bash process. (The second pid differs
between the two reads because each cat is a new child, and the previous
cat's pid was removed when it exited.) The function-fork option means that
any child of a process in set_ftrace_pid is added to the list as well, and
that a pid is removed from the list when its process exits.
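In other words, function-fork hooks the fork and exit tracepoints and
updates the pid_list from there. Conceptually (a simplified sketch; the
real probes live in kernel/trace/ftrace.c, also handle the "no pid" list,
and the probe names and tracepoint signatures here are illustrative):

static void pid_follow_fork(void *data, struct task_struct *self,
                            struct task_struct *task)
{
        struct trace_pid_list *pid_list = data;

        /* Add the child only if the parent is already being traced. */
        if (trace_pid_list_is_set(pid_list, self->pid))
                trace_pid_list_set(pid_list, task->pid);
}

static void pid_follow_exit(void *data, struct task_struct *task)
{
        struct trace_pid_list *pid_list = data;

        /* Drop the exiting task's bit from the pid_list. */
        trace_pid_list_clear(pid_list, task->pid);
}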
When I enable function tracing, it will only trace the bash process and any
of its children:
# echo 0 > tracing_on
# echo function > current_tracer
# echo 1 > tracing_on ; cat set_ftrace_pid ; echo 0 > tracing_on
2716
2989
# cat trace
[..]
bash-2716 [003] ..... 36854.662833: rcu_read_lock_held <-mtree_range_walk
bash-2716 [003] ..... 36854.662834: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
bash-2716 [003] ..... 36854.662834: rcu_read_lock_held <-vma_start_read
##### CPU 6 buffer started ####
cat-2989 [006] d..2. 36854.662834: ret_from_fork <-ret_from_fork_asm
bash-2716 [003] ..... 36854.662835: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
cat-2989 [006] d..2. 36854.662836: schedule_tail <-ret_from_fork
bash-2716 [003] ..... 36854.662836: __rcu_read_unlock <-lock_vma_under_rcu
cat-2989 [006] d..2. 36854.662836: finish_task_switch.isra.0 <-schedule_tail
bash-2716 [003] ..... 36854.662836: handle_mm_fault <-do_user_addr_fault
[..]
It would be way too expensive to check the pid_list at *every* function
call. But luckily we don't have to. Instead, we set a per-cpu flag in the
instance trace_array on sched_switch if the next pid is in the pid_list and
clear it if it is not. (See ftrace_filter_pid_sched_switch_probe()).
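Something like this (a simplified sketch of that probe; the per-cpu flag
here is illustrative, and the real code stores the result in the per-cpu
trace_array buffer data and also consults the "no pid" list):

/* Illustrative per-cpu flag: "should the task now running be traced?" */
static DEFINE_PER_CPU(bool, sketch_trace_this_task);

static void filter_pid_sched_switch(void *data, struct task_struct *prev,
                                    struct task_struct *next)
{
        struct trace_array *tr = data;
        struct trace_pid_list *pid_list;

        pid_list = rcu_dereference_sched(tr->function_pids);

        /*
         * One pid_list lookup per context switch: remember whether the
         * task about to run is in the list.  The function callback then
         * only tests this per-cpu flag, not the pid_list itself.
         * (With no pid_list at all, everything is traced.)
         */
        this_cpu_write(sketch_trace_this_task,
                       !pid_list || trace_pid_list_is_set(pid_list, next->pid));
}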
This means the bit being checked in the pid_list is always for a task that
is about to run.
The bit being cleared is always for the task that is exiting (except for
the case of manual updates).
What we are protecting against is a chunk being freed and then allocated
again for a different set of PIDs. The reader still holds a pointer to the
chunk, but it has been freed and re-allocated, so the bit that is about to
be tested no longer represents the pid it is checking for.
  CPU1                          CPU2
  ----                          ----
trace_pid_list_is_set() {
  lower_chunk = upper_chunk->data[upper2];

                                do_exit() {
                                  trace_pid_list_clear() {
                                    lower_chunk = upper_chunk->data[upper2]; // same lower_chunk
                                    clear_bit(lower, lower_chunk->data);
                                    if (find_first_bit(lower_chunk->data, LOWER_MAX) >= LOWER_MAX)
                                      put_lower_chunk(pid_list, lower_chunk);
                                    [..]
                                  }
                                }

                                fork() {
                                  trace_pid_list_set() {
                                    lower_chunk = get_lower_chunk(pid_list);
                                    upper_chunk->data[upper2] = lower_chunk; // upper2 is a different value here
                                    set_bit(lower, lower_chunk->data);
                                  }
                                }

  ret = test_bit(lower, lower_chunk->data);
}
And if the "lower" bit matches the set_bit from CPU2, we have a false
positive. Although this race is highly unlikely, we should still protect
against it (it could happen on a VM vCPU that was preempted in
trace_pid_list_is_set()).
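And that is what the seqcount in trace_pid_list_is_set() catches: the
update side bumps the sequence while holding the existing pid_list->lock,
so a reader that raced with the clear/reuse above sees the sequence change
and retries. Roughly (again a sketch of the pattern under discussion, not
the actual patch; the seqcount member is an assumption and the upper-chunk
cleanup is trimmed):

int trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid)
{
        union upper_chunk *upper_chunk;
        union lower_chunk *lower_chunk;
        unsigned int upper1, upper2, lower;
        unsigned long flags;

        if (!pid_list)
                return -ENODEV;

        if (pid_split(pid, &upper1, &upper2, &lower) < 0)
                return -EINVAL;

        raw_spin_lock_irqsave(&pid_list->lock, flags);
        write_seqcount_begin(&pid_list->seqcount);

        upper_chunk = pid_list->upper[upper1];
        if (!upper_chunk)
                goto out;
        lower_chunk = upper_chunk->data[upper2];
        if (!lower_chunk)
                goto out;

        clear_bit(lower, lower_chunk->data);

        /* If the chunk is now empty, recycle it onto the free list. */
        if (find_first_bit(lower_chunk->data, LOWER_MAX) >= LOWER_MAX) {
                put_lower_chunk(pid_list, lower_chunk);
                WRITE_ONCE(upper_chunk->data[upper2], NULL);
        }
 out:
        write_seqcount_end(&pid_list->seqcount);
        raw_spin_unlock_irqrestore(&pid_list->lock, flags);
        return 0;
}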
-- Steve