Message-ID: <20251113100524.5c5f6bdc@gandalf.local.home>
Date: Thu, 13 Nov 2025 10:05:24 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Yongliang Gao <leonylgao@...il.com>, mhiramat@...nel.org,
 mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, frankjpliu@...cent.com, Yongliang Gao
 <leonylgao@...cent.com>, Huang Cun <cunhuang@...cent.com>
Subject: Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention

On Thu, 13 Nov 2025 15:15:15 +0100
Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:

> I don't see why RCU work shouldn't work here.
> If a pid is removed, it might result in a chunk being cleared/removed,
> and then upper_chunk/lower_chunk can become NULL. The buffer itself can
> be reused and point to something else. That could lead to a false
> outcome in test_bit(). This is handled by read_seqcount_retry().
> 
> You could assign upper1, upper2, to NULL/buffer as now, and delay the
> removal (in put_lower_chunk(), put_upper_chunk()) to RCU so it happens
> after the grace period. That would allow you to iterate over it in
> trace_pid_list_is_set() locklessly, since the buffer will not disappear
> and will not be reused for another task until after all readers have
> left.

The chunks are reused. When they are cleared, they go to a free-list, and
when a new chunk is needed, it is taken from the free-list so that we do
not need to allocate.

When all the bits in a chunk are cleared, it moves to the free-list. This
can happen at exit, when a task exits and clears its pid bit from the
pid_list.

Then a chunk can be allocated in fork, when a task whose pid is in the
pid_list adds its child to the list as well.

This means the chunks are never actually freed, and we can't be doing a
synchronize_rcu() on every exit.
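
Roughly, as a userspace sketch of that reuse pattern (the names and layout
here are made up for illustration, this is not the actual pid_list code):

#include <stdio.h>
#include <stdlib.h>

struct chunk {
	unsigned long data[16];
	struct chunk *next;		/* free-list linkage */
};

static struct chunk *free_list;

/* Prefer a chunk from the free-list; allocate only when it is empty. */
static struct chunk *get_chunk(void)
{
	struct chunk *c = free_list;

	if (c) {
		free_list = c->next;
		return c;
	}
	return calloc(1, sizeof(*c));
}

/* A fully cleared chunk is not freed, it is pushed back for reuse. */
static void put_chunk(struct chunk *c)
{
	c->next = free_list;
	free_list = c;
}

int main(void)
{
	struct chunk *a = get_chunk();	/* taken at fork */

	put_chunk(a);			/* all bits cleared at exit */

	struct chunk *b = get_chunk();	/* the next fork gets the same memory */

	printf("reused: %s\n", a == b ? "yes" : "no");
	return 0;
}

The consequence is that a reader can still be holding a pointer to a chunk
that has already gone back to the free-list and come out again for a
completely different set of pids.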

> 
> Additionally it would guarantee that the buffer is not released in
> trace_pid_list_free(). I don't see how the seqcount ensures that the
> buffer is not gone. I mean you could have a reader and the retry would
> force you to do another loop but before that happens you dereference the
> upper_chunk pointer which could be reused.

This protection has nothing to do with trace_pid_list_free(). In fact,
you'll notice that function doesn't even have any locking. That's because
the pid_list itself is removed from view and RCU synchronization happens
before that function is called.

The protection in trace_pid_list_is_set() is only there to synchronize with
the setting and clearing of bits by the updates in exit and fork, as well
as with the user manually writing into the set_*_pid files.
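
In other words, the reader side is essentially the classic seqcount retry
pattern. As a rough userspace model (this is not the kernel's seqcount
code, and a real implementation needs the proper barriers/READ_ONCE; it
only shows the retry logic):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_uint seq;		/* even: idle, odd: update in progress */
static unsigned long bits;	/* stand-in for a pid_list chunk */

/* Updater: bump the count around the modification. */
static void update_set_bit(unsigned int bit)
{
	atomic_fetch_add(&seq, 1);	/* begin write section */
	bits |= 1UL << bit;
	atomic_fetch_add(&seq, 1);	/* end write section */
}

/* Reader: redo the test if an update raced with it. */
static bool is_set(unsigned int bit)
{
	unsigned int start;
	bool ret;

	do {
		start = atomic_load(&seq);
		ret = bits & (1UL << bit);
		/* Retry if an update was, or is, in progress. */
	} while ((start & 1) || start != atomic_load(&seq));

	return ret;
}

int main(void)
{
	update_set_bit(3);
	printf("bit 3: %d\n", is_set(3));
	return 0;
}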

> 
> So I *think* the RCU approach should be doable and cover this.

Where would you put the synchronize_rcu()? In do_exit()?

Also understanding what this is used for helps in understanding the scope
of protection needed.

The pid_list is created when you add anything into one of the pid files in
tracefs. Let's use /sys/kernel/tracing/set_ftrace_pid:

  # cd /sys/kernel/tracing
  # echo $$ > set_ftrace_pid
  # echo 1 > options/function-fork
  # cat set_ftrace_pid
  2716
  2936
  # cat set_ftrace_pid
  2716
  2945

What the above did was create a pid_list for the function tracer. I added
the bash process pid using $$ (2716). Then each time I cat the file, it
shows the pid of the bash process as well as the pid of the cat process
itself, as the cat process is a child of the bash process. The
function-fork option means to add any child process to set_ftrace_pid if
its parent is already in the list. It also means to remove the pid when a
process in the list exits.

When I enable function tracing, it will only trace the bash process and any
of its children:

 # echo 1 > tracing_on
 # echo function > current_tracer
 # cat set_ftrace_pid ; echo 0 > tracing_on
 2716
 2989
 # cat trace
[..]
            bash-2716    [003] ..... 36854.662833: rcu_read_lock_held <-mtree_range_walk
            bash-2716    [003] ..... 36854.662834: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
            bash-2716    [003] ..... 36854.662834: rcu_read_lock_held <-vma_start_read
##### CPU 6 buffer started ####
             cat-2989    [006] d..2. 36854.662834: ret_from_fork <-ret_from_fork_asm
            bash-2716    [003] ..... 36854.662835: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
             cat-2989    [006] d..2. 36854.662836: schedule_tail <-ret_from_fork
            bash-2716    [003] ..... 36854.662836: __rcu_read_unlock <-lock_vma_under_rcu
             cat-2989    [006] d..2. 36854.662836: finish_task_switch.isra.0 <-schedule_tail
            bash-2716    [003] ..... 36854.662836: handle_mm_fault <-do_user_addr_fault
[..]

It would be way too expensive to check the pid_list at *every* function
call. But luckily we don't have to. Instead, we set a per-cpu flag in the
instance trace_array on sched_switch if the next pid is in the pid_list and
clear it if it is not. (See ftrace_filter_pid_sched_switch_probe()).
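
As a rough sketch of that idea (the names here are made up; the real code
is ftrace_filter_pid_sched_switch_probe() and the per-CPU data hanging off
the trace_array):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Stand-in for trace_pid_list_is_set(); pretend only pid 2716 is listed. */
static bool pid_in_list(int pid)
{
	return pid == 2716;
}

static bool ignore_pid[NR_CPUS];	/* per-CPU: skip tracing on this CPU */

/* Runs once per context switch, not once per traced function. */
static void sched_switch_probe(int cpu, int next_pid)
{
	ignore_pid[cpu] = !pid_in_list(next_pid);
}

/* The per-function hook only checks the cheap per-CPU flag. */
static void function_hook(int cpu, const char *func)
{
	if (ignore_pid[cpu])
		return;
	printf("cpu%d: %s\n", cpu, func);
}

int main(void)
{
	sched_switch_probe(3, 2716);		/* bash scheduled in on CPU 3 */
	function_hook(3, "handle_mm_fault");	/* traced */

	sched_switch_probe(6, 9999);		/* some unrelated task on CPU 6 */
	function_hook(6, "handle_mm_fault");	/* skipped */
	return 0;
}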

This means the bit being checked in the pid_list is always for a task that
is about to run.

The bit being cleared is always for the task that is exiting (except in
the case of manual updates).

What we are protecting against is a chunk being freed but then allocated
again for a different set of PIDs. While the reader holds a pointer to the
chunk, it gets freed and re-allocated, so the bit that is about to be
tested no longer represents the PID the reader is checking for.

    CPU1					CPU2
    ----					----

 trace_pid_list_is_set() {
   lower_chunk = upper_chunk->data[upper2];

					do_exit() {
					  trace_pid_list_clear() {
					    lower_chunk = upper_chunk->data[upper2]; // same lower_chunk
					    clear_bit(lower, lower_chunk->data);
					    if (find_first_bit(lower_chunk->data, LOWER_MAX) >= LOWER_MAX)
					      put_lower_chunk(pid_list, lower_chunk);
					    [..]
					  }
					}

					fork() {
					  trace_pid_list_set() {
					    lower_chunk = get_lower_chunk(pid_list);
					    upper_chunk->data[upper2] = lower_chunk; // upper2 is a different value here
					    set_bit(lower, lower_chunk->data);
					  }
					}

   ret = test_bit(lower, lower_chunk->data);

And if the "lower" bit matches the set_bit from CPU2, we have a false
positive. Although, this race is highly unlikely, we should still protect
against it (it could happen on a VM vCPU that was preempted in
trace_pid_list_is_set()).

-- Steve
