linux-kernel - Re: [PATCH v3] trace/pid_list: optimize pid

Message-ID: <20251113151729.4Zky6d-t@linutronix.de>
Date: Thu, 13 Nov 2025 16:17:29 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Yongliang Gao <leonylgao@...il.com>, mhiramat@...nel.org,
	mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org, frankjpliu@...cent.com,
	Yongliang Gao <leonylgao@...cent.com>,
	Huang Cun <cunhuang@...cent.com>
Subject: Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention

On 2025-11-13 10:05:24 [-0500], Steven Rostedt wrote:
> This means that the chunks are not being freed and we can't be doing
> synchronize_rcu() in every exit.

You don't have to; you can use call_rcu() instead.

> > Additionally it would guarantee that the buffer is not released in
> > trace_pid_list_free(). I don't see how the seqcount ensures that the
> > buffer is not gone. I mean you could have a reader and the retry would
> > force you to do another loop but before that happens you dereference the
> > upper_chunk pointer which could be reused.
> 
> This protection has nothing to do with trace_pid_list_free(). In fact,
> you'll notice that function doesn't even have any locking. That's because
> the pid_list itself is removed from view and RCU synchronization happens
> before that function is called.
> 
> The protection in trace_pid_list_is_set() is only to synchronize with the
> adding and removing of the bits in the updates in exit and fork as well as
> with the user manually writing into the set_*_pid files.

So if the kfree() is not the issue, the only concern is that a chunk
reused from the freelist must not end up describing the wrong PIDs while
a reader still holds it? And that is what the seqcount is for?

> > So I *think* the RCU approach should be doable and cover this.
> 
> Where would you put the synchronize_rcu()? In do_exit()?

Simply use call_rcu() and let the callback move the chunk to the freelist.
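
To make the idea concrete, here is a userspace toy model of deferring the
move to the freelist until a grace period has passed. It is only a sketch:
every toy_* name is made up for illustration, and the "grace period" is
just an explicit function call. The one thing borrowed from the kernel is
the shape of call_rcu(), which takes an rcu_head and a callback.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct rcu_head: a list node plus the deferred callback. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Hypothetical chunk; the rcu member is first so the callback can cast back. */
struct toy_chunk {
	struct toy_rcu_head rcu;
	struct toy_chunk *free_next;
	unsigned long bits;
};

static struct toy_rcu_head *toy_pending;	/* callbacks waiting for a grace period */
static struct toy_chunk *toy_freelist;		/* chunks that are safe to reuse */

/* Queue the callback instead of blocking like synchronize_rcu() would. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	assert(head && func);
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* Deferred callback: only now does the chunk become reusable. */
static void toy_chunk_to_freelist(struct toy_rcu_head *head)
{
	struct toy_chunk *chunk = (struct toy_chunk *)head;

	chunk->bits = 0;
	chunk->free_next = toy_freelist;
	toy_freelist = chunk;
}

/* Pretend all pre-existing readers have finished: run the queued callbacks. */
static void toy_grace_period_elapsed(void)
{
	while (toy_pending) {
		struct toy_rcu_head *head = toy_pending;

		toy_pending = head->next;
		head->func(head);
	}
}

static size_t toy_freelist_len(void)
{
	size_t n = 0;

	for (struct toy_chunk *c = toy_freelist; c; c = c->free_next)
		n++;
	return n;
}
```

The point of the model: between toy_call_rcu() and the grace period the
chunk is neither cleared nor on the freelist, so a reader that still holds
it keeps seeing the old bits.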

> Also understanding what this is used for helps in understanding the scope
> of protection needed.
> 
> The pid_list is created when you add anything into one of the pid files in
> tracefs. Let's use /sys/kernel/tracing/set_ftrace_pid:
> 
>   # cd /sys/kernel/tracing
>   # echo $$ > set_ftrace_pid
>   # echo 1 > options/function-fork
>   # cat set_ftrace_pid
>   2716
>   2936
>   # cat set_ftrace_pid
>   2716
>   2945
> 
> What the above did was to create a pid_list for the function tracer. I
> added the bash process pid using $$ (2716). Then when I cat the file, it
> showed the pid for the bash process as well as the pid for the cat process,
> as the cat process is a child of the bash process. The function-fork option
> means to add any child process to the set_ftrace_pid if the parent is
> already in the list. It also means to remove the pid if a process in the
> list exits.

So adding (manually or on fork) and removing (manually or on exit) are the
only write-side operations?
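
If so, the write side can be modeled roughly like this. This is a sketch
under simplifying assumptions: a flat bitmap instead of the chunked
pid_list, a small hypothetical TOY_PID_MAX, no locking, and toy_* names
that do not exist in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_PID_MAX	4096u	/* hypothetical limit, just for the model */
#define TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* Flat bitmap standing in for the pid_list. */
struct toy_pid_list {
	unsigned long words[TOY_PID_MAX / TOY_WORD_BITS];
};

static bool toy_is_set(const struct toy_pid_list *l, unsigned int pid)
{
	assert(pid < TOY_PID_MAX);
	return (l->words[pid / TOY_WORD_BITS] >> (pid % TOY_WORD_BITS)) & 1ul;
}

/* Manual add, e.g. a write to set_ftrace_pid. */
static void toy_set(struct toy_pid_list *l, unsigned int pid)
{
	assert(pid < TOY_PID_MAX);
	l->words[pid / TOY_WORD_BITS] |= 1ul << (pid % TOY_WORD_BITS);
}

/* Manual remove. */
static void toy_clear(struct toy_pid_list *l, unsigned int pid)
{
	assert(pid < TOY_PID_MAX);
	l->words[pid / TOY_WORD_BITS] &= ~(1ul << (pid % TOY_WORD_BITS));
}

/* function-fork: the child is added only if the parent is already listed. */
static void toy_on_fork(struct toy_pid_list *l, unsigned int parent,
			unsigned int child)
{
	if (toy_is_set(l, parent))
		toy_set(l, child);
}

/* function-fork: an exiting task is removed from the list. */
static void toy_on_exit(struct toy_pid_list *l, unsigned int pid)
{
	toy_clear(l, pid);
}
```

With the PIDs from the example above, setting 2716 and then forking 2936
from it leaves both bits set until 2936 exits.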

> When I enable function tracing, it will only trace the bash process and any
> of its children:
> 
>  # echo 0 > tracing_on
>  # echo function > current_tracer
>  # cat set_ftrace_pid ; echo 0 > tracing_on
>  2716
>  2989
>  # cat trace
> [..]
> 
>             bash-2716    [003] ..... 36854.662833: rcu_read_lock_held <-mtree_range_walk
>             bash-2716    [003] ..... 36854.662834: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
>             bash-2716    [003] ..... 36854.662834: rcu_read_lock_held <-vma_start_read
> ##### CPU 6 buffer started ####
>              cat-2989    [006] d..2. 36854.662834: ret_from_fork <-ret_from_fork_asm
>             bash-2716    [003] ..... 36854.662835: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
>              cat-2989    [006] d..2. 36854.662836: schedule_tail <-ret_from_fork
>             bash-2716    [003] ..... 36854.662836: __rcu_read_unlock <-lock_vma_under_rcu
>              cat-2989    [006] d..2. 36854.662836: finish_task_switch.isra.0 <-schedule_tail
>             bash-2716    [003] ..... 36854.662836: handle_mm_fault <-do_user_addr_fault
> [..]
> 
> It would be way too expensive to check the pid_list at *every* function
> call. But luckily we don't have to. Instead, we set a per-cpu flag in the
> instance trace_array on sched_switch if the next pid is in the pid_list and
> clear it if it is not. (See ftrace_filter_pid_sched_switch_probe()).
> 
> This means, the bit being checked in the pid_list is always for a task that
> is about to run.
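
As I read it, the scheme boils down to something like the following
userspace model. It is a sketch only: the arrays, the toy_* names and the
fixed CPU/PID limits are invented here, and a plain bool array stands in
for the pid_list lookup.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS	8	/* hypothetical limits for the model */
#define TOY_PID_MAX	4096

static bool toy_pid_filter[TOY_PID_MAX];	/* stand-in for the pid_list */
static bool toy_cpu_traces[TOY_NR_CPUS];	/* per-CPU flag */
static unsigned long toy_traced_calls;

/* sched_switch: the only place where the pid list is consulted. */
static void toy_sched_switch(int cpu, int next_pid)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	assert(next_pid >= 0 && next_pid < TOY_PID_MAX);
	toy_cpu_traces[cpu] = toy_pid_filter[next_pid];
}

/* Function entry: a cheap per-CPU flag test, no pid list lookup. */
static void toy_function_entry(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (!toy_cpu_traces[cpu])
		return;
	toy_traced_calls++;
}
```

So the lookup cost is paid once per context switch rather than once per
traced function.
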
> 
> The bit being cleared is always for the task that is exiting (except for
> the case of manual updates).
> 
> What we are protecting against is a chunk being freed and then allocated
> again for a different set of PIDs: the reader still holds the chunk, it is
> freed and re-allocated underneath it, and the bit the reader is about to
> check no longer represents the PID it is checking for.

That is what I assumed.
And the kfree() at the end cannot happen while there is still a reader?
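
For reference, this is how I picture the retry protecting against a
recycled chunk, as a userspace sketch with C11 atomics. The toy_* types
and names are hypothetical, a single writer is assumed, and the default
sequentially consistent atomics are used to keep the ordering question out
of the picture. Note that the retry only detects that the chunk changed
underneath the reader; by itself it does nothing to keep the memory around.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_CHUNK_BITS 32u

/* Hypothetical sequence counter: even = stable, odd = writer in progress. */
struct toy_seqcount {
	atomic_uint sequence;
};

/* Hypothetical chunk: the PID range it covers plus the bits themselves. */
struct toy_chunk {
	atomic_uint base_pid;
	atomic_ulong bits;
};

static unsigned int toy_read_begin(struct toy_seqcount *s)
{
	unsigned int seq;

	/* Spin while a writer is in the middle of an update. */
	while ((seq = atomic_load(&s->sequence)) & 1u)
		;
	return seq;
}

static bool toy_read_retry(struct toy_seqcount *s, unsigned int start)
{
	return atomic_load(&s->sequence) != start;
}

/* Writer: hand the chunk to a different PID range (single writer assumed). */
static void toy_recycle_chunk(struct toy_seqcount *s, struct toy_chunk *c,
			      unsigned int new_base)
{
	assert((atomic_load(&s->sequence) & 1u) == 0);
	atomic_fetch_add(&s->sequence, 1);	/* odd: update in progress */
	atomic_store(&c->base_pid, new_base);
	atomic_store(&c->bits, 0);
	atomic_fetch_add(&s->sequence, 1);	/* even again: update done */
}

/* Reader: redo the lookup if the chunk was recycled while we looked at it. */
static bool toy_pid_is_set(struct toy_seqcount *s, struct toy_chunk *c,
			   unsigned int pid)
{
	unsigned int seq, off;
	bool ret;

	do {
		seq = toy_read_begin(s);
		off = pid - atomic_load(&c->base_pid);	/* wraps if pid < base */
		ret = off < TOY_CHUNK_BITS &&
		      ((atomic_load(&c->bits) >> off) & 1ul);
	} while (toy_read_retry(s, seq));

	return ret;
}
```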

…
> And if the "lower" bit matches the set_bit from CPU2, we have a false
> positive. Although, this race is highly unlikely, we should still protect
> against it (it could happen on a VM vCPU that was preempted in
> trace_pid_list_is_set()).
> 
> -- Steve

Sebastian