linux-kernel - Re: [PATCH v3] trace/pid_list: optimize pid

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20251113141515.iZSIDK0T@linutronix.de>
Date: Thu, 13 Nov 2025 15:15:15 +0100
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Yongliang Gao <leonylgao@...il.com>
Cc: rostedt@...dmis.org, mhiramat@...nel.org,
	mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org, frankjpliu@...cent.com,
	Yongliang Gao <leonylgao@...cent.com>,
	Huang Cun <cunhuang@...cent.com>
Subject: Re: [PATCH v3] trace/pid_list: optimize pid_list->lock contention

On 2025-11-13 19:13:23 [+0800], Yongliang Gao wrote:
> Hi Sebastian,
Hi Yongliang,

> Thank you for your review and the thoughtful questions.
> 
> 1. Performance Data
> We encountered this issue in a production environment with 288 cores
> where enabling set_ftrace_pid caused system CPU usage (sys%) to
> increase from 10% to over 90%. In our 92-core VM test environment:
> 
> Before patch (spinlock):
> - Without filtering: cs=2395401/s, sys%=7%
> - With filtering: cs=1828261/s, sys%=40%
> 
> After patch (seqlock):
> - Without filtering: cs=2397032/s, sys%=6%
> - With filtering: cs=2398922/s, sys%=6%
> 
> The seqlock approach eliminates the pid_list->lock contention that was
> previously causing sys% to increase from 7% to 40%.
> 
> 2. Reader Retry Behavior
> Yes, if the write side is continuously busy, the reader might spin and
> retry. However, in practice:
> - Writes are infrequent (only when setting ftrace_pid filter or during
> task fork/exit with function-fork enabled)
> - For readers, trace_pid_list_is_set() is called on every task switch,
> which can occur at a very high frequency.

See.

> 3. Result Accuracy
> You're correct that the result might change immediately after the
> read. For trace_ignore_this_task(), we don't require absolute
> accuracy. Slight race conditions (where a task might be traced or not
> in borderline cases) are acceptable.

I don't see why RCU work shouldn't work here.
If a pid is removed then it might result that a chunk cleared/ removed
then upper_chunk/ lower_chunk can become NULL. The buffer itself can be
reused and point to something else. It could lear to a false outcome in
test_bit(). This is handled by read_seqcount_retry().

You could assign upper1, upper2, to NULL/ buffer as now and the removal
(in put_lower_chunk(), put_upper_chunk()) delay to RCU so it happens
after the grace period. That would allow you to iterate over it in
trace_pid_list_is_set() lockless since the buffer will not disappear and
will not be reused for another task until after all reader left.

Additionally it would guarantee that the buffer is not released in
trace_pid_list_free(). I don't see how the seqcount ensures that the
buffer is not gone. I mean you could have a reader and the retry would
force you to do another loop but before that happens you dereference the
upper_chunk pointer which could be reused.

So I *think* the RCU approach should be doable and cover this.

> Best regards,
> Yongliang

Sebastian