Message-ID: <CAJxhyqDCFRT_fPWHb67x-PUu+Om91UrbrQEifcF7m+dkZ35dqA@mail.gmail.com>
Date: Wed, 12 Nov 2025 13:27:10 +0800
From: Yongliang Gao <leonylgao@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>, mhiramat@...nel.org,
mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, Yongliang Gao <leonylgao@...cent.com>,
Huang Cun <cunhuang@...cent.com>, frankjpliu@...cent.com
Subject: Re: [PATCH] trace/pid_list: optimize pid_list->lock contention
Hi Steven,
Thank you for your detailed response and the proposed RCU-like approach.

I've looked into using a regular seqlock instead of the current
implementation, but as you pointed out, the write side is indeed a
critical path. More importantly, even with a seqlock, write_seqlock()
internally takes a spinlock_t, which PREEMPT_RT converts to a sleeping
lock. That reintroduces exactly the issue we're trying to avoid: a
potential sleep in atomic context.

Below is the read side converted to the seqlock, followed by the splat
hit on PREEMPT_RT when trace_pid_list_set() ends up in the sleeping
lock under the sched_process_fork tracepoint:
bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
{
        union upper_chunk *upper_chunk;
        union lower_chunk *lower_chunk;
        unsigned int seq;
        unsigned long flags;
        unsigned int upper1;
        unsigned int upper2;
        unsigned int lower;
        bool ret = false;

        if (!pid_list)
                return false;

        if (pid_split(pid, &upper1, &upper2, &lower) < 0)
                return false;

        do {
                local_irq_save(flags);
                seq = read_seqbegin(&pid_list->lock);
                ret = false;
                upper_chunk = pid_list->upper[upper1];
                if (upper_chunk) {
                        lower_chunk = upper_chunk->data[upper2];
                        if (lower_chunk)
                                ret = test_bit(lower, lower_chunk->data);
                }
                local_irq_restore(flags);
        } while (read_seqretry(&pid_list->lock, seq));

        return ret;
}
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 192, name: bash
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
CPU: 3 UID: 0 PID: 192 Comm: bash Not tainted 6.18.0-rc5+ #84 PREEMPT_{RT,LAZY}
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
Call Trace:
<TASK>
dump_stack_lvl+0x4f/0x70
__might_resched+0x113/0x160
rt_spin_lock+0x41/0x130
trace_pid_list_set+0x52/0x150
ftrace_pid_follow_sched_process_fork+0x19/0x30
kernel_clone+0x1b8/0x3e0
__do_sys_clone+0x65/0x90
do_syscall_64+0x48/0xa60
entry_SYSCALL_64_after_hwframe+0x76/0x7e
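
For reference, here is roughly what the corresponding write side looks
like with pid_list->lock as a seqlock_t. This is only a sketch, modeled
on the existing trace_pid_list_set() and the existing
get_upper_chunk()/get_lower_chunk() helpers; the point is that
write_seqlock_irqsave() acquires the embedded spinlock_t, which
PREEMPT_RT turns into a sleeping lock, and that is the rt_spin_lock()
in the trace above:

int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid)
{
        union upper_chunk *upper_chunk;
        union lower_chunk *lower_chunk;
        unsigned long flags;
        unsigned int upper1;
        unsigned int upper2;
        unsigned int lower;
        int ret;

        if (!pid_list)
                return -ENODEV;

        if (pid_split(pid, &upper1, &upper2, &lower) < 0)
                return -EINVAL;

        /*
         * On PREEMPT_RT this is a sleeping lock; taking it under the
         * sched_process_fork tracepoint is what triggers the splat.
         */
        write_seqlock_irqsave(&pid_list->lock, flags);
        upper_chunk = pid_list->upper[upper1];
        if (!upper_chunk) {
                upper_chunk = get_upper_chunk(pid_list);
                if (!upper_chunk) {
                        ret = -ENOMEM;
                        goto out;
                }
                pid_list->upper[upper1] = upper_chunk;
        }
        lower_chunk = upper_chunk->data[upper2];
        if (!lower_chunk) {
                lower_chunk = get_lower_chunk(pid_list);
                if (!lower_chunk) {
                        ret = -ENOMEM;
                        goto out;
                }
                upper_chunk->data[upper2] = lower_chunk;
        }
        set_bit(lower, lower_chunk->data);
        ret = 0;
 out:
        write_sequnlock_irqrestore(&pid_list->lock, flags);
        return ret;
}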
Your proposed solution using an atomic counter and memory barriers
should provide the lock-free read path we need while maintaining
correctness. However, getting it right requires careful handling of
all the memory-ordering scenarios, and I'm concerned about introducing
subtle bugs given the complexity.
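
To make the ordering concern concrete, here is how I read the required
barrier pairing in your sketch, written out as a helper. This is
untested and only for illustration; pid_list_test_lockless() is just a
placeholder name, and "counter" is the new field your sketch adds to
struct trace_pid_list:

/*
 * Reader: snapshot the counter, order the snapshot before the chunk
 * loads, and re-check the counter after the bit test.  This pairs with
 * the writers: the free path does WRITE_ONCE(..., NULL); smp_wmb();
 * atomic_inc(), and the alloc path does atomic_inc(); smp_wmb();
 * WRITE_ONCE(..., chunk), so a reader that dereferenced a recycled
 * chunk must also observe a counter change and retry.
 */
static bool pid_list_test_lockless(struct trace_pid_list *pid_list,
                                   unsigned int upper1, unsigned int upper2,
                                   unsigned int lower)
{
        union upper_chunk *upper_chunk;
        union lower_chunk *lower_chunk;
        unsigned int counter;
        bool ret;

again:
        ret = false;
        counter = atomic_read(&pid_list->counter);
        smp_rmb();              /* counter snapshot before the chunk loads */
        upper_chunk = READ_ONCE(pid_list->upper[upper1]);
        if (upper_chunk) {
                lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
                if (lower_chunk) {
                        ret = test_bit(lower, lower_chunk->data);
                        smp_rmb();      /* bit test before re-reading counter */
                        if (unlikely(counter != atomic_read(&pid_list->counter)))
                                goto again;
                }
        }
        return ret;
}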
Would you be willing to submit a patch implementing your approach?
On Tue, Nov 11, 2025 at 11:27 PM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Tue, 11 Nov 2025 09:13:14 +0100
> Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:
>
> > Nope, no read-write lock that can be used in atomic sections. Well,
> > there is RCU.
>
> Well, it can't simply be replaced by RCU as the write side is also a
> critical path. It happens when new tasks are spawned.
>
> Now we could possibly do some RCU like magic, and remove the lock in the
> read, but it would need some care with the writes.
>
> Something like this (untested):
>
> bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
> {
>         union upper_chunk *upper_chunk;
>         union lower_chunk *lower_chunk;
>         unsigned long flags;
>         unsigned int upper1;
>         unsigned int upper2;
>         unsigned int lower;
>         bool ret = false;
>
>         if (!pid_list)
>                 return false;
>
>         if (pid_split(pid, &upper1, &upper2, &lower) < 0)
>                 return false;
>
>         upper_chunk = READ_ONCE(pid_list->upper[upper1]);
>         if (upper_chunk) {
>                 lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
>                 if (lower_chunk)
>                         ret = test_bit(lower, lower_chunk->data);
>         }
>
>         return ret;
> }
>
> Now when all the bits of a chunk are cleared, it goes to a free-list. And
> when a new chunk is needed, it acquires it from that free-list. We need to
> make sure that the chunk acquired in the read hasn't gone through the
> free-list.
>
> Now we could have an atomic counter in the pid_list and make this more of a
> seqcount? That is, have the counter updated when a chunk goes to the free
> list and also when it is taken from the free list. We could then make this:
>
> again:
>         counter = atomic_read(&pid_list->counter);
>         smp_rmb();
>         upper_chunk = READ_ONCE(pid_list->upper[upper1]);
>         if (upper_chunk) {
>                 lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
>                 if (lower_chunk) {
>                         ret = test_bit(lower, lower_chunk->data);
>                         smp_rmb();
>                         if (unlikely(counter != atomic_read(&pid_list->counter))) {
>                                 ret = false;
>                                 goto again;
>                         }
>                 }
>         }
>
>
> And in the set we need:
>
>         upper_chunk = pid_list->upper[upper1];
>         if (!upper_chunk) {
>                 upper_chunk = get_upper_chunk(pid_list);
>                 if (!upper_chunk) {
>                         ret = -ENOMEM;
>                         goto out;
>                 }
>                 atomic_inc(&pid_list->counter);
>                 smp_wmb();
>                 WRITE_ONCE(pid_list->upper[upper1], upper_chunk);
>         }
>         lower_chunk = upper_chunk->data[upper2];
>         if (!lower_chunk) {
>                 lower_chunk = get_lower_chunk(pid_list);
>                 if (!lower_chunk) {
>                         ret = -ENOMEM;
>                         goto out;
>                 }
>                 atomic_inc(&pid_list->counter);
>                 smp_wmb();
>                 WRITE_ONCE(upper_chunk->data[upper2], lower_chunk);
>         }
>
> and in the clear:
>
>         if (find_first_bit(lower_chunk->data, LOWER_MAX) >= LOWER_MAX) {
>                 put_lower_chunk(pid_list, lower_chunk);
>                 WRITE_ONCE(upper_chunk->data[upper2], NULL);
>                 smp_wmb();
>                 atomic_inc(&pid_list->counter);
>                 if (upper_empty(upper_chunk)) {
>                         put_upper_chunk(pid_list, upper_chunk);
>                         WRITE_ONCE(pid_list->upper[upper1], NULL);
>                         smp_wmb();
>                         atomic_inc(&pid_list->counter);
>                 }
>         }
>
> That is, the counter gets updated after setting the chunk to NULL and
> before assigning it a new value. And reading it, the counter is read before
> looking at any of the chunks, and tested after getting the result. If the
> value is the same, then the chunks are for the correct PID and haven't
> swapped in a free/alloc swap where it's looking at a chunk for a different
> PID.
>
> This would allow for the read to not take any locks.
>
> -- Steve