Message-ID: <CAJxhyqDCFRT_fPWHb67x-PUu+Om91UrbrQEifcF7m+dkZ35dqA@mail.gmail.com>
Date: Wed, 12 Nov 2025 13:27:10 +0800
From: Yongliang Gao <leonylgao@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>, mhiramat@...nel.org, 
	mathieu.desnoyers@...icios.com, linux-kernel@...r.kernel.org, 
	linux-trace-kernel@...r.kernel.org, Yongliang Gao <leonylgao@...cent.com>, 
	Huang Cun <cunhuang@...cent.com>, frankjpliu@...cent.com
Subject: Re: [PATCH] trace/pid_list: optimize pid_list->lock contention

Hi Steven,

Thank you for your detailed response and the proposed RCU-like approach.

I've looked into using a regular seqlock instead of the current
implementation, but as you pointed out, the write side is indeed a
critical path. More importantly, even with a seqlock, write_seqlock()
internally takes spin_lock(), which on PREEMPT_RT becomes a sleeping
rtmutex. That would reintroduce exactly the issue we're trying to
avoid: sleeping in atomic context. For reference, here is the read
side I prototyped:

bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
{
    union upper_chunk *upper_chunk;
    union lower_chunk *lower_chunk;
    unsigned int seq;
    unsigned long flags;
    unsigned int upper1;
    unsigned int upper2;
    unsigned int lower;
    bool ret = false;

    if (!pid_list)
        return false;

    if (pid_split(pid, &upper1, &upper2, &lower) < 0)
        return false;

    /* Retry the lookup if a writer updated the sequence meanwhile. */
    do {
        local_irq_save(flags);
        seq = read_seqbegin(&pid_list->lock);
        ret = false;
        upper_chunk = pid_list->upper[upper1];
        if (upper_chunk) {
            lower_chunk = upper_chunk->data[upper2];
            if (lower_chunk)
                ret = test_bit(lower, lower_chunk->data);
        }
        local_irq_restore(flags);
    } while (read_seqretry(&pid_list->lock, seq));

    return ret;
}
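
The write side is where this falls apart. A seqlock-based
trace_pid_list_set() would look roughly like the untested,
hypothetical sketch below (using the get_upper_chunk()/
get_lower_chunk() helpers from your fragments): write_seqlock_irqsave()
still acquires the embedded spinlock_t, which PREEMPT_RT turns into a
sleeping rtmutex, so the fork-path caller would still sleep in atomic
context.

int trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid)
{
    union upper_chunk *upper_chunk;
    union lower_chunk *lower_chunk;
    unsigned long flags;
    unsigned int upper1;
    unsigned int upper2;
    unsigned int lower;
    int ret = 0;

    if (!pid_list)
        return -ENODEV;

    if (pid_split(pid, &upper1, &upper2, &lower) < 0)
        return -EINVAL;

    /* write_seqlock_irqsave() -> spin_lock() -> rtmutex on PREEMPT_RT */
    write_seqlock_irqsave(&pid_list->lock, flags);
    upper_chunk = pid_list->upper[upper1];
    if (!upper_chunk) {
        upper_chunk = get_upper_chunk(pid_list);
        if (!upper_chunk) {
            ret = -ENOMEM;
            goto out;
        }
        pid_list->upper[upper1] = upper_chunk;
    }
    lower_chunk = upper_chunk->data[upper2];
    if (!lower_chunk) {
        lower_chunk = get_lower_chunk(pid_list);
        if (!lower_chunk) {
            ret = -ENOMEM;
            goto out;
        }
        upper_chunk->data[upper2] = lower_chunk;
    }
    set_bit(lower, lower_chunk->data);
out:
    write_sequnlock_irqrestore(&pid_list->lock, flags);
    return ret;
}

The splat below shows the caller context that makes any sleeping lock
here invalid: trace_pid_list_set() runs via
ftrace_pid_follow_sched_process_fork() in kernel_clone() with
preempt_count == 1.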

 BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 192, name: bash
 preempt_count: 1, expected: 0
 RCU nest depth: 0, expected: 0
 CPU: 3 UID: 0 PID: 192 Comm: bash Not tainted 6.18.0-rc5+ #84 PREEMPT_{RT,LAZY}
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 Call Trace:
  <TASK>
  dump_stack_lvl+0x4f/0x70
  __might_resched+0x113/0x160
  rt_spin_lock+0x41/0x130
  trace_pid_list_set+0x52/0x150
  ftrace_pid_follow_sched_process_fork+0x19/0x30
  kernel_clone+0x1b8/0x3e0
  __do_sys_clone+0x65/0x90
  do_syscall_64+0x48/0xa60
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Your proposed solution using an atomic counter and memory barriers
should give us the lock-free read path we need while keeping the code
correct. However, getting it right requires careful consideration of
every memory-ordering scenario, and I'm concerned about introducing
subtle bugs given the complexity.
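
For concreteness, this is how I understand your suggestion once the
fragments are assembled into a complete reader (an untested sketch on
my side, assuming pid_list->counter is the atomic_t you describe). I
may well have the barriers or the retry placement wrong:

bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
{
    union upper_chunk *upper_chunk;
    union lower_chunk *lower_chunk;
    unsigned int upper1;
    unsigned int upper2;
    unsigned int lower;
    unsigned int counter;
    bool ret;

    if (!pid_list)
        return false;

    if (pid_split(pid, &upper1, &upper2, &lower) < 0)
        return false;

 again:
    ret = false;
    /* Snapshot the counter before dereferencing any chunk pointer. */
    counter = atomic_read(&pid_list->counter);
    smp_rmb();
    upper_chunk = READ_ONCE(pid_list->upper[upper1]);
    if (upper_chunk) {
        lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
        if (lower_chunk) {
            ret = test_bit(lower, lower_chunk->data);
            /* A counter change means a chunk may have been recycled. */
            smp_rmb();
            if (unlikely(counter != atomic_read(&pid_list->counter)))
                goto again;
        }
    }
    return ret;
}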

Would you be willing to submit a patch implementing your approach?

On Tue, Nov 11, 2025 at 11:27 PM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Tue, 11 Nov 2025 09:13:14 +0100
> Sebastian Andrzej Siewior <bigeasy@...utronix.de> wrote:
>
> > Nope, no read-write lock that can be used in atomic sections. Well,
> > there is RCU.
>
> Well, it can't simply be replaced by RCU as the write side is also a
> critical path. It happens when new tasks are spawned.
>
> Now we could possibly do some RCU-like magic, and remove the lock in the
> read, but it would need some care with the writes.
>
> Something like this (untested):
>
> bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
> {
>         union upper_chunk *upper_chunk;
>         union lower_chunk *lower_chunk;
>         unsigned long flags;
>         unsigned int upper1;
>         unsigned int upper2;
>         unsigned int lower;
>         bool ret = false;
>
>         if (!pid_list)
>                 return false;
>
>         if (pid_split(pid, &upper1, &upper2, &lower) < 0)
>                 return false;
>
>         upper_chunk = READ_ONCE(pid_list->upper[upper1]);
>         if (upper_chunk) {
>                 lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
>                 if (lower_chunk)
>                         ret = test_bit(lower, lower_chunk->data);
>         }
>
>         return ret;
> }
>
> Now when all the bits of a chunk are cleared, it goes to a free-list. And
> when a new chunk is needed, it acquires it from that free-list. We need to
> make sure that the chunk acquired in the read hasn't gone through the
> free-list.
>
> Now we could have an atomic counter in the pid_list and make this more of a
> seqcount? That is, have the counter updated when a chunk goes to the free
> list and also when it is taken from the free list. We could then make this:
>
>  again:
>         counter = atomic_read(&pid_list->counter);
>         smp_rmb();
>         upper_chunk = READ_ONCE(pid_list->upper[upper1]);
>         if (upper_chunk) {
>                 lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
>                 if (lower_chunk) {
>                         ret = test_bit(lower, lower_chunk->data);
>                         smp_rmb();
>                         if (unlikely(counter != atomic_read(&pid_list->counter))) {
>                                 ret = false;
>                                 goto again;
>                         }
>                 }
>         }
>
>
> And in the set we need:
>
>         upper_chunk = pid_list->upper[upper1];
>         if (!upper_chunk) {
>                 upper_chunk = get_upper_chunk(pid_list);
>                 if (!upper_chunk) {
>                         ret = -ENOMEM;
>                         goto out;
>                 }
>                 atomic_inc(&pid_list->counter);
>                 smp_wmb();
>                 WRITE_ONCE(pid_list->upper[upper1], upper_chunk);
>         }
>         lower_chunk = upper_chunk->data[upper2];
>         if (!lower_chunk) {
>                 lower_chunk = get_lower_chunk(pid_list);
>                 if (!lower_chunk) {
>                         ret = -ENOMEM;
>                         goto out;
>                 }
>                 atomic_inc(&pid_list->counter);
>                 smp_wmb();
>                 WRITE_ONCE(upper_chunk->data[upper2], lower_chunk);
>         }
>
> and in the clear:
>
>         if (find_first_bit(lower_chunk->data, LOWER_MAX) >= LOWER_MAX) {
>                 put_lower_chunk(pid_list, lower_chunk);
>                 WRITE_ONCE(upper_chunk->data[upper2], NULL);
>                 smp_wmb();
>                 atomic_inc(&pid_list->counter);
>                 if (upper_empty(upper_chunk)) {
>                         put_upper_chunk(pid_list, upper_chunk);
>                         WRITE_ONCE(pid_list->upper[upper1], NULL);
>                         smp_wmb();
>                         atomic_inc(&pid_list->counter);
>                 }
>         }
>
> That is, the counter gets updated after setting the chunk to NULL and
> before assigning it a new value. And on the read side, the counter is read
> before looking at any of the chunks, and tested after getting the result.
> If the value is the same, then the chunks are for the correct PID and
> haven't been recycled through a free/alloc cycle that would leave the
> reader looking at a chunk for a different PID.
>
> This would allow for the read to not take any locks.
>
> -- Steve
