netdev - Re: [PATCH v2 bpf-next] bpf: increment and use correct thread iterator

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ad119bd4-7c92-c623-1137-56345db38c7d@fb.com>
Date:   Fri, 4 Dec 2020 10:54:45 -0800
From:   Yonghong Song <yhs@...com>
To:     Jonathan Lemon <jonathan.lemon@...il.com>
CC:     <netdev@...r.kernel.org>, <ast@...nel.org>, <daniel@...earbox.net>,
        <kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next] bpf: increment and use correct thread
 iterator



On 12/4/20 9:14 AM, Jonathan Lemon wrote:
> On Fri, Dec 04, 2020 at 12:01:53AM -0800, Yonghong Song wrote:
>>
>>
>> On 12/3/20 7:43 PM, Jonathan Lemon wrote:
>>> From: Jonathan Lemon <bsd@...com>
>>
>> Could you explain in the commit log what problem this patch
>> tries to solve? What bad things could happen without this patch?
> 
> Without the patch, on a particular set of systems, RCU will repeatedly
> generate stall warnings similar to the trace below.  The common factor
> for all the traces seems to be using task_file_seq_next().  With the
> patch, all the warnings go away.
> 
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu: \x0910-....: (20666 ticks this GP) idle=4b6/1/0x4000000000000002 softirq=14346773/14346773 fqs=5064
>   \x09(t=21013 jiffies g=25395133 q=154147)
>   NMI backtrace for cpu 10
>   #1
>   Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B17 03/16/2018
>   Call Trace:
>    <IRQ>
>    dump_stack+0x50/0x70
>    nmi_cpu_backtrace.cold.6+0x13/0x50
>    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
>    nmi_trigger_cpumask_backtrace+0xba/0xca
>    rcu_dump_cpu_stacks+0x99/0xc7
>    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
>    ? tick_sched_do_timer+0x60/0x60
>    update_process_times+0x24/0x50
>    tick_sched_timer+0x37/0x70
>    __hrtimer_run_queues+0xfe/0x270
>    hrtimer_interrupt+0xf4/0x210
>    smp_apic_timer_interrupt+0x5e/0x120
>    apic_timer_interrupt+0xf/0x20
>    </IRQ>
>   RIP: 0010:find_ge_pid_upd+0x5/0x20
>   Code: 80 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 d5 d3 9a 00 48 83 c4 08 c3 0f 1f 44 00 00 <48> 89 f8 48 8d 7e 08 48 89 c6 e9 bc d3 9a 00 cc cc cc cc cc cc cc
>   RSP: 0018:ffffc9002b7abdb8 EFLAGS: 00000297 ORIG_RAX: ffffffffffffff13
>   RAX: 00000000002ca5cd RBX: ffff889c44c0ba00 RCX: 0000000000000000
>   RDX: 0000000000000002 RSI: ffffffff8284eb80 RDI: ffffc9002b7abdc4
>   RBP: ffffc9002b7abe0c R08: ffff8895afe93a00 R09: ffff8891388abb50
>   R10: 000000000000000c R11: 00000000002ca600 R12: 000000000000003f
>   R13: ffffffff8284eb80 R14: 0000000000000001 R15: 00000000ffffffff
>    task_seq_get_next+0x53/0x180
>    task_file_seq_get_next+0x159/0x220
>    task_file_seq_next+0x4f/0xa0
>    bpf_seq_read+0x159/0x390
>    vfs_read+0x8a/0x140
>    ksys_read+0x59/0xd0
>    do_syscall_64+0x42/0x110
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
>>> If unable to obtain the file structure for the current task,
>>> proceed to the next task number after the one returned from
>>> task_seq_get_next(), instead of the next task number from the
>>> original iterator.
>> This seems a correct change. The current code should still work
>> but it may do some redundant/unnecessary work in kernel.
>> This only happens when a task does not have any file,
>> no sure whether this is the culprit for the problem this
>> patch tries to address.
>>
>>>
>>> Use thread_group_leader() instead of comparing tgid vs pid, which
>>> might may be racy.
>>
>> I see
>>
>> static inline bool thread_group_leader(struct task_struct *p)
>> {
>>          return p->exit_signal >= 0;
>> }
>>
>> I am not sure whether thread_group_leader(task) is equivalent
>> to task->tgid == task->pid or not. Any documentation or explanation?
>>
>> Could you explain why task->tgid != task->pid in the original
>> code could be racy?
> 
> My understanding is that anything which uses pid_t for comparision
> in the kernel is incorrect.  Looking at de_thread(), there is a
> section which swaps the pid and tids around, but doesn't seem to
> change tgid directly.
> 
> There's also this comment in linux/pid.h:
>          /*
>           * Both old and new leaders may be attached to
>           * the same pid in the middle of de_thread().
>           */
> 
> So the safest thing to do is use the explicit thread_group_leader()
> macro rather than trying to open code things.

in de_thread(), I see p->exit_signal (used in thread_group_leader())
has race as well.

...
                 exchange_tids(tsk, leader);
                 transfer_pid(leader, tsk, PIDTYPE_TGID);
                 transfer_pid(leader, tsk, PIDTYPE_PGID);
                 transfer_pid(leader, tsk, PIDTYPE_SID);

                 list_replace_rcu(&leader->tasks, &tsk->tasks);
                 list_replace_init(&leader->sibling, &tsk->sibling);

                 tsk->group_leader = tsk;
                 leader->group_leader = tsk;

                 tsk->exit_signal = SIGCHLD;
                 leader->exit_signal = -1;

                 BUG_ON(leader->exit_state != EXIT_ZOMBIE);
                 leader->exit_state = EXIT_DEAD;
...

But I am not sure how this race or pid race could impact
task traversal differently.

> 
> 
>>> Only obtain the task reference count at the end of the RCU section
>>> instead of repeatedly obtaining/releasing it when iterathing though
>>> a thread group.
>>
>> I think this is an optimization and not about the correctness.
> 
> Yes, but the loop in question can be executed thousands of times, and
> there isn't much point in doing this needless work.  It's unclear
> whether this is a significant time contribution to the RCU stall,
> but reducing the amount of refcounting isn't a bad thing.

Sure. Totally agree.