netdev - Re: [PATCH 1/1 v3 bpf-next] bpf: increment and use correct thread iterator

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 18 Dec 2020 08:53:22 -0800
From:   Yonghong Song <yhs@...com>
To:     Jonathan Lemon <jonathan.lemon@...il.com>,
        <netdev@...r.kernel.org>, <ast@...nel.org>, <daniel@...earbox.net>,
        <bpf@...r.kernel.org>
CC:     <kernel-team@...com>
Subject: Re: [PATCH 1/1 v3 bpf-next] bpf: increment and use correct thread
 iterator



On 12/11/20 9:11 AM, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@...com>
> 
> On some systems, some variant of the following splat is
> repeatedly seen.  The common factor in all traces seems
> to be the entry point to task_file_seq_next().  With the
> patch, all warnings go away.
> 
>      rcu: INFO: rcu_sched self-detected stall on CPU
>      rcu: \x0926-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
>      \x09(t=21033 jiffies g=159148529 q=223125)
>      NMI backtrace for cpu 26
>      CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
>      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
>      Call Trace:
>       <IRQ>
>       dump_stack+0x50/0x70
>       nmi_cpu_backtrace.cold.6+0x13/0x50
>       ? lapic_can_unplug_cpu.cold.30+0x40/0x40
>       nmi_trigger_cpumask_backtrace+0xba/0xca
>       rcu_dump_cpu_stacks+0x99/0xc7
>       rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
>       ? tick_sched_do_timer+0x60/0x60
>       update_process_times+0x24/0x50
>       tick_sched_timer+0x37/0x70
>       __hrtimer_run_queues+0xfe/0x270
>       hrtimer_interrupt+0xf4/0x210
>       smp_apic_timer_interrupt+0x5e/0x120
>       apic_timer_interrupt+0xf/0x20
>       </IRQ>
>      RIP: 0010:get_pid_task+0x38/0x80
>      Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 <85> d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
>      RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
>      RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
>      RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
>      RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
>      R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
>      R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
>       ? find_ge_pid+0x1b/0x20
>       task_seq_get_next+0x52/0xc0
>       task_file_seq_get_next+0x159/0x220
>       task_file_seq_next+0x4f/0xa0
>       bpf_seq_read+0x159/0x390
>       vfs_read+0x8a/0x140
>       ksys_read+0x59/0xd0
>       do_syscall_64+0x42/0x110
>       entry_SYSCALL_64_after_hwframe+0x44/0xa9
>      RIP: 0033:0x7f95ae73e76e
>      Code: Bad RIP value.
>      RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>      RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
>      RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
>      RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
>      R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
>      R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0
> 
> The attached patch does 3 things:
> 
> 1) If unable to obtain the file structure for the current task,
>     proceed to the next task number after the one returned from
>     task_seq_get_next(), instead of the next task number from the
>     original iterator.

Looks like this fix is the real fix for the above warnings.
Basically, say we have
    info->tid = 10 and returned curr_tid = 3000 and tid 3000 has no files.
the current logic will go through
    - set curr_tid = 11 (info->tid++) and returned curr_tid = 3000
    - set curr_tid = 12 and returned curr_tid = 3000
    ...
    - set curr_tid = 3000 and returned curr_tid = 3000
    - set curr_tid = 3001 and return curr_tid >= 3001

All the above works are redundant work, and it may cause issues
for non preemptable kernel.

I suggest you factor out this change plus the following change
which suggested by Andrii early to a separate patch carried with
the below Fixes tag.

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 0458a40edf10..56bcaef72e36 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -158,6 +158,7 @@ task_file_seq_get_next(struct 
bpf_iter_seq_task_file_info *info)
                 if (!curr_task) {
                         info->task = NULL;
                         info->files = NULL;
+                       info->tid = curr_tid + 1;
                         return NULL;
                 }

> 
> 2) Use thread_group_leader() instead of the open-coded comparision
>     of tgid vs pid.

My experiment show there is no difference between thread_group_leader()
and comparing tgid/pid, but indeed using existing function is better,
so I am okay with this.

> 
> 3) Only obtain the task reference count at the end of the RCU section
>     instead of repeatedly obtaining/releasing it when iterathing though
>     a thread group.

The above two changes are not really fixing the rcu warnings, but
they are nice to have indeed. Could you put it into separate patch
(patch 2)? You can add the following improvement change as well

-bash-4.4$ git diff
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 0458a40edf10..f61e5ddb38ce 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -148,12 +148,12 @@ task_file_seq_get_next(struct 
bpf_iter_seq_task_file_info *info)
          * it held a reference to the task/files_struct/file.
          * Otherwise, it does not hold any reference.
          */
-again:
         if (info->task) {
                 curr_task = info->task;
                 curr_files = info->files;
                 curr_fd = info->fd;
         } else {
+again:
                 curr_task = task_seq_get_next(ns, &curr_tid, true);
                 if (!curr_task) {
                         info->task = NULL;

to reduce one branch for searching next task in task_file_seq_get_next() 
function.

> 
> Fixes: a650da2ee52a ("bpf: Add task and task/file iterator targets")
> Fixes: 67b6b863e6ab ("bpf: Avoid iterating duplicated files for task_file iterator")
> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@...il.com>
> ---
>   kernel/bpf/task_iter.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 0458a40edf10..66a52fcf589a 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -33,17 +33,17 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
>   	pid = find_ge_pid(*tid, ns);
>   	if (pid) {
>   		*tid = pid_nr_ns(pid, ns);
> -		task = get_pid_task(pid, PIDTYPE_PID);
> +		task = pid_task(pid, PIDTYPE_PID);
>   		if (!task) {
>   			++*tid;
>   			goto retry;
> -		} else if (skip_if_dup_files && task->tgid != task->pid &&
> +		} else if (skip_if_dup_files && !thread_group_leader(task) &&
>   			   task->files == task->group_leader->files) {
> -			put_task_struct(task);
>   			task = NULL;
>   			++*tid;
>   			goto retry;
>   		}
> +		get_task_struct(task);
>   	}
>   	rcu_read_unlock();
>   
> @@ -164,7 +164,7 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info)
>   		curr_files = get_files_struct(curr_task);
>   		if (!curr_files) {
>   			put_task_struct(curr_task);
> -			curr_tid = ++(info->tid);
> +			curr_tid = curr_tid + 1;
>   			info->fd = 0;
>   			goto again;
>   		}
>