Message-ID: <YEfLnk3ipJ7z7bMw@carbon.dhcp.thefacebook.com>
Date: Tue, 9 Mar 2021 11:25:18 -0800
From: Roman Gushchin <guro@...com>
To: Yonghong Song <yhs@...com>
CC: Jiri Olsa <jolsa@...hat.com>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
YiFei Zhu <zhuyifei@...gle.com>,
Andrii Nakryiko <andriin@...com>, <netdev@...r.kernel.org>,
<bpf@...r.kernel.org>, Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...omium.org>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Yauheni Kaliuta <ykaliuta@...hat.com>,
Jiri Benc <jbenc@...hat.com>, Hangbin Liu <haliu@...hat.com>,
Stanislav Fomichev <sdf@...gle.com>
Subject: Re: [BUG] hitting bug when running spinlock test
On Tue, Mar 09, 2021 at 10:36:06AM -0800, Yonghong Song wrote:
>
>
> On 3/9/21 10:21 AM, Roman Gushchin wrote:
> > On Mon, Mar 08, 2021 at 09:44:08PM -0800, Yonghong Song wrote:
> > >
> > >
> > > On 3/5/21 1:10 PM, Yonghong Song wrote:
> > > >
> > > >
> > > > On 3/5/21 12:38 PM, Roman Gushchin wrote:
> > > > > On Thu, Mar 04, 2021 at 08:03:33PM +0100, Jiri Olsa wrote:
> > > > > > hi,
> > > > > > I'm getting the attached BUG/crash when running selftests in parallel, like:
> > > > > >
> > > > > > while :; do ./test_progs -t spinlock; done
> > > > > > while :; do ./test_progs ; done
> > > > > >
> > > > > > it's the latest bpf-next/master, I can send the .config if needed,
> > > > > > but I don't think there's anything special about it, because I saw
> > > > > > the bug on other servers with different generic configs
> > > > > >
> > > > > > it looks like it's related to cgroup local storage, for some reason
> > > > > > the storage deref returns NULL
> > > > > >
> > > > > > I'm a bit lost in this code, so any help would be great ;-)
> > > > >
> > > > > Hi!
> > > > >
> > > > > I think the patch to blame is df1a2cb7c74b ("bpf/test_run: fix
> > > > > unkillable BPF_PROG_TEST_RUN").
> > > >
> > > > Thanks, Roman, I did some experiments and found that the reason for the
> > > > NULL storage deref is that a tracing program (most likely a kprobe) is
> > > > run after bpf_cgroup_storage_set() is called but before the bpf program
> > > > calls bpf_get_local_storage(). Note that trace_call_bpf()'s
> > > > BPF_PROG_RUN_ARRAY_CHECK macro does call bpf_cgroup_storage_set().
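> > > > For context, the percpu slot involved looks roughly like this (simplified
> > > > from include/linux/bpf-cgroup.h and the bpf_get_local_storage() helper,
> > > > from memory, so the details may be slightly off):
> > > >
> > > > 	/* one storage pointer per storage type, per cpu */
> > > > 	DECLARE_PER_CPU(struct bpf_cgroup_storage *,
> > > > 			bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]);
> > > >
> > > > 	static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage
> > > > 			*storage[MAX_BPF_CGROUP_STORAGE_TYPE])
> > > > 	{
> > > > 		enum bpf_cgroup_storage_type stype;
> > > >
> > > > 		/* overwrites whatever the previous program dispatch set */
> > > > 		for_each_cgroup_storage_type(stype)
> > > > 			this_cpu_write(bpf_cgroup_storage[stype], storage[stype]);
> > > > 	}
> > > >
> > > > 	BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
> > > > 	{
> > > > 		enum bpf_cgroup_storage_type stype = cgroup_storage_type(map);
> > > > 		struct bpf_cgroup_storage *storage;
> > > >
> > > > 		/* NULL here if something else ran _set() in between */
> > > > 		storage = this_cpu_read(bpf_cgroup_storage[stype]);
> > > > 		/* ... then dereferences storage, hence the NULL deref */
> > > > 	}
> > > >
> > > > So any program dispatch that calls bpf_cgroup_storage_set() between our
> > > > own _set() and the helper call clobbers the pointer the helper reads.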
> > > >
> > > > > Prior to it, we were running the test program in the
> > > > > preempt_disable() && rcu_read_lock()
> > > > > section:
> > > > >
> > > > > preempt_disable();
> > > > > rcu_read_lock();
> > > > > bpf_cgroup_storage_set(storage);
> > > > > ret = BPF_PROG_RUN(prog, ctx);
> > > > > rcu_read_unlock();
> > > > > preempt_enable();
> > > > >
> > > > > So, a percpu variable with a cgroup local storage pointer couldn't
> > > > > go away.
> > > >
> > > > I think even with using preempt_disable(), if the bpf program calls map
> > > > lookup and there is a kprobe bpf program on the function
> > > > htab_map_lookup_elem(), we will have the same issue, as
> > > > BPF_PROG_RUN_ARRAY_CHECK will call bpf_cgroup_storage_set() too. I need
> > > > to write a test case to confirm this though.
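> > > > (The dispatch loop I mean is roughly the following, again quoting from
> > > > memory rather than the exact macro, so treat it as a sketch:)
> > > >
> > > > 	while ((_prog = READ_ONCE(_item->prog))) {
> > > > 		/* tracing progs have no cgroup storage: this sets NULL */
> > > > 		bpf_cgroup_storage_set(_item->cgroup_storage);
> > > > 		_ret &= func(_prog, ctx);
> > > > 		_item++;
> > > > 	}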
> > > >
> > > > >
> > > > > After df1a2cb7c74b we can temporarily enable preemption, so nothing
> > > > > prevents another program from calling bpf_cgroup_storage_set() on the
> > > > > same cpu. I guess that's exactly what happens here.
> > > >
> > > > It is. I confirmed.
> > >
> > > Actually, the failure is not due to breaking up the preempt_disable()
> > > region. Even with the added cond_resched(), bpf_cgroup_storage_set()
> > > still happens inside the disabled region, so that part is okay. What I
> > > confirmed is that changing migrate_{disable,enable}() to
> > > preempt_{disable,enable}() fixed the issue.
> >
> > Hm, how so? If preemption is enabled, another task/bpf program can start
> > executing on the same cpu and set its own cgroup storage. I guess it's harder
> > to reproduce, or it will result in (bpf map) memory corruption instead
> > of a panic, but I don't think it's safe.
>
> The code has been refactored recently. The following is the code right
> before refactoring to make it easy to understand:
>
> 	rcu_read_lock();
> 	migrate_disable();
> 	time_start = ktime_get_ns();
> 	for (i = 0; i < repeat; i++) {
> 		bpf_cgroup_storage_set(storage);
>
> 		if (xdp)
> 			*retval = bpf_prog_run_xdp(prog, ctx);
> 		else
> 			*retval = BPF_PROG_RUN(prog, ctx);
>
> 		if (signal_pending(current)) {
> 			ret = -EINTR;
> 			break;
> 		}
>
> 		if (need_resched()) {
> 			time_spent += ktime_get_ns() - time_start;
> 			migrate_enable();
> 			rcu_read_unlock();
>
> 			cond_resched();
>
> 			rcu_read_lock();
> 			migrate_disable();
> 			time_start = ktime_get_ns();
> 		}
> 	}
> 	time_spent += ktime_get_ns() - time_start;
> 	migrate_enable();
> 	rcu_read_unlock();
>
> bpf_cgroup_storage_set() is called inside migrate_disable()/enable().
> Previously it was called inside preempt_disable()/enable(), so it should be
> fine.
Ah, gotcha, thank you for the explanation!
>
> >
> > >
> > > So migrate_{disable,enable}() is the issue, since any other process (and
> > > its bpf program) can preempt the current process/bpf program and run.
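> > > Schematically, on one cpu:
> > >
> > > 	task A: bpf_cgroup_storage_set(storage_A);
> > > 	task A: BPF_PROG_RUN(prog_A, ctx) starts ...
> > > 	        /* preemption is still enabled under migrate_disable(),
> > > 	         * so task B can run on this cpu and do its own
> > > 	         * bpf_cgroup_storage_set(storage_B) (or NULL)
> > > 	         */
> > > 	task A: ... prog_A calls bpf_get_local_storage() and reads
> > > 	        storage_B/NULL instead of storage_A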
> >
> > Oh, I didn't know about the preempt_{disable,enable}() ->
> > migrate_{disable,enable}() change. It's definitely not safe from a cgroup
> > local storage perspective.
> >
> > > Currently for each program, we set the local storage before the
> > > program runs, and each program may access multiple local storage
> > > maps. So we might need something similar to sk_local_storage.
> > > Considering the possibility of deeply nested migrate_{disable,enable}(),
> > > the cgroup local storage has to be preserved in a prog/map data
> > > structure and not as a percpu variable, as it will be very hard
> > > to save and restore the percpu variable content since migration can
> > > happen anywhere. I don't have a concrete design yet, just throwing
> > > some ideas out here.
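> > > Just as a strawman (made-up names, not a proposal): keep the pointers
> > > with the running prog instead of in the global percpu slot, e.g.
> > >
> > > 	/* hypothetical */
> > > 	struct bpf_cgroup_storage_ctx {
> > > 		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE];
> > > 	};
> > >
> > > and have the caller record it somewhere bpf_get_local_storage() can
> > > reach, even across migrations and nesting.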
> >
> > Initially I thought about saving this pointer on the stack, but then we'd
> > need some sort of gcc/assembly magic to get this pointer from the stack
> > outside of the current scope. At that time we didn't have sleepable
> > programs, so the percpu approach looked simpler and more reliable. Maybe
> > it's time to revisit it.
>
> Indeed this is the time.
>
> >
> > >
> > > BTW, I sent a patch to prevent tracing programs from messing
> > > with cgroup local storage:
> > > https://lore.kernel.org/bpf/20210309052638.400562-1-yhs@fb.com/T/#u
> > > With it, all programs that access cgroup local storage should be in
> > > process context, and we don't need to worry about kprobe-induced
> > > percpu local storage access.
> >
> > Thank you! My only issue is that the commit log reads like an optimization
> > (as if we were calling bpf_cgroup_storage_set() for no good reason), whereas,
> > if I understand it correctly, it prevents a whole class of problems.
>
> Yes, it prevents real problems as well. The reason I did not say so is that
> the patch does not really fix the fundamental issue, but it does prevent some
> issues. Let me reword the commit message.
Thank you!