Date:   Tue, 9 Mar 2021 10:36:06 -0800
From:   Yonghong Song <yhs@...com>
To:     Roman Gushchin <guro@...com>
CC:     Jiri Olsa <jolsa@...hat.com>, Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        YiFei Zhu <zhuyifei@...gle.com>,
        Andrii Nakryiko <andriin@...com>, <netdev@...r.kernel.org>,
        <bpf@...r.kernel.org>, Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...omium.org>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        Yauheni Kaliuta <ykaliuta@...hat.com>,
        Jiri Benc <jbenc@...hat.com>, Hangbin Liu <haliu@...hat.com>,
        Stanislav Fomichev <sdf@...gle.com>
Subject: Re: [BUG] hitting bug when running spinlock test



On 3/9/21 10:21 AM, Roman Gushchin wrote:
> On Mon, Mar 08, 2021 at 09:44:08PM -0800, Yonghong Song wrote:
>>
>>
>> On 3/5/21 1:10 PM, Yonghong Song wrote:
>>>
>>>
>>> On 3/5/21 12:38 PM, Roman Gushchin wrote:
>>>> On Thu, Mar 04, 2021 at 08:03:33PM +0100, Jiri Olsa wrote:
>>>>> hi,
>>>>> I'm getting the attached BUG/crash when running selftests in parallel, like:
>>>>>
>>>>>     while :; do ./test_progs -t spinlock; done
>>>>>     while :; do ./test_progs ; done
>>>>>
>>>>> it's the latest bpf-next/master, I can send the .config if needed,
>>>>> but I don't think there's anything special about it, because I saw
>>>>> the bug on other servers with different generic configs
>>>>>
>>>>> it looks like it's related to cgroup local storage, for some reason
>>>>> the storage deref returns NULL
>>>>>
>>>>> I'm a bit lost in this code, so any help would be great ;-)
>>>>
>>>> Hi!
>>>>
>>>> I think the patch to blame is df1a2cb7c74b ("bpf/test_run: fix
>>>> unkillable BPF_PROG_TEST_RUN").
>>>
>>> Thanks, Roman. I did some experiments and found that the reason for the
>>> NULL storage deref is that a tracing program (most likely a kprobe) runs
>>> after bpf_cgroup_storage_set() is called but before the bpf program calls
>>> bpf_get_local_storage(). Note that trace_call_bpf()'s macro
>>> BPF_PROG_RUN_ARRAY_CHECK does call bpf_cgroup_storage_set().
>>>
>>>> Prior to it, we were running the test program in the
>>>> preempt_disable() && rcu_read_lock()
>>>> section:
>>>>
>>>> preempt_disable();
>>>> rcu_read_lock();
>>>> bpf_cgroup_storage_set(storage);
>>>> ret = BPF_PROG_RUN(prog, ctx);
>>>> rcu_read_unlock();
>>>> preempt_enable();
>>>>
>>>> So, a percpu variable with a cgroup local storage pointer couldn't
>>>> go away.
>>>
>>> I think even with preempt_disable(), if the bpf program calls a map
>>> lookup and there is a kprobe bpf program on htab_map_lookup_elem(), we
>>> will have the same issue, as BPF_PROG_RUN_ARRAY_CHECK will call
>>> bpf_cgroup_storage_set() too. I need to write a test case to confirm
>>> this though.
>>>
>>>>
>>>> After df1a2cb7c74b we can temporarily enable preemption, so nothing
>>>> prevents another program from calling bpf_cgroup_storage_set() on the
>>>> same cpu. I guess that's exactly what happens here.
>>>
>>> It is. I confirmed.
>>
>> Actually, the failure is not due to breaking up preempt_disable(). Even with
>> adding cond_resched(), bpf_cgroup_storage_set() still happens
>> inside the preempt region. So it is okay. What I confirmed is that
>> changing migration_{disable/enable} to preempt_{disable/enable} fixed
>> the issue.
> 
> Hm, how so? If preemption is enabled, another task/bpf program can start
> executing on the same cpu and set their cgroup storage. I guess it's harder
> to reproduce or it will result in the (bpf map) memory corruption instead
> of a panic, but I don't think it's safe.

The code has been refactored recently. The following is the code as it 
was right before the refactoring, to make it easier to understand:

         rcu_read_lock();
         migrate_disable();
         time_start = ktime_get_ns();
         for (i = 0; i < repeat; i++) {
                 bpf_cgroup_storage_set(storage);

                 if (xdp)
                         *retval = bpf_prog_run_xdp(prog, ctx);
                 else
                         *retval = BPF_PROG_RUN(prog, ctx);

                 if (signal_pending(current)) {
                         ret = -EINTR;
                         break;
                 }

                 if (need_resched()) {
                         time_spent += ktime_get_ns() - time_start;
                         migrate_enable();
                         rcu_read_unlock();

                         cond_resched();

                         rcu_read_lock();
                         migrate_disable();
                         time_start = ktime_get_ns();
                 }
         }
         time_spent += ktime_get_ns() - time_start;
         migrate_enable();
         rcu_read_unlock();

bpf_cgroup_storage_set() is called inside migrate_disable/enable().
Previously it was called inside preempt_disable/enable(), which was 
fine.
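To make the race concrete, here is a minimal userspace sketch (not kernel 
code; percpu_cgroup_storage, storage_set(), storage_get() and 
nested_prog_run() are illustrative stand-ins for the kernel's percpu 
bpf_cgroup_storage handling). Once another program can run on the same cpu 
between the set and the get, its own set clobbers the outer program's 
storage pointer:

```c
#include <assert.h>
#include <stddef.h>

/* stands in for the kernel's percpu cgroup storage slot */
static void *percpu_cgroup_storage;

static void storage_set(void *s) { percpu_cgroup_storage = s; }
static void *storage_get(void)   { return percpu_cgroup_storage; }

/* a "tracing program" (e.g. a kprobe prog) that runs in the middle
 * and sets its own storage without restoring the old value */
static void nested_prog_run(void)
{
	static int trace_storage;

	storage_set(&trace_storage);
	/* ...nested program finishes; old pointer is lost... */
}

/* returns 1 when the outer program observes the wrong storage */
static int demo_race_clobbers_storage(void)
{
	static int test_storage;

	storage_set(&test_storage);   /* bpf_test_run sets storage */
	nested_prog_run();            /* nested run fires here */
	/* the outer program now reads the nested program's storage */
	return storage_get() != (void *)&test_storage;
}
```

With preempt_disable() around the whole run loop, nested_prog_run() could 
not be scheduled in between; with only migrate_disable(), it can.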

> 
>>
>> So migrate_{disable/enable} is the issue, since any other process (and
>> its bpf program) can preempt the current process/bpf program and run.
> 
> Oh, I didn't know about the preempt_{disable/enable}/migration_{disable/enable}
> change. It's definitely not safe from a cgroup local storage perspective.
> 
>> Currently, for each program, we set the local storage before the
>> program runs, and each program may access multiple local storage
>> maps. So we might need something similar to sk_local_storage.
>> Considering the possibility of deeply nested migrate_{disable/enable},
>> the cgroup local storage has to be preserved in the prog/map data
>> structure and not as a percpu variable, as it will be very hard
>> to save and restore percpu variable content when migration can
>> happen anywhere. I don't have a concrete design yet; just throwing
>> some ideas out here.
> 
> Initially I thought about saving this pointer on stack, but then we need
> some sort of gcc/assembly magic to get this pointer from the stack outside
> of the current scope. At that time we didn't have sleepable programs,
> so the percpu approach looked simpler and more reliable. Maybe it's time
> to review it.

Indeed this is the time.
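Extending the userspace sketch above, the save/restore idea could look 
roughly like this (hypothetical names; the current kernel API does not 
work this way): the setter returns the previous percpu value so a nested 
run can restore it on exit:

```c
#include <assert.h>
#include <stddef.h>

static void *percpu_cgroup_storage;

/* set new storage, returning the old value for the caller to restore */
static void *storage_set_save(void *s)
{
	void *old = percpu_cgroup_storage;

	percpu_cgroup_storage = s;
	return old;
}

static void *storage_get(void) { return percpu_cgroup_storage; }

/* a nested (e.g. tracing) program run that no longer clobbers */
static void nested_prog_run(void)
{
	static int trace_storage;
	void *saved = storage_set_save(&trace_storage);

	/* ...nested program body would run here... */
	storage_set_save(saved);      /* restore on exit */
}

/* returns 1 when the outer program still sees its own storage */
static int demo_save_restore_preserves_storage(void)
{
	static int test_storage;

	storage_set_save(&test_storage);
	nested_prog_run();
	return storage_get() == (void *)&test_storage;
}
```

The restore would have to happen on every exit path of the nested run, 
which is why keeping the pointer in the prog/map data structure may end 
up simpler than threading it through percpu state.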

> 
>>
>> BTW, I send a patch to prevent tracing programs to mess up
>> with cgroup local storage:
>>     https://lore.kernel.org/bpf/20210309052638.400562-1-yhs@fb.com/T/#u
>> Now all programs accessing cgroup local storage should be in
>> process context, and we don't need to worry about kprobe-induced
>> percpu local storage access.
> 
> Thank you! My only issue is that the commit log looks like an optimization
> (like we're calling for set_cgroup_storage() for no good reason), where if
> I understand it correctly, it prevents some class of problems.

Yes, it prevents real problems as well. The reason I did not say so is 
that the patch does not really fix the fundamental issue, but it does
prevent some issues. Let me reword the commit message.

> 
> Thanks!
> 
>>
>>>
>>>>
>>>> One option to fix it is to make bpf_cgroup_storage_set() return the
>>>> old value, save it in a local variable, and restore it after the
>>>> execution of the program.
>>>
>>> In this particular case, we are doing bpf_test_run, we explicitly
>>> allocate storage and call bpf_cgroup_storage_set() right before
>>> each BPF_PROG_RUN.
>>>
>>>> But I didn't follow closely the development of sleepable bpf
>>>> programs, so I could
>>>> easily miss something.
>>>
>>> Yes, sleepable bpf programs are another complication. I think we need a
>>> variable similar to bpf_prog_active, which should prevent nested bpf
>>> program execution for those bpf programs having local_storage maps.
>>> Will try to craft a patch to facilitate the discussion.
>>>
>> [...]
