Message-ID: <45df5090-870a-ac49-3442-24ed67266d0e@fb.com>
Date: Tue, 9 Mar 2021 10:36:06 -0800
From: Yonghong Song <yhs@...com>
To: Roman Gushchin <guro@...com>
CC: Jiri Olsa <jolsa@...hat.com>, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
YiFei Zhu <zhuyifei@...gle.com>,
Andrii Nakryiko <andriin@...com>, <netdev@...r.kernel.org>,
<bpf@...r.kernel.org>, Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...omium.org>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Yauheni Kaliuta <ykaliuta@...hat.com>,
Jiri Benc <jbenc@...hat.com>, Hangbin Liu <haliu@...hat.com>,
Stanislav Fomichev <sdf@...gle.com>
Subject: Re: [BUG] hitting bug when running spinlock test
On 3/9/21 10:21 AM, Roman Gushchin wrote:
> On Mon, Mar 08, 2021 at 09:44:08PM -0800, Yonghong Song wrote:
>>
>>
>> On 3/5/21 1:10 PM, Yonghong Song wrote:
>>>
>>>
>>> On 3/5/21 12:38 PM, Roman Gushchin wrote:
>>>> On Thu, Mar 04, 2021 at 08:03:33PM +0100, Jiri Olsa wrote:
>>>>> hi,
>>>>> I'm getting the attached BUG/crash when running selftests in parallel, like:
>>>>>
>>>>> while :; do ./test_progs -t spinlock; done
>>>>> while :; do ./test_progs ; done
>>>>>
>>>>> it's the latest bpf-next/master, I can send the .config if needed,
>>>>> but I don't think there's anything special about it, because I saw
>>>>> the bug on other servers with different generic configs
>>>>>
>>>>> it looks like it's related to cgroup local storage, for some reason
>>>>> the storage deref returns NULL
>>>>>
>>>>> I'm a bit lost in this code, so any help would be great ;-)
>>>>
>>>> Hi!
>>>>
>>>> I think the patch to blame is df1a2cb7c74b ("bpf/test_run: fix
>>>> unkillable BPF_PROG_TEST_RUN").
>>>
>>> Thanks, Roman. I did some experiments and found that the reason for the
>>> NULL storage deref is that a tracing program (most likely a kprobe) runs
>>> after bpf_cgroup_storage_set() is called but before the bpf program calls
>>> bpf_get_local_storage(). Note that the trace_call_bpf() macro
>>> BPF_PROG_RUN_ARRAY_CHECK does call bpf_cgroup_storage_set().
>>>
>>>> Prior to it, we were running the test program in the
>>>> preempt_disable() && rcu_read_lock()
>>>> section:
>>>>
>>>> preempt_disable();
>>>> rcu_read_lock();
>>>> bpf_cgroup_storage_set(storage);
>>>> ret = BPF_PROG_RUN(prog, ctx);
>>>> rcu_read_unlock();
>>>> preempt_enable();
>>>>
>>>> So, a percpu variable with a cgroup local storage pointer couldn't
>>>> go away.
>>>
>>> I think even with using preempt_disable(), if the bpf program calls a map
>>> lookup and there is a kprobe bpf program on the function
>>> htab_map_lookup_elem(), we will have the same issue, as
>>> BPF_PROG_RUN_ARRAY_CHECK will call bpf_cgroup_storage_set() too.
>>> I need to write a test case to confirm this though.
>>>
>>>>
>>>> After df1a2cb7c74b we can temporarily enable preemption, so nothing
>>>> prevents another program from calling into bpf_cgroup_storage_set()
>>>> on the same cpu. I guess that's exactly what happens here.
>>>
>>> It is. I confirmed.
>>
>> Actually, the failure is not due to breaking up the preempt_disable()
>> region. Even with the added cond_resched(), bpf_cgroup_storage_set() still
>> happens inside the preempt region, so that part is okay. What I confirmed
>> is that changing migrate_{disable/enable} to preempt_{disable/enable}
>> fixed the issue.
>
> Hm, how so? If preemption is enabled, another task/bpf program can start
> executing on the same cpu and set their cgroup storage. I guess it's harder
> to reproduce or it will result in the (bpf map) memory corruption instead
> of a panic, but I don't think it's safe.
The code has been refactored recently. The following is the code right
before refactoring to make it easy to understand:
	rcu_read_lock();
	migrate_disable();
	time_start = ktime_get_ns();
	for (i = 0; i < repeat; i++) {
		bpf_cgroup_storage_set(storage);

		if (xdp)
			*retval = bpf_prog_run_xdp(prog, ctx);
		else
			*retval = BPF_PROG_RUN(prog, ctx);

		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}

		if (need_resched()) {
			time_spent += ktime_get_ns() - time_start;
			migrate_enable();
			rcu_read_unlock();

			cond_resched();

			rcu_read_lock();
			migrate_disable();
			time_start = ktime_get_ns();
		}
	}
	time_spent += ktime_get_ns() - time_start;
	migrate_enable();
	rcu_read_unlock();
bpf_cgroup_storage_set() is called inside the migrate_disable/enable()
region. Previously it was called inside a preempt_disable/enable()
region, which was fine.
>
>>
>> So migrate_{disable/enable} is the issue, since any other process (and
>> its bpf program) can preempt the current process/bpf program and run.
>
> Oh, I didn't know about the preempt_{disable/enable}/migration_{disable/enable}
> change. It's definitely not safe from a cgroup local storage perspective.
>
>> Currently, for each program, we set the local storage before the
>> program runs, and each program may access multiple local storage
>> maps. So we might need something similar to sk_local_storage.
>> Considering the possibility of deeply nested migrate_{disable/enable},
>> the cgroup local storage has to be preserved in a prog/map data
>> structure and not as a percpu variable, since it would be very hard
>> to save and restore percpu variable contents when migration can
>> happen anywhere. I don't have a concrete design yet; just throwing
>> some ideas out here.
>
> Initially I thought about saving this pointer on stack, but then we need
> some sort of gcc/assembly magic to get this pointer from the stack outside
> of the current scope. At that time we didn't have sleepable programs,
> so the percpu approach looked simpler and more reliable. Maybe it's time
> to review it.
Indeed this is the time.
>
>>
>> BTW, I sent a patch to prevent tracing programs from messing with
>> cgroup local storage:
>> https://lore.kernel.org/bpf/20210309052638.400562-1-yhs@fb.com/T/#u
>> Now all programs accessing cgroup local storage should be in
>> process context, and we don't need to worry about kprobe-induced
>> percpu local storage access.
>
> Thank you! My only issue is that the commit log reads like an optimization
> (as if we're calling set_cgroup_storage() for no good reason), whereas if
> I understand it correctly, it prevents a class of real problems.
Yes, it prevents real problems as well. The reason I did not say so is
that the patch does not really fix the fundamental issue. But it does
prevent some issues. Let me reword the commit message.
>
> Thanks!
>
>>
>>>
>>>>
>>>> One option to fix it is to make bpf_cgroup_storage_set() return
>>>> the old value, save it in a local variable, and restore it after
>>>> the execution of the program.
>>>
>>> In this particular case, we are doing bpf_test_run, we explicitly
>>> allocate storage and call bpf_cgroup_storage_set() right before
>>> each BPF_PROG_RUN.
>>>
>>>> But I didn't closely follow the development of sleepable bpf
>>>> programs, so I could easily be missing something.
>>>
>>> Yes, sleepable bpf programs are another complication. I think we need a
>>> variable similar to bpf_prog_active, which should prevent nested bpf
>>> program execution for those bpf programs having a local_storage map.
>>> I will try to craft a patch to facilitate the discussion.
>>>
>> [...]