[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c6e4aa90-aa35-fa42-1196-a71c88994620@meta.com>
Date: Sun, 11 Jun 2023 17:01:58 -0700
From: Yonghong Song <yhs@...a.com>
To: Teng Qi <starmiku1207184332@...il.com>
Cc: ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com,
andrii@...nel.org, martin.lau@...ux.dev, song@...nel.org, yhs@...com,
kpsingh@...nel.org, sdf@...gle.com, haoluo@...gle.com,
jolsa@...nel.org, davem@...emloft.net, kuba@...nel.org,
hawk@...nel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in
__bpf_prog_put()
On 6/11/23 6:02 AM, Teng Qi wrote:
> Hello!
>> BTW, please do create a test case, e.g, sockmap test case which
>> can show the problem with existing code base.
>
> I add a printk in bpf_prog_put_deferred():
> static void bpf_prog_put_deferred(struct work_struct *work)
> {
> // . . .
> int inIrq = in_irq();
> int irqsDisabled = irqs_disabled();
> int preemptBits = preempt_count();
> int inAtomic = in_atomic();
> int rcuHeld = rcu_read_lock_held();
> printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count()
> %d, in_atomic() %d, rcu_read_lock_held() %d",
> inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
> // . . .
> }
>
> When running the selftest, I see the following output:
> [255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
> [255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0,
> preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
It would be great if you also print out in_interrupt() value, so we know
whether softirq or nmi is enabled or not.
We cannot really WARN with !rcu_read_lock_held() since the
__bpf_prog_put funciton is called in different contexts.
Also, note that rcu_read_lock_held() may not be reliable. rcu subsystem
will return 1 if not tracked or not sure about the result.
>
> Based on this output, I believe it is sufficient to construct a self-test case
> for bpf_prog_put_deferred() called under preempt disabled or rcu read lock
> region. However, I'm a bit confused about what I should do to build the
> self-test case. Are we looking to create a checker that verifies the
> context of bpf_prog_put_deferred() is valid? Or do we need a test case that
> can trigger this bug?
>
> Could you show me more ideas to construct a self test case? I am not familiar
> with it and have no idea.
Okay, I see. It seems hard to create a test case with warnings since
bpf_prog_put_deferred is called in different context. So some
examples for possible issues (through code analysis) should be good enough.
>
> -- Teng Qi
>
> On Thu, May 25, 2023 at 3:34 AM Yonghong Song <yhs@...a.com> wrote:
>>
>>
>>
>> On 5/24/23 5:42 AM, Teng Qi wrote:
>>> Thank you.
>>>
>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>> value rcu_read_lock_held() could be 1 for some configurations regardless
>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>
>>> Sorry. I was not aware of the dependency of configurations of
>>> rcu_read_lock_held().
>>>
>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>
>>> I agree that using !in_interrupt() as a condition is an acceptable solution.
>>
>> This should work although it could be conservative.
>>
>>>
>>>> Alternatively, we could have another solution. We could add another
>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>> will be done in rcu context.
>>>
>>> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
>>> more significant changes.
>>
>> Maybe we can change signature of bpf_prog_put instead? Like
>> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
>> and inside bpf_prog_put we can add
>> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>>
>>>
>>>> So if in_interrupt(), do kvfree, otherwise,
>>>> put into a workqueue.
>>>
>>> Shall we proceed with submitting a patch following this approach?
>>
>> You could choose either of the above although I think with newer
>> bpf_prog_put() is better.
>>
>> BTW, please do create a test case, e.g, sockmap test case which
>> can show the problem with existing code base.
>>
>>>
>>> I would like to mention something unrelated to the possible bug. At this
>>> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
>>> but not safe under other atomic contexts.
>>> This disorder challenges our conventional belief, a monotonic incrementation
>>> of limitations of the hierarchical atomic contexts, that programer needs
>>> to be more and more careful to write code under rcu read lock, spin lock,
>>> bh disable, interrupt...
>>> This disorder can lead to unexpected consequences, such as code being safe
>>> under interrupts but not safe under spin locks.
>>> The disorder makes kernel programming more complex and may result in more bugs.
>>> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
>>> I feel sad for undermining of kernel`s maintainability and disorder of
>>> hierarchy of atomic contexts.
>>>
>>> -- Teng Qi
>>>
>>> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <yhs@...a.com> wrote:
>>>>
>>>>
>>>>
>>>> On 5/21/23 6:39 AM, Teng Qi wrote:
>>>>> Thank you.
>>>>>
>>>>> > Your above analysis makes sense if indeed that kvfree cannot appear
>>>>> > inside a spin lock region or RCU read lock region. But is it true?
>>>>> > I checked a few code paths in kvfree/kfree. It is either guarded
>>>>> > with local_irq_save/restore or by
>>>>> > spin_lock_irqsave/spin_unlock_
>>>>> > irqrestore, etc. Did I miss
>>>>> > anything? Are you talking about RT kernel here?
>>>>>
>>>>> To see the sleepable possibility of kvfree, it is important to analyze the
>>>>> following calling stack:
>>>>> mm/util.c: 645 kvfree()
>>>>> mm/vmalloc.c: 2763 vfree()
>>>>>
>>>>> In kvfree(), to call vfree, if the pointer addr points to memory
>>>>> allocated by
>>>>> vmalloc(), it calls vfree().
>>>>> void kvfree(const void *addr)
>>>>> {
>>>>> if (is_vmalloc_addr(addr))
>>>>> vfree(addr);
>>>>> else
>>>>> kfree(addr);
>>>>> }
>>>>>
>>>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
>>>>> void vfree(const void *addr)
>>>>> {
>>>>> // ...
>>>>> if (unlikely(in_interrupt()))
>>>>> {
>>>>> vfree_atomic(addr);
>>>>> return;
>>>>> }
>>>>> // ...
>>>>> might_sleep();
>>>>> // ...
>>>>> }
>>>>
>>>> Sorry. I didn't check vfree path. So it does look like that
>>>> we need to pay special attention to non interrupt part.
>>>>
>>>>>
>>>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
>>>>> could have in_interrupt() == false and spin lock region which only disables
>>>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
>>>>> inside a spin lock region or RCU read lock region if the pointer addr points
>>>>> to memory allocated by vmalloc().
>>>>>
>>>>> > > Therefore, we propose modifying the condition to include
>>>>> > > in_atomic(). Could we
>>>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
>>>>> > > in_atomic()"?
>>>>> > Thank you! We look forward to your feedback.
>>>>>
>>>>> We now think that ‘irqs_disabled() || in_atomic() ||
>>>>> rcu_read_lock_held()’ is
>>>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
>>>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
>>>>
>>>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>>>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
>>>> whether rcu_read_lock() is really held or not. In most cases,
>>>> rcu_read_lock_held() is used in issuing potential warnings.
>>>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>>>
>>>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
>>>> since it covers process context local_irq_save() and spin_lock() cases.
>>>>
>>>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>>>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>>>
>>>> Alternatively, we could have another solution. We could add another
>>>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>>>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
>>>> put into a workqueue.
>>>>
>>>>
>>>>>
>>>>> -- Teng Qi
>>>>>
>>>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <yhs@...a.com
>>>>> <mailto:yhs@...a.com>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 5/19/23 7:18 AM, Teng Qi wrote:
>>>>> > Thank you for your response.
>>>>> > > Looks like you only have suspicion here. Could you find a real
>>>>> violation
>>>>> > > here where __bpf_prog_put() is called with !in_irq() &&
>>>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>> have not seen
>>>>> > > things like that.
>>>>> >
>>>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
>>>>> we have
>>>>> > been
>>>>> > unable to really trigger this atomic violation after trying to
>>>>> construct
>>>>> > test cases manually. But we found that it is possible to show
>>>>> cases with
>>>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
>>>>> > For example, even a failed case, one of selftest cases of bpf,
>>>>> netns_cookie,
>>>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
>>>>> > only inside rcu read lock: The possible call stack is:
>>>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
>>>>> > net/core/sock_map.c: 468 sock_map_update_common()
>>>>> > net/core/sock_map.c: 217 sock_map_link()
>>>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
>>>>> >
>>>>> > The files about netns_cookie include
>>>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
>>>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
>>>>> inserted the
>>>>> > following code in
>>>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
>>>>> > static int sock_map_update_common(..)
>>>>> > {
>>>>> > int inIrq = in_irq();
>>>>> > int irqsDisabled = irqs_disabled();
>>>>> > int preemptBits = preempt_count();
>>>>> > int inAtomic = in_atomic();
>>>>> > int rcuHeld = rcu_read_lock_held();
>>>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
>>>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
>>>>> irqsDisabled,
>>>>> > preemptBits, inAtomic, rcuHeld);
>>>>> > }
>>>>> >
>>>>> > The output message is as follows:
>>>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
>>>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
>>>>> > in_atomic() 0,
>>>>> > rcu_read_lock_held() 1
>>>>> > #113 netns_cookie:OK
>>>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>>>> >
>>>>> > We notice that there are numerous callers in kernel/, net/ and
>>>>> drivers/,
>>>>> > so we
>>>>> > highly suggest modifying __bpf_prog_put() to address this gap.
>>>>> The gap
>>>>> > exists
>>>>> > because __bpf_prog_put() is only safe under in_irq() ||
>>>>> irqs_disabled()
>>>>> > but not in_atomic() || rcu_read_lock_held(). The following code
>>>>> snippet may
>>>>> > mislead developers into thinking that bpf_prog_put() is safe in all
>>>>> > contexts.
>>>>> > if (in_irq() || irqs_disabled()) {
>>>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>> > schedule_work(&aux->work);
>>>>> > } else {
>>>>> > bpf_prog_put_deferred(&aux->work);
>>>>> > }
>>>>> >
>>>>> > Implicit dependency may lead to issues.
>>>>> >
>>>>> > > Any problem here?
>>>>> > We mentioned it to demonstrate the possibility of kvfree() being
>>>>> > called by __bpf_prog_put_noref().
>>>>> >
>>>>> > Thanks.
>>>>> > -- Teng Qi
>>>>> >
>>>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <yhs@...a.com
>>>>> <mailto:yhs@...a.com>
>>>>> > <mailto:yhs@...a.com <mailto:yhs@...a.com>>> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> > On 5/16/23 4:18 AM, starmiku1207184332@...il.com
>>>>> <mailto:starmiku1207184332@...il.com>
>>>>> > <mailto:starmiku1207184332@...il.com
>>>>> <mailto:starmiku1207184332@...il.com>> wrote:
>>>>> > > From: Teng Qi <starmiku1207184332@...il.com
>>>>> <mailto:starmiku1207184332@...il.com>
>>>>> > <mailto:starmiku1207184332@...il.com
>>>>> <mailto:starmiku1207184332@...il.com>>>
>>>>> > >
>>>>> > > Hi, bpf developers,
>>>>> > >
>>>>> > > We are developing a static tool to check the matching between
>>>>> > helpers and the
>>>>> > > context of hooks. During our analysis, we have discovered some
>>>>> > important
>>>>> > > findings that we would like to report.
>>>>> > >
>>>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
>>>>> function
>>>>> > > bpf_prog_put_deferred() won`t be called in the condition of
>>>>> > > ‘in_irq() || irqs_disabled()’.
>>>>> > > if (in_irq() || irqs_disabled()) {
>>>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>>>> > > schedule_work(&aux->work);
>>>>> > > } else {
>>>>> > >
>>>>> > > bpf_prog_put_deferred(&aux->work);
>>>>> > > }
>>>>> > >
>>>>> > > We suspect this condition exists because there might be
>>>>> sleepable
>>>>> > operations
>>>>> > > in the callees of the bpf_prog_put_deferred() function:
>>>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
>>>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
>>>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
>>>>> > > kvfree(prog->aux->jited_linfo);
>>>>> > > kvfree(prog->aux->linfo);
>>>>> >
>>>>> > Looks like you only have suspicion here. Could you find a real
>>>>> > violation
>>>>> > here where __bpf_prog_put() is called with !in_irq() &&
>>>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>>>> have not seen
>>>>> > things like that.
>>>>> >
>>>>> > >
>>>>> > > Additionally, we found that array prog->aux->jited_linfo is
>>>>> > initialized in
>>>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
>>>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
>>>>> > > sizeof(*prog->aux->jited_linfo),
>>>>> bpf_memcg_flags(GFP_KERNEL |
>>>>> > __GFP_NOWARN));
>>>>> >
>>>>> > Any problem here?
>>>>> >
>>>>> > >
>>>>> > > Our question is whether the condition 'in_irq() ||
>>>>> > irqs_disabled() == false' is
>>>>> > > sufficient for calling 'kvfree'. We are aware that calling
>>>>> > 'kvfree' within the
>>>>> > > context of a spin lock or an RCU lock is unsafe.
>>>>>
>>>>> Your above analysis makes sense if indeed that kvfree cannot appear
>>>>> inside a spin lock region or RCU read lock region. But is it true?
>>>>> I checked a few code paths in kvfree/kfree. It is either guarded
>>>>> with local_irq_save/restore or by
>>>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
>>>>> anything? Are you talking about RT kernel here?
>>>>>
>>>>>
>>>>> > >
>>>>> > > Therefore, we propose modifying the condition to include
>>>>> > in_atomic(). Could we
>>>>> > > update the condition as follows: "in_irq() ||
>>>>> irqs_disabled() ||
>>>>> > in_atomic()"?
>>>>> > >
>>>>> > > Thank you! We look forward to your feedback.
>>>>> > >
>>>>> > > Signed-off-by: Teng Qi <starmiku1207184332@...il.com
>>>>> <mailto:starmiku1207184332@...il.com>
>>>>> > <mailto:starmiku1207184332@...il.com
>>>>> <mailto:starmiku1207184332@...il.com>>>
>>>>> >
>>>>>
Powered by blists - more mailing lists