[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <113dc8c1-0840-9ee3-2840-28246731604c@meta.com>
Date: Wed, 24 May 2023 12:34:31 -0700
From: Yonghong Song <yhs@...a.com>
To: Teng Qi <starmiku1207184332@...il.com>
Cc: ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com,
andrii@...nel.org, martin.lau@...ux.dev, song@...nel.org,
yhs@...com, kpsingh@...nel.org, sdf@...gle.com, haoluo@...gle.com,
jolsa@...nel.org, davem@...emloft.net, kuba@...nel.org,
hawk@...nel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in
__bpf_prog_put()
On 5/24/23 5:42 AM, Teng Qi wrote:
> Thank you.
>
>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>> value rcu_read_lock_held() could be 1 for some configurations regardless
>> whether rcu_read_lock() is really held or not. In most cases,
>> rcu_read_lock_held() is used in issuing potential warnings.
>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>
> Sorry. I was not aware of the dependency of configurations of
> rcu_read_lock_held().
>
>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>> can be !in_interrupt(), so any process-context will go to a workqueue.
>
> I agree that using !in_interrupt() as a condition is an acceptable solution.
This should work although it could be conservative.
>
>> Alternatively, we could have another solution. We could add another
>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>> will be done in rcu context.
>
> Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> more significant changes.
Maybe we can change signature of bpf_prog_put instead? Like
void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
and inside bpf_prog_put we can add
WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>
>> So if in_interrupt(), do kvfree, otherwise,
>> put into a workqueue.
>
> Shall we proceed with submitting a patch following this approach?
You could choose either of the above although I think with newer
bpf_prog_put() is better.
BTW, please do create a test case, e.g, sockmap test case which
can show the problem with existing code base.
>
> I would like to mention something unrelated to the possible bug. At this
> moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
> but not safe under other atomic contexts.
> This disorder challenges our conventional belief, a monotonic incrementation
> of limitations of the hierarchical atomic contexts, that programer needs
> to be more and more careful to write code under rcu read lock, spin lock,
> bh disable, interrupt...
> This disorder can lead to unexpected consequences, such as code being safe
> under interrupts but not safe under spin locks.
> The disorder makes kernel programming more complex and may result in more bugs.
> Even though we find a way to resolve the possible bug about the bpf_prog_put(),
> I feel sad for undermining of kernel`s maintainability and disorder of
> hierarchy of atomic contexts.
>
> -- Teng Qi
>
> On Tue, May 23, 2023 at 12:33 PM Yonghong Song <yhs@...a.com> wrote:
>>
>>
>>
>> On 5/21/23 6:39 AM, Teng Qi wrote:
>>> Thank you.
>>>
>>> > Your above analysis makes sense if indeed that kvfree cannot appear
>>> > inside a spin lock region or RCU read lock region. But is it true?
>>> > I checked a few code paths in kvfree/kfree. It is either guarded
>>> > with local_irq_save/restore or by
>>> > spin_lock_irqsave/spin_unlock_
>>> > irqrestore, etc. Did I miss
>>> > anything? Are you talking about RT kernel here?
>>>
>>> To see the sleepable possibility of kvfree, it is important to analyze the
>>> following calling stack:
>>> mm/util.c: 645 kvfree()
>>> mm/vmalloc.c: 2763 vfree()
>>>
>>> In kvfree(), to call vfree, if the pointer addr points to memory
>>> allocated by
>>> vmalloc(), it calls vfree().
>>> void kvfree(const void *addr)
>>> {
>>> if (is_vmalloc_addr(addr))
>>> vfree(addr);
>>> else
>>> kfree(addr);
>>> }
>>>
>>> In vfree(), in_interrupt() and might_sleep() need to be considered.
>>> void vfree(const void *addr)
>>> {
>>> // ...
>>> if (unlikely(in_interrupt()))
>>> {
>>> vfree_atomic(addr);
>>> return;
>>> }
>>> // ...
>>> might_sleep();
>>> // ...
>>> }
>>
>> Sorry. I didn't check vfree path. So it does look like that
>> we need to pay special attention to non interrupt part.
>>
>>>
>>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
>>> could have in_interrupt() == false and spin lock region which only disables
>>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
>>> inside a spin lock region or RCU read lock region if the pointer addr points
>>> to memory allocated by vmalloc().
>>>
>>> > > Therefore, we propose modifying the condition to include
>>> > > in_atomic(). Could we
>>> > > update the condition as follows: "in_irq() || irqs_disabled() ||
>>> > > in_atomic()"?
>>> > Thank you! We look forward to your feedback.
>>>
>>> We now think that ‘irqs_disabled() || in_atomic() ||
>>> rcu_read_lock_held()’ is
>>> more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
>>> preempt count and rcu_read_lock_held() is for RCU read lock region.
>>
>> We cannot use rcu_read_lock_held() in the 'if' statement. The return
>> value rcu_read_lock_held() could be 1 for some configuraitons regardless
>> whether rcu_read_lock() is really held or not. In most cases,
>> rcu_read_lock_held() is used in issuing potential warnings.
>> Maybe there are other ways to record whether rcu_read_lock() is held or not?
>>
>> I agree with your that 'irqs_disabled() || in_atomic()' makes sense
>> since it covers process context local_irq_save() and spin_lock() cases.
>>
>> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
>> can be !in_interrupt(), so any process-context will go to a workqueue.
>>
>> Alternatively, we could have another solution. We could add another
>> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
>> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
>> put into a workqueue.
>>
>>
>>>
>>> -- Teng Qi
>>>
>>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <yhs@...a.com
>>> <mailto:yhs@...a.com>> wrote:
>>>
>>>
>>>
>>> On 5/19/23 7:18 AM, Teng Qi wrote:
>>> > Thank you for your response.
>>> > > Looks like you only have suspicion here. Could you find a real
>>> violation
>>> > > here where __bpf_prog_put() is called with !in_irq() &&
>>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>> have not seen
>>> > > things like that.
>>> >
>>> > For the complex conditions to call bpf_prog_put() with 1 refcnt,
>>> we have
>>> > been
>>> > unable to really trigger this atomic violation after trying to
>>> construct
>>> > test cases manually. But we found that it is possible to show
>>> cases with
>>> > !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
>>> > For example, even a failed case, one of selftest cases of bpf,
>>> netns_cookie,
>>> > calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
>>> > only inside rcu read lock: The possible call stack is:
>>> > net/core/sock_map.c: 615 bpf_sock_map_update()
>>> > net/core/sock_map.c: 468 sock_map_update_common()
>>> > net/core/sock_map.c: 217 sock_map_link()
>>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
>>> >
>>> > The files about netns_cookie include
>>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
>>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
>>> inserted the
>>> > following code in
>>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
>>> > static int sock_map_update_common(..)
>>> > {
>>> > int inIrq = in_irq();
>>> > int irqsDisabled = irqs_disabled();
>>> > int preemptBits = preempt_count();
>>> > int inAtomic = in_atomic();
>>> > int rcuHeld = rcu_read_lock_held();
>>> > printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
>>> > in_atomic() %d, rcu_read_lock_held() %d", inIrq,
>>> irqsDisabled,
>>> > preemptBits, inAtomic, rcuHeld);
>>> > }
>>> >
>>> > The output message is as follows:
>>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
>>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
>>> > in_atomic() 0,
>>> > rcu_read_lock_held() 1
>>> > #113 netns_cookie:OK
>>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>> >
>>> > We notice that there are numerous callers in kernel/, net/ and
>>> drivers/,
>>> > so we
>>> > highly suggest modifying __bpf_prog_put() to address this gap.
>>> The gap
>>> > exists
>>> > because __bpf_prog_put() is only safe under in_irq() ||
>>> irqs_disabled()
>>> > but not in_atomic() || rcu_read_lock_held(). The following code
>>> snippet may
>>> > mislead developers into thinking that bpf_prog_put() is safe in all
>>> > contexts.
>>> > if (in_irq() || irqs_disabled()) {
>>> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>> > schedule_work(&aux->work);
>>> > } else {
>>> > bpf_prog_put_deferred(&aux->work);
>>> > }
>>> >
>>> > Implicit dependency may lead to issues.
>>> >
>>> > > Any problem here?
>>> > We mentioned it to demonstrate the possibility of kvfree() being
>>> > called by __bpf_prog_put_noref().
>>> >
>>> > Thanks.
>>> > -- Teng Qi
>>> >
>>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <yhs@...a.com
>>> <mailto:yhs@...a.com>
>>> > <mailto:yhs@...a.com <mailto:yhs@...a.com>>> wrote:
>>> >
>>> >
>>> >
>>> > On 5/16/23 4:18 AM, starmiku1207184332@...il.com
>>> <mailto:starmiku1207184332@...il.com>
>>> > <mailto:starmiku1207184332@...il.com
>>> <mailto:starmiku1207184332@...il.com>> wrote:
>>> > > From: Teng Qi <starmiku1207184332@...il.com
>>> <mailto:starmiku1207184332@...il.com>
>>> > <mailto:starmiku1207184332@...il.com
>>> <mailto:starmiku1207184332@...il.com>>>
>>> > >
>>> > > Hi, bpf developers,
>>> > >
>>> > > We are developing a static tool to check the matching between
>>> > helpers and the
>>> > > context of hooks. During our analysis, we have discovered some
>>> > important
>>> > > findings that we would like to report.
>>> > >
>>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
>>> function
>>> > > bpf_prog_put_deferred() won`t be called in the condition of
>>> > > ‘in_irq() || irqs_disabled()’.
>>> > > if (in_irq() || irqs_disabled()) {
>>> > > INIT_WORK(&aux->work, bpf_prog_put_deferred);
>>> > > schedule_work(&aux->work);
>>> > > } else {
>>> > >
>>> > > bpf_prog_put_deferred(&aux->work);
>>> > > }
>>> > >
>>> > > We suspect this condition exists because there might be
>>> sleepable
>>> > operations
>>> > > in the callees of the bpf_prog_put_deferred() function:
>>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
>>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
>>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
>>> > > kvfree(prog->aux->jited_linfo);
>>> > > kvfree(prog->aux->linfo);
>>> >
>>> > Looks like you only have suspicion here. Could you find a real
>>> > violation
>>> > here where __bpf_prog_put() is called with !in_irq() &&
>>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
>>> have not seen
>>> > things like that.
>>> >
>>> > >
>>> > > Additionally, we found that array prog->aux->jited_linfo is
>>> > initialized in
>>> > > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
>>> > > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
>>> > > sizeof(*prog->aux->jited_linfo),
>>> bpf_memcg_flags(GFP_KERNEL |
>>> > __GFP_NOWARN));
>>> >
>>> > Any problem here?
>>> >
>>> > >
>>> > > Our question is whether the condition 'in_irq() ||
>>> > irqs_disabled() == false' is
>>> > > sufficient for calling 'kvfree'. We are aware that calling
>>> > 'kvfree' within the
>>> > > context of a spin lock or an RCU lock is unsafe.
>>>
>>> Your above analysis makes sense if indeed that kvfree cannot appear
>>> inside a spin lock region or RCU read lock region. But is it true?
>>> I checked a few code paths in kvfree/kfree. It is either guarded
>>> with local_irq_save/restore or by
>>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
>>> anything? Are you talking about RT kernel here?
>>>
>>>
>>> > >
>>> > > Therefore, we propose modifying the condition to include
>>> > in_atomic(). Could we
>>> > > update the condition as follows: "in_irq() ||
>>> irqs_disabled() ||
>>> > in_atomic()"?
>>> > >
>>> > > Thank you! We look forward to your feedback.
>>> > >
>>> > > Signed-off-by: Teng Qi <starmiku1207184332@...il.com
>>> <mailto:starmiku1207184332@...il.com>
>>> > <mailto:starmiku1207184332@...il.com
>>> <mailto:starmiku1207184332@...il.com>>>
>>> >
>>>
Powered by blists - more mailing lists