Message-ID: <05cde576-e480-46e1-1228-7f54497a6252@huaweicloud.com>
Date: Sat, 24 Jun 2023 17:12:21 +0800
From: Hou Tao <houtao@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: tj@...nel.org, rcu@...r.kernel.org, netdev@...r.kernel.org,
bpf@...r.kernel.org, kernel-team@...com, daniel@...earbox.net,
andrii@...nel.org, void@...ifault.com, paulmck@...nel.org
Subject: Re: [PATCH bpf-next 11/12] bpf: Introduce bpf_mem_free_rcu() similar
to kfree_rcu().
Hi,
On 6/22/2023 11:23 AM, Alexei Starovoitov wrote:
> On Wed, Jun 21, 2023 at 08:26:30PM +0800, Hou Tao wrote:
>>>
SNIP
>>> rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
>>> + rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>> I got an oops in rcu_tasks_invoke_cbs() during a stress test and it
>> seems we should do atomic_read(&call_rcu_in_progress) first, then do
>> atomic_read(&call_rcu_ttrace_in_progress) to fix the problem. And to
> Yes, it's a race. As you found out yourself, changing the order won't fix it.
>
>> The introduction of c->tgt makes the destroy procedure more complicated.
>> Even with the proposed fix above, there is still an oops in
>> rcu_tasks_invoke_cbs() and I think it happens as follows:
> Right. That's another issue.
>
> Please send bench htab test and your special stress test,
> so we can have a common ground to reason about.
> Also please share your bench htab numbers before/after.
I will send the htab-mem benchmark next week and update the benchmark
results accordingly.
There is nothing special about my local stress test. I just hacked htab to
use bpf_mem_cache_free_rcu() and ran multiple instances of test_maps,
map_perf_test and the htab_mem benchmark simultaneously.
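For reference, the hack was roughly the following kind of change in
kernel/bpf/hashtab.c (just a sketch, the exact hunk and the other free
paths I touched may differ), e.g. in htab_elem_free():

-	bpf_mem_cache_free(&htab->ma, l);
+	/* free through the new RCU-deferred path to stress it */
+	bpf_mem_cache_free_rcu(&htab->ma, l);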
>
> I'm thinking to fix the races in the following way.
> Could you please test it with your stress test?
> The idea is to set 'draining' first everywhere, so that it will make all rcu
> callbacks a nop.
> Then drain all link lists. At this point nothing races with them.
> And then wait for a single rcu_barrier_tasks_trace() that will make sure
> all callbacks are done. At this point the only thing they will do is
> if (c->draining) goto out;
> The barriers are needed so that access to 'c' is not a use-after-free.
As I replied to v2, I don't think the lockless checking of
c->draining will work, but I will test it anyway.
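To be more specific, the interleaving I am worried about looks roughly as
follows (just a sketch based on my understanding of the series; the exact
window and call sites may differ):

  bpf_mem_alloc_destroy()                __free_by_rcu() on cache c
  -----------------------                --------------------------
                                         reads c->draining == false
  c->draining = true for all caches
  free_mem_alloc() (directly or from
  the deferred path):
    rcu_barrier_tasks_trace() returns;
    the __free_rcu() below is not
    queued yet, so it is not waited for
                                         moves waiting_for_gp entries to
                                         c->tgt and calls
                                         do_call_rcu_ttrace(c->tgt), which
                                         queues __free_rcu() on c->tgt
    free_mem_alloc_no_barrier() frees
    all caches, including c->tgt
                                         __free_rcu() later runs on freed
                                         memory -> oops in
                                         rcu_tasks_invoke_cbs()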
>
> ...
>
> >From e20782160166d4327c76b57af160c4973396e0d0 Mon Sep 17 00:00:00 2001
> From: Alexei Starovoitov <ast@...nel.org>
> Date: Wed, 21 Jun 2023 20:11:33 -0700
> Subject: [PATCH bpf-next] bpf: Fix races.
>
> Signed-off-by: Alexei Starovoitov <ast@...nel.org>
> ---
> kernel/bpf/memalloc.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 4d1002e7b4b5..75c553b15deb 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -99,6 +99,7 @@ struct bpf_mem_cache {
> int low_watermark, high_watermark, batch;
> int percpu_size;
> struct bpf_mem_cache *tgt;
> + bool draining;
>
> /* list of objects to be freed after RCU GP */
> struct llist_head free_by_rcu;
> @@ -264,7 +265,10 @@ static void __free_rcu(struct rcu_head *head)
> {
> struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
>
> + if (unlikely(c->draining))
> + goto out;
> free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
> +out:
> atomic_set(&c->call_rcu_ttrace_in_progress, 0);
> }
>
> @@ -353,8 +357,11 @@ static void __free_by_rcu(struct rcu_head *head)
> {
> struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> struct bpf_mem_cache *tgt = c->tgt;
> - struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
> + struct llist_node *llnode;
>
> + if (unlikely(c->draining))
> + goto out;
> + llnode = llist_del_all(&c->waiting_for_gp);
> if (!llnode)
> goto out;
>
> @@ -568,10 +575,9 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
> * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
> * using rcu_trace_implies_rcu_gp() as well.
> */
> - rcu_barrier(); /* wait for __free_by_rcu() */
> - rcu_barrier_tasks_trace(); /* wait for __free_rcu() via call_rcu_tasks_trace */
> + rcu_barrier_tasks_trace();
> if (!rcu_trace_implies_rcu_gp())
> - rcu_barrier(); /* wait for __free_rcu() via call_rcu */
> + rcu_barrier();
> free_mem_alloc_no_barrier(ma);
> }
>
> @@ -616,6 +622,10 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>
> if (ma->cache) {
> rcu_in_progress = 0;
> + for_each_possible_cpu(cpu) {
> + c = per_cpu_ptr(ma->cache, cpu);
> + c->draining = true;
> + }
> for_each_possible_cpu(cpu) {
> c = per_cpu_ptr(ma->cache, cpu);
> /*
> @@ -639,6 +649,13 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> }
> if (ma->caches) {
> rcu_in_progress = 0;
> + for_each_possible_cpu(cpu) {
> + cc = per_cpu_ptr(ma->caches, cpu);
> + for (i = 0; i < NUM_CACHES; i++) {
> + c = &cc->cache[i];
> + c->draining = true;
> + }
> + }
> for_each_possible_cpu(cpu) {
> cc = per_cpu_ptr(ma->caches, cpu);
> for (i = 0; i < NUM_CACHES; i++) {