Message-ID: <05cde576-e480-46e1-1228-7f54497a6252@huaweicloud.com>
Date: Sat, 24 Jun 2023 17:12:21 +0800
From: Hou Tao <houtao@...weicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: tj@...nel.org, rcu@...r.kernel.org, netdev@...r.kernel.org,
bpf@...r.kernel.org, kernel-team@...com, daniel@...earbox.net,
andrii@...nel.org, void@...ifault.com, paulmck@...nel.org
Subject: Re: [PATCH bpf-next 11/12] bpf: Introduce bpf_mem_free_rcu() similar
to kfree_rcu().
Hi,
On 6/22/2023 11:23 AM, Alexei Starovoitov wrote:
> On Wed, Jun 21, 2023 at 08:26:30PM +0800, Hou Tao wrote:
>>>
SNIP
>>> rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress);
>>> + rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
>> I got an oops in rcu_tasks_invoke_cbs() during a stress test and it
>> seems we should do atomic_read(&call_rcu_in_progress) first, then do
>> atomic_read(&call_rcu_ttrace_in_progress) to fix the problem. And to
> Yes, it's a race. As you found out yourself, changing the order won't fix it.
>
>> The introduction of c->tgt makes the destroy procedure more complicated.
>> Even with the proposed fix above, there is still an oops in
>> rcu_tasks_invoke_cbs() and I think it happens as follows:
> Right. That's another issue.
>
> Please send bench htab test and your special stress test,
> so we can have a common ground to reason about.
> Also please share your bench htab numbers before/after.
I will send the htab-mem benchmark next week and update the benchmark
results accordingly.
There is nothing special about my local stress test. I just hacked htab to
use bpf_mem_cache_free_rcu() and ran multiple instances of test_maps,
map_perf_test and the htab_mem benchmark simultaneously.
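For reference, the hack was roughly the following kind of change in
kernel/bpf/hashtab.c (just a sketch, the exact hunk and the other free
paths I touched may differ), e.g. in htab_elem_free():

-	bpf_mem_cache_free(&htab->ma, l);
+	/* free through the new RCU-deferred path to stress it */
+	bpf_mem_cache_free_rcu(&htab->ma, l);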
>
> I'm thinking to fix the races in the following way.
> Could you please test it with your stress test?
> The idea is to set 'draining' first everywhere, so that it will make all rcu
> callbacks a nop.
> Then drain all link lists. At this point nothing races with them.
> And then wait for a single rcu_barrier_tasks_trace() that will make sure
> all callbacks are done. At this point the only thing they will do is
> if (c->draining) goto out;
> The barriers are needed so that access to 'c' is not a use-after-free.
As I replied to v2, I don't think the lockless checking of
c->draining will work, but I will test it anyway.
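To be more specific, the interleaving I am worried about looks roughly as
follows (just a sketch based on my understanding of the series; the exact
window and call sites may differ):

  bpf_mem_alloc_destroy()                __free_by_rcu() on cache c
  -----------------------                --------------------------
                                         reads c->draining == false
  c->draining = true for all caches
  free_mem_alloc() (directly or from
  the deferred path):
    rcu_barrier_tasks_trace() returns;
    the __free_rcu() below is not
    queued yet, so it is not waited for
                                         moves waiting_for_gp entries to
                                         c->tgt and calls
                                         do_call_rcu_ttrace(c->tgt), which
                                         queues __free_rcu() on c->tgt
    free_mem_alloc_no_barrier() frees
    all caches, including c->tgt
                                         __free_rcu() later runs on freed
                                         memory -> oops in
                                         rcu_tasks_invoke_cbs()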
>
> ...
>
> >From e20782160166d4327c76b57af160c4973396e0d0 Mon Sep 17 00:00:00 2001
> From: Alexei Starovoitov <ast@...nel.org>
> Date: Wed, 21 Jun 2023 20:11:33 -0700
> Subject: [PATCH bpf-next] bpf: Fix races.
>
> Signed-off-by: Alexei Starovoitov <ast@...nel.org>
> ---
> kernel/bpf/memalloc.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 4d1002e7b4b5..75c553b15deb 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -99,6 +99,7 @@ struct bpf_mem_cache {
> int low_watermark, high_watermark, batch;
> int percpu_size;
> struct bpf_mem_cache *tgt;
> + bool draining;
>
> /* list of objects to be freed after RCU GP */
> struct llist_head free_by_rcu;
> @@ -264,7 +265,10 @@ static void __free_rcu(struct rcu_head *head)
> {
> struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu_ttrace);
>
> + if (unlikely(c->draining))
> + goto out;
> free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size);
> +out:
> atomic_set(&c->call_rcu_ttrace_in_progress, 0);
> }
>
> @@ -353,8 +357,11 @@ static void __free_by_rcu(struct rcu_head *head)
> {
> struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> struct bpf_mem_cache *tgt = c->tgt;
> - struct llist_node *llnode = llist_del_all(&c->waiting_for_gp);
> + struct llist_node *llnode;
>
> + if (unlikely(c->draining))
> + goto out;
> + llnode = llist_del_all(&c->waiting_for_gp);
> if (!llnode)
> goto out;
>
> @@ -568,10 +575,9 @@ static void free_mem_alloc(struct bpf_mem_alloc *ma)
> * rcu_trace_implies_rcu_gp(), it will be OK to skip rcu_barrier() by
> * using rcu_trace_implies_rcu_gp() as well.
> */
> - rcu_barrier(); /* wait for __free_by_rcu() */
> - rcu_barrier_tasks_trace(); /* wait for __free_rcu() via call_rcu_tasks_trace */
> + rcu_barrier_tasks_trace();
> if (!rcu_trace_implies_rcu_gp())
> - rcu_barrier(); /* wait for __free_rcu() via call_rcu */
> + rcu_barrier();
> free_mem_alloc_no_barrier(ma);
> }
>
> @@ -616,6 +622,10 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
>
> if (ma->cache) {
> rcu_in_progress = 0;
> + for_each_possible_cpu(cpu) {
> + c = per_cpu_ptr(ma->cache, cpu);
> + c->draining = true;
> + }
> for_each_possible_cpu(cpu) {
> c = per_cpu_ptr(ma->cache, cpu);
> /*
> @@ -639,6 +649,13 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
> }
> if (ma->caches) {
> rcu_in_progress = 0;
> + for_each_possible_cpu(cpu) {
> + cc = per_cpu_ptr(ma->caches, cpu);
> + for (i = 0; i < NUM_CACHES; i++) {
> + c = &cc->cache[i];
> + c->draining = true;
> + }
> + }
> for_each_possible_cpu(cpu) {
> cc = per_cpu_ptr(ma->caches, cpu);
> for (i = 0; i < NUM_CACHES; i++) {