Message-ID: <CAADnVQJq+NA0denwyr56jYz73n5BnkKF_GtY0zpwqsSvCrGs6Q@mail.gmail.com>
Date: Wed, 28 Jun 2023 20:52:12 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
David Vernet <void@...ifault.com>, Hou Tao <houtao@...weicloud.com>, Tejun Heo <tj@...nel.org>,
rcu@...r.kernel.org, Network Development <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v3 bpf-next 12/13] bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().

On Wed, Jun 28, 2023 at 10:57 AM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Tue, Jun 27, 2023 at 06:56:33PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@...nel.org>
> >
> > Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu().
> > Unlike bpf_mem_[cache_]free(), which links objects into the per-cpu
> > free list for immediate reuse, the _rcu() flavor waits for an RCU
> > grace period and then moves objects into the free_by_rcu_ttrace list,
> > where they wait for an RCU tasks trace grace period before being
> > freed into slab.
> >
> > The life cycle of objects:
> > alloc: dequeue free_llist
> > free: enqueue free_llist
> > free_rcu: enqueue free_by_rcu -> waiting_for_gp
> > free_llist above high watermark -> free_by_rcu_ttrace
> > after RCU GP waiting_for_gp -> free_by_rcu_ttrace
> > free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab
> >
> > Signed-off-by: Alexei Starovoitov <ast@...nel.org>
> > ---
> > include/linux/bpf_mem_alloc.h | 2 +
> > kernel/bpf/memalloc.c | 129 +++++++++++++++++++++++++++++++++-
> > 2 files changed, 128 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
> > index 3929be5743f4..d644bbb298af 100644
> > --- a/include/linux/bpf_mem_alloc.h
> > +++ b/include/linux/bpf_mem_alloc.h
> > @@ -27,10 +27,12 @@ void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma);
> > /* kmalloc/kfree equivalent: */
> > void *bpf_mem_alloc(struct bpf_mem_alloc *ma, size_t size);
> > void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr);
> > +void bpf_mem_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
> >
> > /* kmem_cache_alloc/free equivalent: */
> > void *bpf_mem_cache_alloc(struct bpf_mem_alloc *ma);
> > void bpf_mem_cache_free(struct bpf_mem_alloc *ma, void *ptr);
> > +void bpf_mem_cache_free_rcu(struct bpf_mem_alloc *ma, void *ptr);
> > void bpf_mem_cache_raw_free(void *ptr);
> > void *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags);
> >
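For readers skimming the thread, here is a minimal hypothetical caller
sketch of the new API (struct my_node, delete_node() and the init call
are made up for illustration; only the bpf_mem_* functions come from
this header):

#include <linux/bpf_mem_alloc.h>
#include <linux/types.h>

struct my_node {
	struct my_node __rcu *next;
	u64 key;
};

/* assume bpf_mem_alloc_init(&ma, sizeof(struct my_node), false) ran at map init */
static struct bpf_mem_alloc ma;

static void delete_node(struct my_node *n)
{
	/* bpf_mem_free() would put the object back on the per-cpu free list
	 * for immediate reuse, which is unsafe while RCU readers may still
	 * walk ->next. bpf_mem_free_rcu() defers reuse until after an RCU GP
	 * and frees back to slab only after a further RCU tasks trace GP,
	 * per the life cycle in the commit log above.
	 */
	bpf_mem_free_rcu(&ma, n);
}
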
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 40524d9454c7..3081d06a434c 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -101,6 +101,15 @@ struct bpf_mem_cache {
> > bool draining;
> > struct bpf_mem_cache *tgt;
> >
> > + /* list of objects to be freed after RCU GP */
> > + struct llist_head free_by_rcu;
> > + struct llist_node *free_by_rcu_tail;
> > + struct llist_head waiting_for_gp;
> > + struct llist_node *waiting_for_gp_tail;
> > + struct rcu_head rcu;
> > + atomic_t call_rcu_in_progress;
> > + struct llist_head free_llist_extra_rcu;
> > +
> > /* list of objects to be freed after RCU tasks trace GP */
> > struct llist_head free_by_rcu_ttrace;
> > struct llist_head waiting_for_gp_ttrace;
> > @@ -344,6 +353,69 @@ static void free_bulk(struct bpf_mem_cache *c)
> > do_call_rcu_ttrace(tgt);
> > }
> >
> > +static void __free_by_rcu(struct rcu_head *head)
> > +{
> > + struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> > + struct bpf_mem_cache *tgt = c->tgt;
> > + struct llist_node *llnode;
> > +
> > + llnode = llist_del_all(&c->waiting_for_gp);
> > + if (!llnode)
> > + goto out;
> > +
> > + llist_add_batch(llnode, c->waiting_for_gp_tail, &tgt->free_by_rcu_ttrace);
> > +
> > + /* Objects went through regular RCU GP. Send them to RCU tasks trace */
> > + do_call_rcu_ttrace(tgt);
> > +out:
> > + atomic_set(&c->call_rcu_in_progress, 0);
> > +}
> > +
> > +static void check_free_by_rcu(struct bpf_mem_cache *c)
> > +{
> > + struct llist_node *llnode, *t;
> > + unsigned long flags;
> > +
> > + /* drain free_llist_extra_rcu */
> > + if (unlikely(!llist_empty(&c->free_llist_extra_rcu))) {
> > + inc_active(c, &flags);
> > + llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra_rcu))
> > + if (__llist_add(llnode, &c->free_by_rcu))
> > + c->free_by_rcu_tail = llnode;
> > + dec_active(c, flags);
> > + }
> > +
> > + if (llist_empty(&c->free_by_rcu))
> > + return;
> > +
> > + if (atomic_xchg(&c->call_rcu_in_progress, 1)) {
> > + /*
> > + * Instead of kmalloc-ing new rcu_head and triggering 10k
> > + * call_rcu() to hit rcutree.qhimark and force RCU to notice
> > + * the overload, just ask RCU to hurry up. There could be many
> > + * objects in free_by_rcu list.
> > + * This hint reduces memory consumption for an artificial
> > + * benchmark from 2 Gbyte to 150 Mbyte.
> > + */
> > + rcu_request_urgent_qs_task(current);
>
> I have been going back and forth on whether rcu_request_urgent_qs_task()
> needs to throttle calls to itself, for example, to pay attention to only
> one invocation per jiffy. The theory here is that RCU's state machine
> normally only advances about once per jiffy anyway.
>
> The main risk of *not* throttling is if several CPUs were to invoke
> rcu_request_urgent_qs_task() in tight loops while those same CPUs were
> undergoing interrupt storms, which would result in heavy lock contention
> in __rcu_irq_enter_check_tick(). This is not exactly a common-case
> scenario, but on the other hand, if you are having this degree of trouble,
> should RCU really be adding lock contention to your troubles?
I see the spin_lock in __rcu_irq_enter_check_tick(), but I didn't
observe any contention on it in practice, even when I was calling
rcu_request_urgent_qs_task() from multiple places in bpf_mem_alloc.
I left it in only one place (this patch),
because that was enough to 'hurry up' RCU and make the difference.
rdp = this_cpu_ptr(&rcu_data); is per-cpu, so I'm not sure why
you think contention is possible.
I think we should avoid extra logic in either RCU or bpf_mem_alloc
to keep the code simple, since the contention is hypothetical at this point.
I've tried preempt and non-preempt configs, with and without debug.
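
If that contention ever shows up in practice, a caller-side rate limit
would be a few lines anyway. A rough sketch of the once-per-jiffy idea
(maybe_request_urgent_qs() and the per-cpu variable are made up, not
part of this patch, and it assumes rcu_request_urgent_qs_task() is
declared in a header visible here):

#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(unsigned long, urgent_qs_last);

static void maybe_request_urgent_qs(void)
{
	/* ask RCU to hurry at most once per jiffy per CPU */
	if (this_cpu_read(urgent_qs_last) != jiffies) {
		this_cpu_write(urgent_qs_last, jiffies);
		rcu_request_urgent_qs_task(current);
	}
}

But I'd rather not add even that until it's shown to matter.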