[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20210616165250.ejtcvgip5q5hbacy@ast-mbp.dhcp.thefacebook.com>
Date: Wed, 16 Jun 2021 09:52:50 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Yonghong Song <yhs@...com>,
"David S. Miller" <davem@...emloft.net>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Network Development <netdev@...r.kernel.org>,
bpf <bpf@...r.kernel.org>, Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 1/3] bpf: Introduce bpf_timer
On Tue, Jun 15, 2021 at 10:54:40PM -0700, Andrii Nakryiko wrote:
>
> It could be the case, of course. But let's try to think this through
> to the end before giving up. I think it's mostly because we are trying
> to be too clever with lockless synchronization.
imo your proposed code fits "too clever" too ;)
Just a reminder that few emails ago you've argued
about "obviously correct" approach, but now...
> I had a feeling that bpf_timer_cb needs to take lock as well. But once
> we add that, refcounting becomes simpler and more deterministic, IMO.
> Here's what I have in mind. I keep only important parts of the code,
> so it's not a complete implementation. Please take a look below, I
> left a few comments here and there.
>
>
> struct bpf_hrtimer {
> struct hrtimer timer;
> struct bpf_map *map;
> void *value;
>
> struct bpf_prog *prog;
> void *callback_fn;
>
> /* pointer to that lock in struct bpf_timer_kern
> * so that we can access it from bpf_timer_cb()
> */
> struct bpf_spin_lock *lock;
> };
>
> BPF_CALL_5(bpf_timer_init, struct bpf_timer_kern *, timer, int, flags,
> struct bpf_map *, map)
> {
> struct bpf_hrtimer *t;
> int ret = 0;
>
> ____bpf_spin_lock(&timer->lock);
> t = timer->timer;
> if (t) {
> ret = -EBUSY;
> goto out;
> }
> /* allocate hrtimer via map_kmalloc to use memcg accounting */
> t = bpf_map_kmalloc_node(map, sizeof(*t), GFP_ATOMIC, NUMA_NO_NODE);
> if (!t) {
> ret = -ENOMEM;
> goto out;
> }
> t->value = (void *)timer /* - offset of bpf_timer inside elem */;
> t->map = map;
> t->timer.function = bpf_timer_cb;
>
> /* we'll init them in bpf_timer_start */
> t->prog = NULL;
> t->callback_fn = NULL;
>
> hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> timer->timer = t;
> out:
> ____bpf_spin_unlock(&timer->lock);
> return ret;
> }
>
>
> BPF_CALL_2(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs,
> void *, cb, struct bpf_prog *, prog)
> {
> struct bpf_hrtimer *t;
> int ret = 0;
>
> ____bpf_spin_lock(&timer->lock);
> t = timer->timer;
> if (!t) {
> ret = -EINVAL;
> goto out;
> }
>
> /* doesn't matter what it returns, we just request cancellation */
> hrtimer_try_to_cancel(&t->timer);
>
> /* t->prog might not be the same as prog (!) */
> if (prog != t->prog) {
> /* callback hasn't yet dropped refcnt */
> if (t->prog) /* if it's null bpf_timer_cb() is running and
> will put it later */
> bpf_prog_put(t->prog);
>
> if (IS_ERR(bpf_prog_inc_not_zero(prog))) {
> /* this will only happen if prog is still running (and
> it's actually us),
> * but it was already put to zero, e.g., by closing last FD,
> * so there is no point in scheduling a new run
> */
I have a bit of mind explosion here... everything will be alright.
> t->prog = NULL;
> t->callback_fn = NULL;
> ret = -E_WE_ARE_SHUTTING_DOWN;
> goto out;
> }
> } /* otherwise we keep existing refcnt on t->prog == prog */
>
> /* potentially new combination of prog and cb */
> t->prog = prog;
> t->callback_fn = cb;
>
> hrtimer_start(&t->timer, ns_to_ktime(nsecs), HRTIMER_MODE_REL_SOFT);
> out:
> ____bpf_spin_unlock(&timer->lock);
> return ret;
> }
>
> BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> {
> struct bpf_hrtimer *t;
> int ret = 0;
>
> ____bpf_spin_lock(&timer->lock);
> t = timer->timer;
> if (!t) {
> ret = -EINVAL;
> goto out;
> }
>
> /* this part I still worry about due to possibility of cpu migration,
> * we need to think if we should migrate_disable() in bpf_timer_cb()
> * and bpf_timer_* helpers(), but that's a separate topic
> */
> if (this_cpu_read(hrtimer_running) == t) {
> ret = -EDEADLK;
> goto out;
> }
>
> ret = hrtimer_cancel(&t->timer);
>
> if (t->prog) {
> /* bpf_timer_cb hasn't put it yet (and now won't) */
> bpf_prog_put(t->prog);
> t->prog = NULL;
> t->callback_fn = NULL;
> }
> out:
> ____bpf_spin_unlock(&timer->lock);
> return ret;
> }
>
> static enum hrtimer_restart bpf_timer_cb(struct hrtimer *timer)
> {
> struct bpf_hrtimer *t = container_of(timer, struct bpf_hrtimer, timer);
> struct bpf_map *map = t->map;
> struct bpf_prog *prog;
> void *key, *callback_fn;
> u32 idx;
> int ret;
>
> /* this is very IMPORTANT */
> ____bpf_spin_lock(t->lock);
>
> prog = t->prog;
> if (!prog) {
> /* we were cancelled, prog is put already, exit early */
> ____bpf_spin_unlock(&timer->lock);
> return HRTIMER_NORESTART;
> }
> callback_fn = t->callback_fn;
>
> /* make sure bpf_timer_cancel/bpf_timer_start won't
> bpf_prog_put our prog */
> t->prog = NULL;
> t->callback_fn = NULL;
>
> ____bpf_spin_unlock(t->lock);
>
> /* at this point we "own" prog's refcnt decrement */
>
> this_cpu_write(hrtimer_running, t);
>
> ...
>
> ret = BPF_CAST_CALL(t->callback_fn)((u64)(long)map,
> (u64)(long)key,
> (u64)(long)value, 0, 0);
> WARN_ON(ret != 0); /* Next patch disallows 1 in the verifier */
>
> bpf_prog_put(prog); /* always correct and non-racy */
>
> this_cpu_write(hrtimer_running, NULL);
>
> return HRTIMER_NORESTART;
> }
>
> bpf_timer_cancel_and_free() is mostly the same with t->prog NULL check
> as everywhere else
I haven't started detailed analysis of above proposal, but looks overly
complicated on the first glance. Not saying it's bad or good.
Just complexity and races are striking.
>
> > There is no need to complicate bpf_timer with crazy refcnting schemes.
> > The user space can simply pin the program in bpffs. In the future we might
> > introduce a self-pinning helper that would pin the program and create a file.
> > Sort-of like syscall prog type would pin self.
> > That would be explicit and clean api instead of obscure hacks inside bpf_timer*().
>
> Do I understand correctly that the alternative that you are proposing
> is to keep some linked list of all map_values across all maps in the
> system that have initialized bpf_hrtimer with that particular bpf_prog
> in them? And when bpf_prog is put to zero you'll go and destruct them
> all in a race-free way?
>
> I have a bit of a hard time imagining how that will be implemented
> exactly, so I might be overcomplicating that in my mind. Will be happy
> to see the working code.
Here is working code...
Note how patch 1 is so much simpler without complicated refcnting.
And how patch 2 removes for_each_map_element that was necessary earlier.
Also note that link list approach is an optimization.
Instead of keeping a link list the bpf_free_used_timers() could call
a map specific op to iterate all elems and free timers with
timer->prog == prog_going_away.
That was my initial proposal couple month ago.
link_list is purely an optimization instead of for_each_map_elem.
View attachment "0001-bpf-Cancel-and-free-timers-when-prog-is-going-away.patch" of type "text/plain" (7319 bytes)
View attachment "0002-bpf-Don-t-iterate-all-map-elements-anymore.patch" of type "text/plain" (1773 bytes)
Powered by blists - more mailing lists