netdev - Re: [RFC Patch bpf-next] bpf: introduce bpf timer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM_iQpWBrxuT=Y3CbhxYpE5a+QSk-O=Vj4euegggXAAKTHRBqw@mail.gmail.com>
Date:   Sat, 8 May 2021 22:37:42 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>,
        Xiongchun Duan <duanxiongchun@...edance.com>,
        Dongdong Wang <wangdongdong.6@...edance.com>,
        Muchun Song <songmuchun@...edance.com>,
        Cong Wang <cong.wang@...edance.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        Pedro Tammela <pctammela@...atatu.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Joe Stringer <joe@...ium.io>
Subject: Re: [RFC Patch bpf-next] bpf: introduce bpf timer

On Tue, Apr 27, 2021 at 11:34 AM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Tue, Apr 27, 2021 at 9:36 AM Cong Wang <xiyou.wangcong@...il.com> wrote:
> >
> > If we enforce this ownership, in case of conntrack the owner would be
> > the program which sees the connection first, which is pretty much
> > unpredictable. For example, if the ingress program sees a connection
> > first, it installs a timer for this connection, but the traffic is
> > bidirectional,
> > hence egress program needs this connection and its timer too, we
> > should not remove this timer when the ingress program is freed.
>
> Sure. That's trivially achieved with pinning.

If users forget to do so, their ebpf program would crash the kernel,
right? But ebpf programs should never crash the kernel, right?

> One can have an ingress prog that tailcalls into another prog
> that arms the timer with one of its subprogs.
> Egress prog can tailcall into the same prog as well.
> The ingress and egress progs can be replaced one by one
> or removed both together and middle prog can stay alive
> if it's pinned in bpffs or held alive by FD.

This looks necessarily complex. Look at the overhead of using
a timer properly here:

1. pin timer callback program
2. a program to install timer
3. a program array contains the above program
4. a tail call into the above program array

Why not design a simpler solution?

>
> > From another point of view: maps and programs are both first-class
> > resources in eBPF, a timer is stored in a map and associated with a
> > program, so it is naturally a first-class resource too.
>
> Not really. The timer abstraction is about data. It invokes the callback.
> That callback is a part of the program. The lifetime of the timer object
> and lifetime of the callback can be different.
> Obviously the timer logic need to make sure that callback text is alive
> when the timer is armed.

Only if the callback could reference struct bpf_prog... And even if it
could, how about users forgetting to do so? ebpf verifier has to reject
such cases.

> Combining timer and callback concepts creates a messy abstraction.
> In the normal kernel code one can have a timer in any kernel data
> structure and callback in the kernel text or in the kernel module.
> The code needs to make sure that the module won't go away while
> the timer is armed. Same thing with bpf progs. The progs are safe
> kernel modules. The timers are independent objects.

Kernel modules can take reference count of its own module very
easily, plus there is no verifier for kernel modules. I don't understand
why you want to make ebpf programs as close to kernel modules as
possible in this case.

>
> > >
> > > > >
> > > > > Also if your colleagues have something to share they should be
> > > > > posting to the mailing list. Right now you're acting as a broken phone
> > > > > passing info back and forth and the knowledge gets lost.
> > > > > Please ask your colleagues to participate online.
> > > >
> > > > They are already in CC from the very beginning. And our use case is
> > > > public, it is Cilium conntrack:
> > > > https://github.com/cilium/cilium/blob/master/bpf/lib/conntrack.h
> > > >
> > > > The entries of the code are:
> > > > https://github.com/cilium/cilium/blob/master/bpf/bpf_lxc.c
> > > >
> > > > The maps for conntrack are:
> > > > https://github.com/cilium/cilium/blob/master/bpf/lib/conntrack_map.h
> > >
> > > If that's the only goal then kernel timers are not needed.
> > > cilium conntrack works well as-is.
> >
> > We don't go back to why user-space cleanup is inefficient again,
> > do we? ;)
>
> I remain unconvinced that cilium conntrack _needs_ timer apis.
> It works fine in production and I don't hear any complaints
> from cilium users. So 'user space cleanup inefficiencies' is
> very subjective and cannot be the reason to add timer apis.

I am pretty sure I showed the original report to you when I sent
timeout hashmap patch, in case you forgot here it is again:
https://github.com/cilium/cilium/issues/5048

and let me quote the original report here:

"The current implementation (as of v1.2) for managing the contents of
the datapath connection tracking map leaves something to be desired:
Once per minute, the userspace cilium-agent makes a series of calls to
the bpf() syscall to fetch all of the entries in the map to determine
whether they should be deleted. For each entry in the map, 2-3 calls
must be made: One to fetch the next key, one to fetch the value, and
perhaps one to delete the entry. The maximum size of the map is 1
million entries, and if the current count approaches this size then
the garbage collection goroutine may spend a significant number of CPU
cycles iterating and deleting elements from the conntrack map."

(Adding Joe in Cc too.)

Thanks.