netdev - Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAADnVQL70bVdms6_D_ep1L2v-OcgXu-9KTtLULQdfCMftLhENQ@mail.gmail.com>
Date:   Tue, 15 Dec 2020 18:35:35 -0800
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii.nakryiko@...il.com>,
        Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Cong Wang <cong.wang@...edance.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Dongdong Wang <wangdongdong.6@...edance.com>
Subject: Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

On Tue, Dec 15, 2020 at 6:10 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>
> On Tue, Dec 15, 2020 at 5:14 PM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Tue, Dec 15, 2020 at 04:22:21PM -0800, Cong Wang wrote:
> > > On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@...earbox.net> wrote:
> > > >
> > > > On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > > > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
> > > > >>
> > > > >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > > >> <andrii.nakryiko@...il.com> wrote:
> > > > >>>
> > > > >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
> > > > >>>>
> > > > >>>> From: Cong Wang <cong.wang@...edance.com>
> > > > >>>>
> > > > >>>> This borrows the idea from conntrack and will be used for conntrack in
> > > > >>>> bpf too. Each element in a timeout map has a user-specified timeout
> > > > >>>> in secs, after it expires it will be automatically removed from the map.
> > > > [...]
> > > > >>>>          char key[] __aligned(8);
> > > > >>>>   };
> > > > >>>>
> > > > >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> > > > >>>>
> > > > >>>>          for (i = 0; i < htab->n_buckets; i++) {
> > > > >>>>                  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > > > >>>> +               atomic_set(&htab->buckets[i].pending, 0);
> > > > >>>>                  if (htab_use_raw_lock(htab)) {
> > > > >>>>                          raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > > > >>>>                          lockdep_set_class(&htab->buckets[i].raw_lock,
> > > > >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > > > >>>>          return 0;
> > > > >>>>   }
> > > > >>>>
> > > > >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > > > >>>> +{
> > > > >>>> +       if (atomic_fetch_or(1, &b->pending))
> > > > >>>> +               return;
> > > > >>>> +       llist_add(&b->gc_node, &htab->gc_list);
> > > > >>>> +       queue_work(system_unbound_wq, &htab->gc_work);
> > > > >>>> +}
> > > > >>>
> > > > >>> I'm concerned about each bucket being scheduled individually... And
> > > > >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> > > > >>> scheduling independently. Can you think about the way to have a
> > > > >>> "global" gc/purging logic, and just make sure that buckets that need
> > > > >>> processing would be just internally chained together. So the purging
> > > > >>> routing would iterate all the scheduled hashmaps, and within each it
> > > > >>> will have a linked list of buckets that need processing? And all that
> > > > >>> is done just once each GC period. Not N times for N maps or N*M times
> > > > >>> for N maps with M buckets in each.
> > > > >>
> > > > >> Our internal discussion went to the opposite actually, people here argued
> > > > >> one work is not sufficient for a hashtable because there would be millions
> > > > >> of entries (max_entries, which is also number of buckets). ;)
> > > > >
> > > > > I was hoping that it's possible to expire elements without iterating
> > > > > the entire hash table every single time, only items that need to be
> > > > > processed. Hashed timing wheel is one way to do something like this,
> > > > > kernel has to solve similar problems with timeouts as well, why not
> > > > > taking inspiration there?
> > > >
> > > > Couldn't this map be coupled with LRU map for example through flag on map
> > > > creation so that the different LRU map flavors can be used with it? For BPF
> > > > CT use case we do rely on LRU map to purge 'inactive' entries once full. I
> > > > wonder if for that case you then still need to schedule a GC at all.. e.g.
> > > > if you hit the condition time_after_eq64(now, entry->expires) you'd just
> > > > re-link the expired element from the public htab to e.g. the LRU's local
> > > > CPU's free/pending-list instead.
> > >
> > > I doubt we can use size as a limit to kick off GC or LRU, it must be
> > > time-based. And in case of idle, there has to be an async GC, right?
> >
> > Why does it have to be time based?
>
> Because it is how a session timeouts? For instance, CT uses
> nf_conntrack_udp_timeout to timeout UDP sessions. Or are we going
> to redefine conntrack?

hmm. I see no reason to re-implement conntrack as-is in bpf.

> > Why LRU alone is not enough?
> > People implemented conntracker in bpf using LRU map.
>
> Sure, people also implement CT on native hash map too and timeout
> with user-space timers. ;)

exactly. what's wrong with that?
Perfectly fine way to do CT.

> > Anything extra can be added on top from user space
> > which can easily copy with 1 sec granularity.
>
> The problem is never about granularity, it is about how efficient we can
> GC. User-space has to scan the whole table one by one, while the kernel
> can just do this behind the scene with a much lower overhead.
>
> Let's say we arm a timer for each entry in user-space, it requires a syscall
> and locking buckets each time for each entry. Kernel could do it without
> any additional syscall and batching. Like I said above, we could have
> millions of entries, so the overhead would be big in this scenario.

and the user space can pick any other implementation instead
of trivial entry by entry gc with timer.

> > Say the kernel does GC and deletes htab entries.
> > How user space will know that it's gone? There would need to be
>
> By a lookup.
>
> > an event sent to user space when entry is being deleted by the kernel.
> > But then such event will be racy. Instead when timers and expirations
> > are done by user space everything is in sync.
>
> Why there has to be an event?

because when any production worthy implementation moves
past the prototype stage there is something that user space needs to keep
as well. Sometimes the bpf map in the kernel is alone.
But a lot of times there is a user space mirror of the map in c++ or golang
with the same key where user space keeps extra data.