netdev - Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1de72112-d2b8-24c5-de29-0d3dfd361f16@iogearbox.net>
Date:   Wed, 16 Dec 2020 00:23:07 +0100
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Andrii Nakryiko <andrii.nakryiko@...il.com>,
        Cong Wang <xiyou.wangcong@...il.com>
Cc:     Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Cong Wang <cong.wang@...edance.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Dongdong Wang <wangdongdong.6@...edance.com>
Subject: Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>>
>> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
>> <andrii.nakryiko@...il.com> wrote:
>>>
>>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>>>>
>>>> From: Cong Wang <cong.wang@...edance.com>
>>>>
>>>> This borrows the idea from conntrack and will be used for conntrack in
>>>> bpf too. Each element in a timeout map has a user-specified timeout
>>>> in secs, after it expires it will be automatically removed from the map.
[...]
>>>>          char key[] __aligned(8);
>>>>   };
>>>>
>>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
>>>>
>>>>          for (i = 0; i < htab->n_buckets; i++) {
>>>>                  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
>>>> +               atomic_set(&htab->buckets[i].pending, 0);
>>>>                  if (htab_use_raw_lock(htab)) {
>>>>                          raw_spin_lock_init(&htab->buckets[i].raw_lock);
>>>>                          lockdep_set_class(&htab->buckets[i].raw_lock,
>>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
>>>>          return 0;
>>>>   }
>>>>
>>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
>>>> +{
>>>> +       if (atomic_fetch_or(1, &b->pending))
>>>> +               return;
>>>> +       llist_add(&b->gc_node, &htab->gc_list);
>>>> +       queue_work(system_unbound_wq, &htab->gc_work);
>>>> +}
>>>
>>> I'm concerned about each bucket being scheduled individually... And
>>> similarly concerned that each instance of TIMEOUT_HASH will do its own
>>> scheduling independently. Can you think about the way to have a
>>> "global" gc/purging logic, and just make sure that buckets that need
>>> processing would be just internally chained together. So the purging
>>> routing would iterate all the scheduled hashmaps, and within each it
>>> will have a linked list of buckets that need processing? And all that
>>> is done just once each GC period. Not N times for N maps or N*M times
>>> for N maps with M buckets in each.
>>
>> Our internal discussion went to the opposite actually, people here argued
>> one work is not sufficient for a hashtable because there would be millions
>> of entries (max_entries, which is also number of buckets). ;)
> 
> I was hoping that it's possible to expire elements without iterating
> the entire hash table every single time, only items that need to be
> processed. Hashed timing wheel is one way to do something like this,
> kernel has to solve similar problems with timeouts as well, why not
> taking inspiration there?

Couldn't this map be coupled with LRU map for example through flag on map
creation so that the different LRU map flavors can be used with it? For BPF
CT use case we do rely on LRU map to purge 'inactive' entries once full. I
wonder if for that case you then still need to schedule a GC at all.. e.g.
if you hit the condition time_after_eq64(now, entry->expires) you'd just
re-link the expired element from the public htab to e.g. the LRU's local
CPU's free/pending-list instead.

Thanks,
Daniel