Message-ID: <c57e287b-ac99-381f-cb9a-96e613d9ce68@iogearbox.net>
Date: Thu, 17 Dec 2020 23:39:49 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: Andrii Nakryiko <andrii.nakryiko@...il.com>,
Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
Cong Wang <cong.wang@...edance.com>,
Alexei Starovoitov <ast@...nel.org>,
Dongdong Wang <wangdongdong.6@...edance.com>
Subject: Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map
On 12/16/20 1:22 AM, Cong Wang wrote:
> On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>>
>> On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
>>> On Tue, Dec 15, 2020 at 12:06 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>>>>
>>>> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
>>>> <andrii.nakryiko@...il.com> wrote:
>>>>>
>>>>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang <xiyou.wangcong@...il.com> wrote:
>>>>>>
>>>>>> From: Cong Wang <cong.wang@...edance.com>
>>>>>>
>>>>>> This borrows the idea from conntrack and will be used for conntrack in
>>>>>> bpf too. Each element in a timeout map has a user-specified timeout
>>>>>> in secs; after it expires, it is automatically removed from the map.
>> [...]
>>>>>> char key[] __aligned(8);
>>>>>> };
>>>>>>
>>>>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
>>>>>>
>>>>>> for (i = 0; i < htab->n_buckets; i++) {
>>>>>> INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
>>>>>> + atomic_set(&htab->buckets[i].pending, 0);
>>>>>> if (htab_use_raw_lock(htab)) {
>>>>>> raw_spin_lock_init(&htab->buckets[i].raw_lock);
>>>>>> lockdep_set_class(&htab->buckets[i].raw_lock,
>>>>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
>>>>>> +{
>>>>>> + if (atomic_fetch_or(1, &b->pending))
>>>>>> + return;
>>>>>> + llist_add(&b->gc_node, &htab->gc_list);
>>>>>> + queue_work(system_unbound_wq, &htab->gc_work);
>>>>>> +}
>>>>>
>>>>> I'm concerned about each bucket being scheduled individually... And
>>>>> similarly concerned that each instance of TIMEOUT_HASH will do its own
>>>>> scheduling independently. Can you think of a way to have a "global"
>>>>> gc/purging logic, and just make sure that buckets that need processing
>>>>> are internally chained together? So the purging routine would iterate
>>>>> all the scheduled hashmaps, and within each it would have a linked list
>>>>> of buckets that need processing. And all of that is done just once each
>>>>> GC period, not N times for N maps or N*M times for N maps with M
>>>>> buckets in each.
>>>>
>>>> Our internal discussion actually went in the opposite direction: people
>>>> here argued that one work item is not sufficient for a hashtable, because
>>>> there could be millions of entries (max_entries, which is also the number
>>>> of buckets). ;)
>>>
>>> I was hoping that it's possible to expire elements without iterating
>>> the entire hash table every single time, touching only the items that
>>> need to be processed. A hashed timing wheel is one way to do something
>>> like this; the kernel has to solve similar problems with timeouts as
>>> well, so why not take inspiration from there?
>>
>> Couldn't this map be coupled with the LRU map, for example through a flag
>> on map creation, so that the different LRU map flavors can be used with it?
>> For the BPF CT use case we do rely on the LRU map to purge 'inactive'
>> entries once full. I wonder whether for that case you then still need to
>> schedule a GC at all.. e.g. if you hit the condition
>> time_after_eq64(now, entry->expires), you'd just re-link the expired
>> element from the public htab to e.g. the LRU's local CPU free/pending-list
>> instead.
>
> I doubt we can use size as a limit to kick off GC or LRU; it must be
> time-based. And in the idle case, there has to be an async GC, right?
I was thinking no GC at all, meaning the above-mentioned re-linking of expired
elements would be done lazily, e.g. whenever we walk a given bucket (on
lookup/update/delete), under the assumption that the lists there are not deep
(so the time comparison stays cheap) and that element migration has low
overhead (e.g. a move to the local CPU free-list).
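
Something along these lines, as a rough and untested sketch only; it assumes
the 'expires' field from your patch, reuses htab_lru_push_free() as the
migration step into the LRU's local free list, and elem_expired() is just a
made-up helper name for illustration:

  /* Sketch: lazy expiry while walking a bucket, no async GC work item.
   * Assumes bpf_htab/bucket/htab_elem layout from kernel/bpf/hashtab.c
   * plus a per-element 'expires' deadline set on update.
   */
  static bool elem_expired(const struct htab_elem *elem, u64 now)
  {
  	return time_after_eq64(now, elem->expires);
  }

  static void htab_bucket_expire_lazy(struct bpf_htab *htab, struct bucket *b)
  {
  	struct hlist_nulls_node *n;
  	struct htab_elem *elem;
  	u64 now = get_jiffies_64();

  	/* Called by the walker (lookup/update/delete) with the bucket lock
  	 * already held. Expired entries are unlinked from the public list
  	 * and handed back to the LRU's local free/pending list.
  	 */
  	hlist_nulls_for_each_entry_safe(elem, n, &b->head, hash_node) {
  		if (!elem_expired(elem, now))
  			continue;
  		hlist_nulls_del_rcu(&elem->hash_node);
  		htab_lru_push_free(htab, elem);
  	}
  }

With shallow buckets this keeps the extra cost per walk to one time
comparison per element, and the LRU side reclaims the memory as it already
does today.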