[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7ba9b492-8a08-a1d0-9c6e-03be4b8e5e07@fb.com>
Date: Fri, 30 Aug 2019 22:38:16 +0000
From: Yonghong Song <yhs@...com>
To: Jakub Kicinski <jakub.kicinski@...ronome.com>
CC: Alexei Starovoitov <ast@...com>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Brian Vazquez <brianvv@...gle.com>,
Daniel Borkmann <daniel@...earbox.net>,
Kernel Team <Kernel-team@...com>,
Quentin Monnet <quentin.monnet@...ronome.com>
Subject: Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
On 8/30/19 2:35 PM, Jakub Kicinski wrote:
> On Fri, 30 Aug 2019 07:25:54 +0000, Yonghong Song wrote:
>> On 8/29/19 11:39 AM, Jakub Kicinski wrote:
>>> On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:
>>>> Brian Vazquez has proposed BPF_MAP_DUMP command to look up more than one
>>>> map entries per syscall.
>>>> https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t
>>>>
>>>> During discussion, we found more use cases can be supported in a similar
>>>> map operation batching framework. For example, batched map lookup and delete,
>>>> which can be really helpful for bcc.
>>>> https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
>>>> https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
>>>>
>>>> Also, in bcc, we have API to delete all entries in a map.
>>>> https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
>>>>
>>>> For map update, batched operations also useful as sometimes applications need
>>>> to populate initial maps with more than one entry. For example, the below
>>>> example is from kernel/samples/bpf/xdp_redirect_cpu_user.c:
>>>> https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
>>>>
>>>> This patch addresses all the above use cases. To make uapi stable, it also
>>>> covers other potential use cases. Four bpf syscall subcommands are introduced:
>>>> BPF_MAP_LOOKUP_BATCH
>>>> BPF_MAP_LOOKUP_AND_DELETE_BATCH
>>>> BPF_MAP_UPDATE_BATCH
>>>> BPF_MAP_DELETE_BATCH
>>>>
>>>> In userspace, application can iterate through the whole map one batch
>>>> as a time, e.g., bpf_map_lookup_batch() in the below:
>>>> p_key = NULL;
>>>> p_next_key = &key;
>>>> while (true) {
>>>> err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
>>>> &batch_size, elem_flags, flags);
>>>> if (err) ...
>>>> if (p_next_key) break; // done
>>>> if (!p_key) p_key = p_next_key;
>>>> }
>>>> Please look at individual patches for details of new syscall subcommands
>>>> and examples of user codes.
>>>>
>>>> The testing is also done in a qemu VM environment:
>>>> measure_lookup: max_entries 1000000, batch 10, time 342ms
>>>> measure_lookup: max_entries 1000000, batch 1000, time 295ms
>>>> measure_lookup: max_entries 1000000, batch 1000000, time 270ms
>>>> measure_lookup: max_entries 1000000, no batching, time 1346ms
>>>> measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
>>>> measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
>>>> measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
>>>> measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
>>>> measure_delete: max_entries 1000000, batch, time 220ms
>>>> measure_delete: max_entries 1000000, not batch, time 1289ms
>>>> For a 1M entry hash table, batch size of 10 can reduce cpu time
>>>> by 70%. Please see patch "tools/bpf: measure map batching perf"
>>>> for details of test codes.
>>>
>>> Hi Yonghong!
>>>
>>> great to see this, we have been looking at implementing some way to
>>> speed up map walks as well.
>>>
>>> The direction we were looking in, after previous discussions [1],
>>> however, was to provide a BPF program which can run the logic entirely
>>> within the kernel.
>>>
>>> We have a rough PoC on the FW side (we can offload the program which
>>> walks the map, which is pretty neat), but the kernel verifier side
>>> hasn't really progressed. It will soon.
>>>
>>> The rough idea is that the user space provides two programs, "filter"
>>> and "dumper":
>>>
>>> bpftool map exec id XYZ filter pinned /some/prog \
>>> dumper pinned /some/other_prog
>>>
>>> Both programs get this context:
>>>
>>> struct map_op_ctx {
>>> u64 key;
>>> u64 value;
>>> }
>>>
>>> We need a per-map implementation of the exec side, but roughly maps
>>> would do:
>>>
>>> LIST_HEAD(deleted);
>>>
>>> for entry in map {
>>> struct map_op_ctx {
>>> .key = entry->key,
>>> .value = entry->value,
>>> };
>>>
>>> act = BPF_PROG_RUN(filter, &map_op_ctx);
>>> if (act & ~ACT_BITS)
>>> return -EINVAL;
>>>
>>> if (act & DELETE) {
>>> map_unlink(entry);
>>> list_add(entry, &deleted);
>>> }
>>> if (act & STOP)
>>> break;
>>> }
>>>
>>> synchronize_rcu();
>>>
>>> for entry in deleted {
>>> struct map_op_ctx {
>>> .key = entry->key,
>>> .value = entry->value,
>>> };
>>>
>>> BPF_PROG_RUN(dumper, &map_op_ctx);
>>> map_free(entry);
>>> }
>>>
>>> The filter program can't perform any map operations other than lookup,
>>> otherwise we won't be able to guarantee that we'll walk the entire map
>>> (if the filter program deletes some entries in a unfortunate order).
>>
>> Looks like you will provide a new program type and per-map
>> implementation of above code. My patch set indeed avoided per-map
>> implementation for all of lookup/delete/get-next-key...
>
> Indeed, the simple batched ops are undeniably lower LoC.
>
>>> If user space just wants a pure dump it can simply load a program which
>>> dumps the entries into a perf ring.
>>
>> percpu perf ring is not really ideal for user space which simply just
>> want to get some key/value pairs back. Some kind of generate non-per-cpu
>> ring buffer might be better for such cases.
>
> I don't think it had to be per-cpu, but I may be blissfully ignorant
> about the perf ring details :) bpf_perf_event_output() takes flags,
> which are effectively selecting the "output CPU", no?
Right, it does not need to be per-cpu. One particular cpu
can be selected. Binding to which cpu might be always
subject to debate like why this cpu, not another cpu.
This works, and I am just thinking a ring buffer without
binding to cpu is better and less confusion to user.
But this may need yet another ring buffer implementation
in the kernel and people might not like it.
Powered by blists - more mailing lists