[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190830143508.73c30631@cakuba.netronome.com>
Date: Fri, 30 Aug 2019 14:35:08 -0700
From: Jakub Kicinski <jakub.kicinski@...ronome.com>
To: Yonghong Song <yhs@...com>
Cc: Alexei Starovoitov <ast@...com>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Brian Vazquez <brianvv@...gle.com>,
Daniel Borkmann <daniel@...earbox.net>,
Kernel Team <Kernel-team@...com>,
Quentin Monnet <quentin.monnet@...ronome.com>
Subject: Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
On Fri, 30 Aug 2019 07:25:54 +0000, Yonghong Song wrote:
> On 8/29/19 11:39 AM, Jakub Kicinski wrote:
> > On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:
> >> Brian Vazquez has proposed BPF_MAP_DUMP command to look up more than one
> >> map entries per syscall.
> >> https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t
> >>
> >> During discussion, we found more use cases can be supported in a similar
> >> map operation batching framework. For example, batched map lookup and delete,
> >> which can be really helpful for bcc.
> >> https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
> >> https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
> >>
> >> Also, in bcc, we have API to delete all entries in a map.
> >> https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
> >>
> >> For map update, batched operations also useful as sometimes applications need
> >> to populate initial maps with more than one entry. For example, the below
> >> example is from kernel/samples/bpf/xdp_redirect_cpu_user.c:
> >> https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
> >>
> >> This patch addresses all the above use cases. To make uapi stable, it also
> >> covers other potential use cases. Four bpf syscall subcommands are introduced:
> >> BPF_MAP_LOOKUP_BATCH
> >> BPF_MAP_LOOKUP_AND_DELETE_BATCH
> >> BPF_MAP_UPDATE_BATCH
> >> BPF_MAP_DELETE_BATCH
> >>
> >> In userspace, application can iterate through the whole map one batch
> >> as a time, e.g., bpf_map_lookup_batch() in the below:
> >> p_key = NULL;
> >> p_next_key = &key;
> >> while (true) {
> >> err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
> >> &batch_size, elem_flags, flags);
> >> if (err) ...
> >> if (p_next_key) break; // done
> >> if (!p_key) p_key = p_next_key;
> >> }
> >> Please look at individual patches for details of new syscall subcommands
> >> and examples of user codes.
> >>
> >> The testing is also done in a qemu VM environment:
> >> measure_lookup: max_entries 1000000, batch 10, time 342ms
> >> measure_lookup: max_entries 1000000, batch 1000, time 295ms
> >> measure_lookup: max_entries 1000000, batch 1000000, time 270ms
> >> measure_lookup: max_entries 1000000, no batching, time 1346ms
> >> measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
> >> measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
> >> measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
> >> measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
> >> measure_delete: max_entries 1000000, batch, time 220ms
> >> measure_delete: max_entries 1000000, not batch, time 1289ms
> >> For a 1M entry hash table, batch size of 10 can reduce cpu time
> >> by 70%. Please see patch "tools/bpf: measure map batching perf"
> >> for details of test codes.
> >
> > Hi Yonghong!
> >
> > great to see this, we have been looking at implementing some way to
> > speed up map walks as well.
> >
> > The direction we were looking in, after previous discussions [1],
> > however, was to provide a BPF program which can run the logic entirely
> > within the kernel.
> >
> > We have a rough PoC on the FW side (we can offload the program which
> > walks the map, which is pretty neat), but the kernel verifier side
> > hasn't really progressed. It will soon.
> >
> > The rough idea is that the user space provides two programs, "filter"
> > and "dumper":
> >
> > bpftool map exec id XYZ filter pinned /some/prog \
> > dumper pinned /some/other_prog
> >
> > Both programs get this context:
> >
> > struct map_op_ctx {
> > u64 key;
> > u64 value;
> > }
> >
> > We need a per-map implementation of the exec side, but roughly maps
> > would do:
> >
> > LIST_HEAD(deleted);
> >
> > for entry in map {
> > struct map_op_ctx {
> > .key = entry->key,
> > .value = entry->value,
> > };
> >
> > act = BPF_PROG_RUN(filter, &map_op_ctx);
> > if (act & ~ACT_BITS)
> > return -EINVAL;
> >
> > if (act & DELETE) {
> > map_unlink(entry);
> > list_add(entry, &deleted);
> > }
> > if (act & STOP)
> > break;
> > }
> >
> > synchronize_rcu();
> >
> > for entry in deleted {
> > struct map_op_ctx {
> > .key = entry->key,
> > .value = entry->value,
> > };
> >
> > BPF_PROG_RUN(dumper, &map_op_ctx);
> > map_free(entry);
> > }
> >
> > The filter program can't perform any map operations other than lookup,
> > otherwise we won't be able to guarantee that we'll walk the entire map
> > (if the filter program deletes some entries in a unfortunate order).
>
> Looks like you will provide a new program type and per-map
> implementation of above code. My patch set indeed avoided per-map
> implementation for all of lookup/delete/get-next-key...
Indeed, the simple batched ops are undeniably lower LoC.
> > If user space just wants a pure dump it can simply load a program which
> > dumps the entries into a perf ring.
>
> percpu perf ring is not really ideal for user space which simply just
> want to get some key/value pairs back. Some kind of generate non-per-cpu
> ring buffer might be better for such cases.
I don't think it had to be per-cpu, but I may be blissfully ignorant
about the perf ring details :) bpf_perf_event_output() takes flags,
which are effectively selecting the "output CPU", no?
> > I'm bringing this up because that mechanism should cover what is
> > achieved with this patch set and much more.
>
> The only case it did not cover is batched update. But that may not
> be super critical.
Right, my other concern (which admittedly is slightly pedantic) is the
potential loss of modifications. the lookup_and_delete() operation does
not guarantee that the result returned to user space is the "final"
value. We'd need to wait for a RCU grace period between lookup and dump.
The "map is definitely unused now"/RCU guarantee had came up in the
past.
> Your approach give each element an action choice through another bpf
> program. This indeed powerful. My use case is simpler than your use case
> below, hence the implementation.
Agreed, the simplicity is tempting.
> > In particular for networking workloads where old flows have to be
> > pruned from the map periodically it's far more efficient to communicate
> > to user space only the flows which timed out (the delete batching from
> > this set won't help at all).
>
> Maybe LRU map will help in this case? It is designed for such
> use cases.
LRU map would be perfect if it dumped the entry somewhere before it got
reused... Perhaps we need to attach a program to the LRU map that'd get
run when entries get reaped. That'd be cool 🤔 We could trivially reuse
the "dumper" prog type for this.
> > With a 2M entry map and this patch set we still won't be able to prune
> > once a second on one core.
> >
> > [1]
> > https://lore.kernel.org/netdev/20190813130921.10704-4-quentin.monnet@netronome.com/
Powered by blists - more mailing lists