netdev - Re: [PATCH bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzbO_Ya7uCcRdrOECnC4rV82OXqE97XoFVyumi8WNBNZMQ@mail.gmail.com>
Date:   Fri, 8 May 2020 10:59:17 -0700
From:   Andrii Nakryiko <andrii.nakryiko@...il.com>
To:     John Fastabend <john.fastabend@...il.com>
Cc:     Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Kernel Team <kernel-team@...com>
Subject: Re: [PATCH bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure

On Fri, May 8, 2020 at 8:49 AM John Fastabend <john.fastabend@...il.com> wrote:
>
> Andrii Nakryiko wrote:
> > While working on BPF ringbuf implementation, testing, and benchmarking, I've
> > developed a pretty generic and modular benchmark runner, which seems to be
> > generically useful, as I've already used it for one more purpose (testing
> > fastest way to trigger BPF program, to minimize overhead of in-kernel code).
> >
> > This patch adds generic part of benchmark runner and sets up Makefile for
> > extending it with more sets of benchmarks.
>
> Seems useful.

thanks :)

>
> >
> > Benchmarker itself operates by spinning up specified number of producer and
> > consumer threads, setting up interval timer sending SIGALARM signal to
> > application once a second. Every second, current snapshot with hits/drops
> > counters are collected and stored in an array. Drops are useful for
> > producer/consumer benchmarks in which producer might overwhelm consumers.
> >
> > Once test finishes after given amount of warm-up and testing seconds, mean and
> > stddev are calculated (ignoring warm-up results) and is printed out to stdout.
> > This setup seems to give consistent and accurate results.
> >
> > To validate behavior, I added two atomic counting tests: global and local.
> > For global one, all the producer threads are atomically incrementing same
> > counter as fast as possible. This, of course, leads to huge drop of
> > performance once there is more than one producer thread due to CPUs fighting
> > for the same memory location.
> >
> > Local counting, on the other hand, maintains one counter per each producer
> > thread, incremented independently. Once per second, all counters are read and
> > added together to form final "counting throughput" measurement. As expected,
> > such setup demonstrates linear scalability with number of producers (as long
> > as there are enough physical CPU cores, of course). See example output below.
> > Also, this setup can nicely demonstrate disastrous effects of false sharing,
> > if care is not taken to take those per-producer counters apart into
> > independent cache lines.
> >
> > Demo output shows global counter first with 1 producer, then with 4. Both
> > total and per-producer performance significantly drop. The last run is local
> > counter with 4 producers, demonstrating near-perfect scalability.
> >
> > $ ./bench -a -w1 -d2 -p1 count-global
> > Setting up benchmark 'count-global'...
> > Benchmark 'count-global' started.
> > Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
> > Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
> > Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
> > Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
> > Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
> >
> > $ ./bench -a -w1 -d2 -p4 count-global
> > Setting up benchmark 'count-global'...
> > Benchmark 'count-global' started.
> > Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
> > Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
> > Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
> > Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
> > Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
> >
> > $ ./bench -a -w1 -d2 -p4 count-local
> > Setting up benchmark 'count-local'...
> > Benchmark 'count-local' started.
> > Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
> > Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
> > Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
> > Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
> > Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
> >
> > Signed-off-by: Andrii Nakryiko <andriin@...com>
> > ---
>
> Couple nits but otherwise lgtm. I think it should probably be moved
> into its own directory though ./bpf/bench/

I assume you are talking about benchmark implementations themselve,
all those bench_xxx.c files, right? bench.c probably should stay in
selftests/bpf root.

>
> The other question would be how much stuff do we want to live in
> selftests vs outside selftests/bpf but I think its fine and makes
> it easy to build small benchmark programs in ./bpf/progs/

selftests/bpf Makefile is so convenient for BPF/skeleton/user-space
building, libbpf, vmlinux.h generation, etc, that moving this outside
would be a major pain and lots of extra work. Adding this benchmark
was trivial from Makefile modification point of view (and no debugging
of Makefile either, everything just worked).

>
> >  tools/testing/selftests/bpf/.gitignore    |   1 +
> >  tools/testing/selftests/bpf/Makefile      |  11 +-
> >  tools/testing/selftests/bpf/bench.c       | 364 ++++++++++++++++++++++
> >  tools/testing/selftests/bpf/bench.h       |  74 +++++
> >  tools/testing/selftests/bpf/bench_count.c |  91 ++++++
> >  5 files changed, 540 insertions(+), 1 deletion(-)
> >  create mode 100644 tools/testing/selftests/bpf/bench.c
> >  create mode 100644 tools/testing/selftests/bpf/bench.h
> >  create mode 100644 tools/testing/selftests/bpf/bench_count.c
> >
> > diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
> > index 3ff031972975..1bb204cee853 100644
> > --- a/tools/testing/selftests/bpf/.gitignore
> > +++ b/tools/testing/selftests/bpf/.gitignore
> > @@ -38,3 +38,4 @@ test_cpp
> >  /bpf_gcc
> >  /tools
> >  /runqslower
> > +/bench
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 3d942be23d09..ab03362d46e4 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -77,7 +77,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
> >  # Compile but not part of 'make run_tests'
> >  TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
> >       flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
> > -     test_lirc_mode2_user xdping test_cpp runqslower
> > +     test_lirc_mode2_user xdping test_cpp runqslower bench
> >
> >  TEST_CUSTOM_PROGS = urandom_read
> >
> > @@ -405,6 +405,15 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
> >       $(call msg,CXX,,$@)
> >       $(CXX) $(CFLAGS) $^ $(LDLIBS) -o $@
> >
> > +# Benchmark runner
> > +$(OUTPUT)/bench.o:          bench.h
> > +$(OUTPUT)/bench_count.o:    bench.h
> > +$(OUTPUT)/bench: LDLIBS += -lm
> > +$(OUTPUT)/bench: $(OUTPUT)/bench.o \
> > +              $(OUTPUT)/bench_count.o
> > +     $(call msg,BINARY,,$@)
> > +     $(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
> > +
> >  EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR)                   \
> >       prog_tests/tests.h map_tests/tests.h verifier/tests.h           \
> >       feature                                                         \
> > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > new file mode 100644
> > index 000000000000..a20482bb74e2
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/bench.c
> > @@ -0,0 +1,364 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2020 Facebook */
> > +#define _GNU_SOURCE
> > +#include <argp.h>
> > +#include <linux/compiler.h>
> > +#include <sys/time.h>
> > +#include <sched.h>
> > +#include <fcntl.h>
> > +#include <pthread.h>
> > +#include <sys/sysinfo.h>
> > +#include <sys/resource.h>
> > +#include <signal.h>
> > +#include "bench.h"
> > +
> > +struct env env = {
> > +     .duration_sec = 10,
> > +     .warmup_sec = 5,
>
> Just curious I'm guessing the duration/warmap are arbitrary here? Seems
> a bit long I would bet 5,1 would be enough for global/local test at
> least.

Yeah, completely arbitrary. I started with just 1 second, but for some
benchmark stabilization came around second 4-5, so I bumped it to 5.
It's easy to modify with -d and -w arguments, but I can bump it down
to 5, 1 for defaults.

>
> > +     .affinity = false,
> > +     .consumer_cnt = 1,
> > +     .producer_cnt = 1,
> > +};
> > +
>
> [...]
>
> > +void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
> > +{
> > +     double hits_per_sec, drops_per_sec;
> > +     double hits_per_prod;
> > +
> > +     hits_per_sec = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
> > +     hits_per_prod = hits_per_sec / env.producer_cnt;
>
> Per producer counts would also be useful. Averaging over producer cnt could
> hide issues with fairness.

True about hiding fairness issues, but for benchmarks with lots of
producers, it's so many numbres, that it will be hard to interpret it
per-producer. We could probably add stddev calculation across multiple
producers and stuff like that, but I'd defer that to future
enhancements. This benchmarker is a side-product of BPF ringbuf work,
not the goal in itself.

>
> > +     drops_per_sec = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
> > +
> > +     printf("Iter %3d (%7.3lfus): ",
> > +            iter, (delta_ns - 1000000000) / 1000.0);
> > +
> > +     printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
> > +            hits_per_sec, hits_per_prod, drops_per_sec);
> > +}
> > +
>
> [...]
>
> > +const char *argp_program_version = "benchmark";
> > +const char *argp_program_bug_address = "<bpf@...r.kernel.org>";
> > +const char argp_program_doc[] =
> > +"benchmark    Generic benchmarking framework.\n"
> > +"\n"
> > +"This tool runs benchmarks.\n"
> > +"\n"
> > +"USAGE: benchmark <mode>\n"
> > +"\n"
> > +"EXAMPLES:\n"
> > +"    benchmark count-local                # run 'count-local' benchmark with 1 producer and 1 consumer\n"
> > +"    benchmark -p16 -c8 -a count-local    # run 'count-local' benchmark with 16 producer and 8 consumer threads, pinned to CPUs\n";
> > +
> > +static const struct argp_option opts[] = {
> > +     { "mode", 'm', "MODE", 0, "Benchmark mode"},
>
> "Benchmark mode" hmm not sure what this is for yet. Only on
> first patch though so maybe I'll become enlightened?

Oh, actually I don't need it, it's just a positional argument, I'll
drop this line.

>
> > +     { "list", 'l', NULL, 0, "List available benchmarks"},
> > +     { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
> > +     { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
> > +     { "producers", 'p', "NUM", 0, "Number of producer threads"},
> > +     { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> > +     { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > +     { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > +     { "b2b", 'b', NULL, 0, "Back-to-back mode"},
> > +     { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
> > +     {},
> > +};
>
> [...]
>
> > +
> > +static void set_thread_affinity(pthread_t thread, int cpu)
> > +{
> > +     cpu_set_t cpuset;
> > +
> > +     CPU_ZERO(&cpuset);
> > +     CPU_SET(cpu, &cpuset);
> > +     if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset))
> > +             printf("setting affinity to CPU #%d failed: %d\n", cpu, errno);
> > +}
>
> Should we error out on affinity errors?

Given I made affinity setting optional (in the end), I guess I could
make it fail.

>
> > +
> > +static struct bench_state {
> > +     int res_cnt;
> > +     struct bench_res *results;
> > +     pthread_t *consumers;
> > +     pthread_t *producers;
> > +} state;
>
> [...]
>
> > +
> > +static void setup_benchmark()
> > +{
> > +     int i, err;
> > +
> > +     if (!env.mode) {
> > +             fprintf(stderr, "benchmark mode is not specified\n");
> > +             exit(1);
> > +     }
> > +
> > +     for (i = 0; i < ARRAY_SIZE(benchs); i++) {
> > +             if (strcmp(benchs[i]->name, env.mode) == 0) {
>
> Ah the mode. OK maybe in description call it, "Benchmark mode to run" or
> "Benchmark test"? Or leave it its probably fine.

How about bench_name?


>
> > +                     bench = benchs[i];
> > +                     break;
> > +             }
> > +     }
> > +     if (!bench) {
> > +             fprintf(stderr, "benchmark '%s' not found\n", env.mode);
> > +             exit(1);
> > +     }
> > +
> > +     printf("Setting up benchmark '%s'...\n", bench->name);
> > +
> > +     state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
> > +     state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
> > +     state.results = calloc(env.duration_sec + env.warmup_sec + 2,
> > +                            sizeof(*state.results));
> > +     if (!state.producers || !state.consumers || !state.results)
> > +             exit(1);
> > +
> > +     if (bench->validate)
> > +             bench->validate();
> > +     if (bench->setup)
> > +             bench->setup();
> > +
> > +     for (i = 0; i < env.consumer_cnt; i++) {
> > +             err = pthread_create(&state.consumers[i], NULL,
> > +                                  bench->consumer_thread, (void *)(long)i);
> > +             if (err) {
> > +                     fprintf(stderr, "failed to create consumer thread #%d: %d\n",
> > +                             i, -errno);
> > +                     exit(1);
> > +             }
> > +             if (env.affinity)
> > +                     set_thread_affinity(state.consumers[i], i);
> > +     }
> > +     for (i = 0; i < env.producer_cnt; i++) {
> > +             err = pthread_create(&state.producers[i], NULL,
> > +                                  bench->producer_thread, (void *)(long)i);
> > +             if (err) {
> > +                     fprintf(stderr, "failed to create producer thread #%d: %d\n",
> > +                             i, -errno);
> > +                     exit(1);
> > +             }
> > +             if (env.affinity)
> > +                     set_thread_affinity(state.producers[i],
> > +                                         env.consumer_cnt + i);
> > +     }
> > +
> > +     printf("Benchmark '%s' started.\n", bench->name);
> > +}
>
> [...]
>
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/bench_count.c
>
> How about a ./bpf/bench/ directory? Seems we are going to get a few
> bench_* tests here.
>

Sounds good to me, I'll move.