lists.openwall.net - Open Source and information security mailing list archives
Date:   Tue, 12 May 2020 07:39:18 -0700
From:   Yonghong Song <yhs@...com>
To:     Andrii Nakryiko <andrii.nakryiko@...il.com>
CC:     Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner
 infrastructure



On 5/11/20 8:29 PM, Andrii Nakryiko wrote:
> On Sat, May 9, 2020 at 10:10 AM Yonghong Song <yhs@...com> wrote:
>>
>>
>>
>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>> While working on the BPF ringbuf implementation, testing, and
>>> benchmarking, I've developed a fairly generic and modular benchmark
>>> runner, which seems to be broadly useful, as I've already used it for one
>>> more purpose (finding the fastest way to trigger a BPF program, to
>>> minimize the overhead of in-kernel code).
>>>
>>> This patch adds the generic part of the benchmark runner and sets up the
>>> Makefile for extending it with more sets of benchmarks.
>>>
>>> The benchmark runner itself operates by spinning up a specified number of
>>> producer and consumer threads and setting up an interval timer that sends
>>> a SIGALRM signal to the application once a second. Every second, a
>>> snapshot of the hits/drops counters is collected and stored in an array.
>>> Drops are useful for producer/consumer benchmarks in which the producer
>>> might overwhelm the consumers.
>>>
>>> Once the test finishes after the given number of warm-up and testing
>>> seconds, the mean and stddev are calculated (ignoring warm-up results)
>>> and printed to stdout. This setup seems to give consistent and accurate
>>> results.
>>>
>>> To validate the behavior, I added two atomic counting tests: global and
>>> local. For the global one, all producer threads atomically increment the
>>> same counter as fast as possible. This, of course, leads to a huge drop
>>> in performance once there is more than one producer thread, due to CPUs
>>> fighting over the same memory location.
>>>
>>> Local counting, on the other hand, maintains one counter per producer
>>> thread, incremented independently. Once per second, all counters are read
>>> and added together to form the final "counting throughput" measurement.
>>> As expected, such a setup demonstrates linear scalability with the number
>>> of producers (as long as there are enough physical CPU cores, of course).
>>> See example output below. This setup can also nicely demonstrate the
>>> disastrous effects of false sharing, if care is not taken to separate the
>>> per-producer counters into independent cache lines.
>>>
>>> The demo output shows the global counter first with 1 producer, then with
>>> 4. Both total and per-producer performance drop significantly. The last
>>> run is the local counter with 4 producers, demonstrating near-perfect
>>> scalability.
>>>
>>> $ ./bench -a -w1 -d2 -p1 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
>>> Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
>>> Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
>>> Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
>>> Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
>>> Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
>>> Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
>>> Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
>>> Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-local
>>> Setting up benchmark 'count-local'...
>>> Benchmark 'count-local' started.
>>> Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
>>> Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
>>> Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
>>> Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
>>> Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
>>>
>>> Signed-off-by: Andrii Nakryiko <andriin@...com>
>>> ---
>>>    tools/testing/selftests/bpf/.gitignore        |   1 +
>>>    tools/testing/selftests/bpf/Makefile          |  13 +-
>>>    tools/testing/selftests/bpf/bench.c           | 372 ++++++++++++++++++
>>>    tools/testing/selftests/bpf/bench.h           |  74 ++++
>>>    .../selftests/bpf/benchs/bench_count.c        |  91 +++++
>>>    5 files changed, 550 insertions(+), 1 deletion(-)
>>>    create mode 100644 tools/testing/selftests/bpf/bench.c
>>>    create mode 100644 tools/testing/selftests/bpf/bench.h
>>>    create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
>>>
> 
> [...]
> 
> trimming is good :)
> 
>>> +
>>> +void hits_drops_report_final(struct bench_res res[], int res_cnt)
>>> +{
>>> +     int i;
>>> +     double hits_mean = 0.0, drops_mean = 0.0;
>>> +     double hits_stddev = 0.0, drops_stddev = 0.0;
>>> +
>>> +     for (i = 0; i < res_cnt; i++) {
>>> +             hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
>>> +             drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
>>> +     }
>>> +
>>> +     if (res_cnt > 1)  {
>>> +             for (i = 0; i < res_cnt; i++) {
>>> +                     hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
>>> +                                    (hits_mean - res[i].hits / 1000000.0) /
>>> +                                    (res_cnt - 1.0);
>>> +                     drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
>>> +                                     (drops_mean - res[i].drops / 1000000.0) /
>>> +                                     (res_cnt - 1.0);
>>> +             }
>>> +             hits_stddev = sqrt(hits_stddev);
>>> +             drops_stddev = sqrt(drops_stddev);
>>> +     }
>>> +     printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>> +            hits_mean, hits_stddev, hits_mean / env.producer_cnt);
>>> +     printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
>>> +            drops_mean, drops_stddev);
>>
>> The Unicode escape \u00B1 (for +/-) may cause warnings on some old
>> compilers (e.g., gcc 4.8.5), as it needs C99 mode.
>>
>> /data/users/yhs/work/net-next/tools/testing/selftests/bpf/bench.c:91:9:
>> warning: universal character names are only valid in C++ and C99
>> [enabled by default]
>>     printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>
>> "+|-" is an alternative, but not as good as \u00B1, indeed. Newer
>> compilers may already default to C99. Maybe we can just add C99 to the
>> build flags for `bench`?
> 
> I think I'm fine with an ancient compiler emitting a harmless warning
> for code under selftests/bpf, honestly...
> 
>>
>>> +}
>>> +
>>> +const char *argp_program_version = "benchmark";
>>> +const char *argp_program_bug_address = "<bpf@...r.kernel.org>";
>>> +const char argp_program_doc[] =
>>> +"benchmark    Generic benchmarking framework.\n"
>>> +"\n"
>>> +"This tool runs benchmarks.\n"
>>> +"\n"
>>> +"USAGE: benchmark <bench-name>\n"
>>> +"\n"
>>> +"EXAMPLES:\n"
>>> +"    # run 'count-local' benchmark with 1 producer and 1 consumer\n"
>>> +"    benchmark count-local\n"
>>> +"    # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
>>> +"    benchmark -p16 -c8 -a count-local\n";
>>
>> Some of the above global variables should probably be static.
>> But I do not have a strong preference on this.
> 
> Oh, these are actually global variables the argp library expects; they
> have to be global.
> 
>>
>>> +
>>> +static const struct argp_option opts[] = {
>>> +     { "list", 'l', NULL, 0, "List available benchmarks"},
>>> +     { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
>>> +     { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
>>> +     { "producers", 'p', "NUM", 0, "Number of producer threads"},
>>> +     { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
>>> +     { "verbose", 'v', NULL, 0, "Verbose debug output"},
>>> +     { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
>>> +     { "b2b", 'b', NULL, 0, "Back-to-back mode"},
>>> +     { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
>>
>> I did not see the b2b and rb-output options being processed in this file.
> 
> They slipped through the rebasing cracks from the future ringbuf
> benchmarks; will remove them. I also figured out a way to do this more
> modularly anyway (child parsers in argp).
> 
>>
>>> +     {},
>>> +};
>>> +
> 
> [...]
> 
>>> +     for (i = 0; i < env.consumer_cnt; i++) {
>>> +             err = pthread_create(&state.consumers[i], NULL,
>>> +                                  bench->consumer_thread, (void *)(long)i);
>>> +             if (err) {
>>> +                     fprintf(stderr, "failed to create consumer thread #%d: %d\n",
>>> +                             i, -errno);
>>> +                     exit(1);
>>> +             }
>>> +             if (env.affinity)
>>> +                     set_thread_affinity(state.consumers[i], i);
>>> +     }
>>> +     for (i = 0; i < env.producer_cnt; i++) {
>>> +             err = pthread_create(&state.producers[i], NULL,
>>> +                                  bench->producer_thread, (void *)(long)i);
>>> +             if (err) {
>>> +                     fprintf(stderr, "failed to create producer thread #%d: %d\n",
>>> +                             i, -errno);
>>> +                     exit(1);
>>> +             }
>>> +             if (env.affinity)
>>> +                     set_thread_affinity(state.producers[i],
>>> +                                         env.consumer_cnt + i);
>>
>> Here, we bind consumer threads first and then producer threads. I think
>> this is probably just an arbitrary choice?
> 
> yep, almost arbitrary. Most of my cases have 1 consumer and >=1
> producers, so it was convenient to have the consumer pinned to the same
> CPU, regardless of how many producers I have.
> 
>>
>> In certain cases, I think people may want more advanced binding
>> scenarios, e.g., for hyperthreading, binding consumer and producer to
>> the same core or to different cores, etc. One possibility is to
>> introduce a -c option similar to taskset. If -c is not supplied, you
>> can keep the current default; otherwise, use the -c list.
>>
> 
> well, taskset's job is simpler: it takes a list of CPUs for a single
> PID, if I understand correctly. Here we have many threads, each of which
> might have a different CPU or even CPUs. But I agree that for some
> benchmarks it's going to be critical to control this precisely. Here's
> how I'm going to allow the most flexibility without too much complexity.
> 
> --prod-affinity 1,2,10-16,100 -- will specify a set of CPUs for
> producers. The first producer will use the CPU with the lowest index
> from that list. The second will take the second, and so on. If fewer
> CPUs are provided than necessary, it's an error. If more, it's fine.
> 
> Then for consumers I'll add an independent --cons-affinity parameter,
> which will do the same for consumer threads.
> 
> Having two independent lists will allow testing scenarios where we
> want producers and consumers to fight for the same CPU.
> 
> Does this sound ok?

This sounds good -- even more than what I asked for.

> 
>>> +     }
>>> +
>>> +     printf("Benchmark '%s' started.\n", bench->name);
>>> +}
>>> +
> 
> [...]
