Message-ID: <80ca0fba-8a30-4f1f-6bf3-7ccaa1fa8d69@fb.com>
Date: Tue, 12 May 2020 07:39:18 -0700
From: Yonghong Song <yhs@...com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
CC: Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
Networking <netdev@...r.kernel.org>,
Alexei Starovoitov <ast@...com>,
Daniel Borkmann <daniel@...earbox.net>,
Kernel Team <kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner
infrastructure
On 5/11/20 8:29 PM, Andrii Nakryiko wrote:
> On Sat, May 9, 2020 at 10:10 AM Yonghong Song <yhs@...com> wrote:
>>
>>
>>
>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>> While working on the BPF ringbuf implementation, testing, and benchmarking,
>>> I've developed a pretty generic and modular benchmark runner, which seems
>>> generally useful, as I've already used it for one more purpose (testing the
>>> fastest way to trigger a BPF program, to minimize the overhead of in-kernel
>>> code).
>>>
>>> This patch adds the generic part of the benchmark runner and sets up the
>>> Makefile for extending it with more sets of benchmarks.
>>>
>>> The benchmark runner itself operates by spinning up a specified number of
>>> producer and consumer threads and setting up an interval timer that sends a
>>> SIGALRM signal to the application once a second. Every second, a snapshot of
>>> the hits/drops counters is collected and stored in an array. Drops are
>>> useful for producer/consumer benchmarks in which producers might overwhelm
>>> consumers.
>>>
>>> Once the test finishes after the given number of warm-up and testing
>>> seconds, the mean and stddev are calculated (ignoring warm-up results) and
>>> printed to stdout. This setup seems to give consistent and accurate results.
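>>>
>>> For illustration, the once-a-second snapshotting can be driven by an
>>> interval timer along these lines (a sketch, not the exact patch code;
>>> collect_measurements() is a hypothetical stand-in for the snapshot logic):
>>>
>>>   #include <signal.h>
>>>   #include <stdlib.h>
>>>   #include <sys/time.h>
>>>
>>>   static void sigalarm_handler(int sig)
>>>   {
>>>           collect_measurements(); /* grab one hits/drops snapshot */
>>>   }
>>>
>>>   static void setup_timer(void)
>>>   {
>>>           /* fire SIGALRM after 1s, then every 1s */
>>>           struct itimerval timer = {
>>>                   .it_value    = { .tv_sec = 1 },
>>>                   .it_interval = { .tv_sec = 1 },
>>>           };
>>>
>>>           signal(SIGALRM, sigalarm_handler);
>>>           if (setitimer(ITIMER_REAL, &timer, NULL))
>>>                   exit(1);
>>>   }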
>>>
>>> To validate the behavior, I added two atomic counting tests: global and
>>> local. For the global one, all the producer threads atomically increment the
>>> same counter as fast as possible. This, of course, leads to a huge drop in
>>> performance once there is more than one producer thread, due to CPUs
>>> fighting over the same memory location.
>>>
>>> Local counting, on the other hand, maintains one counter per producer
>>> thread, incremented independently. Once per second, all counters are read
>>> and added together to form the final "counting throughput" measurement. As
>>> expected, this setup demonstrates linear scalability with the number of
>>> producers (as long as there are enough physical CPU cores, of course). See
>>> the example output below. This setup can also nicely demonstrate the
>>> disastrous effects of false sharing, if care is not taken to place those
>>> per-producer counters on independent cache lines.
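>>>
>>> As a sketch of what keeping the counters apart means (illustrative only;
>>> MAX_PRODUCERS and the function name are not from the patch):
>>>
>>>   /* pad each producer's counter so it sits on its own cache line */
>>>   struct counter {
>>>           long value;
>>>   } __attribute__((aligned(128)));
>>>
>>>   static struct counter hits[MAX_PRODUCERS];
>>>
>>>   static void *producer_local(void *input)
>>>   {
>>>           long idx = (long)input;
>>>
>>>           while (1)
>>>                   __atomic_add_fetch(&hits[idx].value, 1,
>>>                                      __ATOMIC_RELAXED);
>>>           return NULL;
>>>   }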
>>>
>>> The demo output shows the global counter first with 1 producer, then with
>>> 4. Both total and per-producer performance drop significantly. The last run
>>> is the local counter with 4 producers, demonstrating near-perfect
>>> scalability.
>>>
>>> $ ./bench -a -w1 -d2 -p1 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter 0 ( 24.822us): hits 148.179M/s (148.179M/prod), drops 0.000M/s
>>> Iter 1 ( 37.939us): hits 149.308M/s (149.308M/prod), drops 0.000M/s
>>> Iter 2 (-10.774us): hits 150.717M/s (150.717M/prod), drops 0.000M/s
>>> Iter 3 ( 3.807us): hits 151.435M/s (151.435M/prod), drops 0.000M/s
>>> Summary: hits 150.488 ± 1.079M/s (150.488M/prod), drops 0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter 0 ( 60.659us): hits 53.910M/s ( 13.477M/prod), drops 0.000M/s
>>> Iter 1 (-17.658us): hits 53.722M/s ( 13.431M/prod), drops 0.000M/s
>>> Iter 2 ( 5.865us): hits 53.495M/s ( 13.374M/prod), drops 0.000M/s
>>> Iter 3 ( 0.104us): hits 53.606M/s ( 13.402M/prod), drops 0.000M/s
>>> Summary: hits 53.608 ± 0.113M/s ( 13.402M/prod), drops 0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-local
>>> Setting up benchmark 'count-local'...
>>> Benchmark 'count-local' started.
>>> Iter 0 ( 23.388us): hits 640.450M/s (160.113M/prod), drops 0.000M/s
>>> Iter 1 ( 2.291us): hits 605.661M/s (151.415M/prod), drops 0.000M/s
>>> Iter 2 ( -6.415us): hits 607.092M/s (151.773M/prod), drops 0.000M/s
>>> Iter 3 ( -1.361us): hits 601.796M/s (150.449M/prod), drops 0.000M/s
>>> Summary: hits 604.849 ± 2.739M/s (151.212M/prod), drops 0.000 ± 0.000M/s
>>>
>>> Signed-off-by: Andrii Nakryiko <andriin@...com>
>>> ---
>>> tools/testing/selftests/bpf/.gitignore | 1 +
>>> tools/testing/selftests/bpf/Makefile | 13 +-
>>> tools/testing/selftests/bpf/bench.c | 372 ++++++++++++++++++
>>> tools/testing/selftests/bpf/bench.h | 74 ++++
>>> .../selftests/bpf/benchs/bench_count.c | 91 +++++
>>> 5 files changed, 550 insertions(+), 1 deletion(-)
>>> create mode 100644 tools/testing/selftests/bpf/bench.c
>>> create mode 100644 tools/testing/selftests/bpf/bench.h
>>> create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
>>>
>
> [...]
>
> trimming is good :)
>
>>> +
>>> +void hits_drops_report_final(struct bench_res res[], int res_cnt)
>>> +{
>>> + int i;
>>> + double hits_mean = 0.0, drops_mean = 0.0;
>>> + double hits_stddev = 0.0, drops_stddev = 0.0;
>>> +
>>> + for (i = 0; i < res_cnt; i++) {
>>> + hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
>>> + drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
>>> + }
>>> +
>>> + if (res_cnt > 1) {
>>> + for (i = 0; i < res_cnt; i++) {
>>> + hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
>>> + (hits_mean - res[i].hits / 1000000.0) /
>>> + (res_cnt - 1.0);
>>> + drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
>>> + (drops_mean - res[i].drops / 1000000.0) /
>>> + (res_cnt - 1.0);
>>> + }
>>> + hits_stddev = sqrt(hits_stddev);
>>> + drops_stddev = sqrt(drops_stddev);
>>> + }
>>> + printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>> + hits_mean, hits_stddev, hits_mean / env.producer_cnt);
>>> + printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
>>> + drops_mean, drops_stddev);
>>
>> The unicode char \u00B1 (for +|-) may cause warnings with some old
>> compilers (e.g., 4.8.5), as it needs C99 mode.
>>
>> /data/users/yhs/work/net-next/tools/testing/selftests/bpf/bench.c:91:9:
>> warning: universal character names are only valid in C++ and C99
>> [enabled by default]
>> printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>
>> "+|-" is alternative, but not as good as \u00B1 indeed. Newer
>> compiler may already have the default C99. Maybe we can just add
>> C99 for build `bench`?
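>>
>> For reference, another option is the raw UTF-8 byte escape: \u00B1
>> encodes to the bytes 0xC2 0xB1, and spelling those out directly avoids
>> the C99-only \uXXXX syntax (a sketch against the code above):
>>
>>   /* "\xc2\xb1" is the UTF-8 encoding of U+00B1, the plus-minus sign */
>>   printf("Summary: hits %8.3lf \xc2\xb1 %5.3lfM/s (%7.3lfM/prod), ",
>>          hits_mean, hits_stddev, hits_mean / env.producer_cnt);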
>
> I think I'm fine with an ancient compiler emitting a harmless warning
> for code under selftests/bpf, honestly...
>
>>
>>> +}
>>> +
>>> +const char *argp_program_version = "benchmark";
>>> +const char *argp_program_bug_address = "<bpf@...r.kernel.org>";
>>> +const char argp_program_doc[] =
>>> +"benchmark Generic benchmarking framework.\n"
>>> +"\n"
>>> +"This tool runs benchmarks.\n"
>>> +"\n"
>>> +"USAGE: benchmark <bench-name>\n"
>>> +"\n"
>>> +"EXAMPLES:\n"
>>> +" # run 'count-local' benchmark with 1 producer and 1 consumer\n"
>>> +" benchmark count-local\n"
>>> +" # run 'count-local' with 16 producer and 8 consumer thread, pinned to CPUs\n"
>>> +" benchmark -p16 -c8 -a count-local\n";
>>
>> Some of the above global variables should probably be static.
>> But I do not have a strong preference on this.
>
> Oh, those are actually global variables that the argp library expects;
> they have to be global.
>
>>
>>> +
>>> +static const struct argp_option opts[] = {
>>> + { "list", 'l', NULL, 0, "List available benchmarks"},
>>> + { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
>>> + { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
>>> + { "producers", 'p', "NUM", 0, "Number of producer threads"},
>>> + { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
>>> + { "verbose", 'v', NULL, 0, "Verbose debug output"},
>>> + { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
>>> + { "b2b", 'b', NULL, 0, "Back-to-back mode"},
>>> + { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
>>
>> I did not see the b2b and rb-output options being processed in this file.
>
> Slipped through the rebasing cracks from the future ringbuf
> benchmarks, will remove them. I also figured out a way to make this
> more modular anyway (child parsers in argp).
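>
> Roughly, the child-parser shape looks like this (a sketch, not the final
> code; ringbuf_opts and parse_ringbuf_arg are placeholder names):
>
>   static const struct argp bench_ringbuf_argp = {
>           .options = ringbuf_opts,
>           .parser = parse_ringbuf_arg,
>   };
>
>   /* benchmark-specific options get hooked in as argp children */
>   static const struct argp_child bench_parsers[] = {
>           { &bench_ringbuf_argp, 0, "Ring buffer benchmark options:", 0 },
>           {},
>   };
>
>   static const struct argp argp = {
>           .options = opts,
>           .parser = parse_arg,
>           .doc = argp_program_doc,
>           .children = bench_parsers,
>   };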
>
>>
>>> + {},
>>> +};
>>> +
>
> [...]
>
>>> + for (i = 0; i < env.consumer_cnt; i++) {
>>> + err = pthread_create(&state.consumers[i], NULL,
>>> + bench->consumer_thread, (void *)(long)i);
>>> + if (err) {
>>> + fprintf(stderr, "failed to create consumer thread #%d: %d\n",
>>> + i, -errno);
>>> + exit(1);
>>> + }
>>> + if (env.affinity)
>>> + set_thread_affinity(state.consumers[i], i);
>>> + }
>>> + for (i = 0; i < env.producer_cnt; i++) {
>>> + err = pthread_create(&state.producers[i], NULL,
>>> + bench->producer_thread, (void *)(long)i);
>>> + if (err) {
>>> + fprintf(stderr, "failed to create producer thread #%d: %d\n",
>>> + i, -errno);
>>> + exit(1);
>>> + }
>>> + if (env.affinity)
>>> + set_thread_affinity(state.producers[i],
>>> + env.consumer_cnt + i);
>>
>> Here, we bind consumer threads first and then producer threads. I think
>> this is probably just an arbitrary choice?
>
> Yep, almost arbitrary. Most of my cases have 1 consumer and >= 1
> producers, so it was convenient to have the consumer pinned to the same
> CPU regardless of how many producers I have.
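>
> FWIW, set_thread_affinity() is essentially a thin wrapper around
> pthread_setaffinity_np(); roughly (a sketch, not necessarily the exact
> patch code; needs _GNU_SOURCE):
>
>   #define _GNU_SOURCE
>   #include <pthread.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>
>   static void set_thread_affinity(pthread_t thread, int cpu)
>   {
>           cpu_set_t cpuset;
>
>           CPU_ZERO(&cpuset);
>           CPU_SET(cpu, &cpuset);
>           /* pin the given thread to exactly one CPU */
>           if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset)) {
>                   fprintf(stderr, "setting affinity to CPU #%d failed\n",
>                           cpu);
>                   exit(1);
>           }
>   }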
>
>>
>> In certain cases, I think people may want more advanced binding
>> scenarios, e.g., for hyperthreading, binding a consumer and a producer
>> to the same core or to different cores, etc. One possibility is to
>> introduce a -c option similar to taskset's. If -c is not supplied, you
>> keep the current default; otherwise, use the -c list.
>>
>
> Well, taskset's job is simpler: it takes a list of CPUs for a single
> PID, if I understand correctly. Here we have many threads, each of which
> might have a different CPU or even several CPUs. But I agree that for
> some benchmarks it's going to be critical to control this precisely.
> Here's how I'm going to allow the most flexibility without too much
> complexity.
>
> --prod-affinity 1,2,10-16,100 -- will specify a set of CPUs for
> producers. The first producer will use the CPU with the lowest index from
> that list, the second will take the second lowest, and so on. If fewer
> CPUs are provided than necessary, it's an error. If more, it's fine.
>
> Then I will add an independent --cons-affinity parameter, which will do
> the same for consumer threads.
>
> Having two independent lists will allow testing scenarios where we
> want producers and consumers to fight for the same CPU.
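>
> A sketch of how such a list could be consumed (illustrative only; the
> helper name is made up): parse the taskset-style string and hand the
> idx-th CPU to the idx-th thread:
>
>   #include <stdio.h>
>   #include <string.h>
>
>   /* return the idx-th CPU from a "1,2,10-16,100"-style list, or -1 if
>    * the list is malformed or has fewer CPUs than requested
>    */
>   static int cpu_for_thread(const char *list, int idx)
>   {
>           int from, to, cpu;
>
>           while (*list) {
>                   if (sscanf(list, "%d-%d", &from, &to) != 2) {
>                           if (sscanf(list, "%d", &from) != 1)
>                                   return -1; /* malformed entry */
>                           to = from; /* single-CPU entry */
>                   }
>                   for (cpu = from; cpu <= to; cpu++)
>                           if (idx-- == 0)
>                                   return cpu;
>                   list = strchr(list, ',');
>                   if (!list)
>                           break;
>                   list++;
>           }
>           return -1; /* fewer CPUs than threads: an error */
>   }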
>
> Does this sound ok?
This sounds good; it's more than what I asked for.
>
>>> + }
>>> +
>>> + printf("Benchmark '%s' started.\n", bench->name);
>>> +}
>>> +
>
> [...]