[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAH0uvogCWPp8n68iCWOWc8xcv+ZSOMEjRVUvYt4ntF2Q5+Tapg@mail.gmail.com>
Date: Fri, 21 Mar 2025 10:19:46 -0700
From: Howard Chu <howardchu95@...il.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>, Ian Rogers <irogers@...gle.com>,
Kan Liang <kan.liang@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org, bpf@...r.kernel.org,
Song Liu <song@...nel.org>
Subject: Re: [PATCH v2] perf trace: Implement syscall summary in BPF
Hello Namhyung,
On Thu, Mar 20, 2025 at 10:54 PM Namhyung Kim <namhyung@...nel.org> wrote:
>
> On Thu, Mar 20, 2025 at 07:35:01PM -0700, Howard Chu wrote:
> > Hello again Namhyung,
> >
> > As funny as it sounds, I have too much homework this week. I had to
> > break the review into two parts. Sorry.
>
> Thanks for taking your time!
>
> >
> > 1) Maybe just '--bpf-summary' instead?
> >
> > First of all, is '-s --bpf-summary' is it ergonomic? Why not drop the
> > -s and just --bpf-summary since the option has 'summary' in its name.
> > Another reason being,
> > sudo ./perf trace -S --bpf-summary --summary-mode=total -- sleep 1
> > sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> > are the same (-S will emit no output to stdout).
>
> Hmm.. it looks like a bug, will take a look. Maybe --bpf-summary is
> redundant, but I think we can make it default later so that -s/-S can
> use BPF without the option excplicitly. Then this option can be used
> to disable it (--no-bpf-summary) for whatever reasons.
>
> >
> > 2) Anomaly observed when playing around
> >
> > sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> > this gave me 10000 events
> >
> > sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> > while this gave me 1000 events
> >
> > I guess it's something to do with the lost events?
>
> No, as you said in the previous message it didn't support process
> targets yet. I plan to disable it without -a for now.
I copied the wrong command in the email, it should be:
there is a difference between
sudo ./perf trace -as --summary-mode=total -- sleep 1
sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1
[sudo] password for howard:
Summary of events:
total, 15354 events
perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
Summary of events:
total, 1319 events
without the --bpf-summary perf trace gave more events, and it ran slower
as for 'I plan to disable it without -a for now.' I think this makes sense.
>
> >
> > 3) Wrong stddev values
> > Please compare these two outputs
> >
> > perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1
> >
> > Summary of events:
> >
> > total, 11290 events
> >
> > syscall calls errors total min avg
> > max stddev
> > (msec) (msec) (msec)
> > (msec) (%)
> > --------------- -------- ------ -------- --------- ---------
> > --------- ------
> > mq_open 214 71 16073.976 0.000 75.112
> > 250.120 9.91%
> > futex 1296 195 11592.060 0.000 8.944
> > 907.590 13.59%
> > epoll_wait 479 0 4262.456 0.000 8.899
> > 496.568 20.34%
> > poll 241 0 2545.090 0.000 10.561
> > 607.894 33.33%
> > ppoll 330 0 1713.676 0.000 5.193
> > 410.143 26.45%
> > migrate_pages 45 0 1031.915 0.000 22.931
> > 147.830 20.70%
> > clock_nanosleep 2 0 1000.106 0.000 500.053
> > 1000.106 100.00%
> > swapoff 340 0 909.827 0.000 2.676
> > 50.117 22.76%
> > pselect6 5 0 604.816 0.000 120.963
> > 604.808 100.00%
> > readlinkat 26 3 501.205 0.000 19.277
> > 499.998 99.75%
> >
> > perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> >
> > Summary of events:
> >
> > total, 880 events
> >
> > syscall calls errors total min avg
> > max stddev
> > (msec) (msec) (msec)
> > (msec) (%)
> > --------------- -------- ------ -------- --------- ---------
> > --------- ------
> > futex 219 46 2326.400 0.001 10.623
> > 243.028 337.77%
> > mq_open 19 8 2001.347 0.003 105.334
> > 250.356 117.26%
> > poll 6 1 1002.512 0.002 167.085
> > 1002.496 223.60%
> > clock_nanosleep 1 0 1000.147 1000.147 1000.147
> > 1000.147 0.00%
> > swapoff 43 0 953.251 0.001 22.169
> > 50.390 112.37%
> > migrate_pages 43 0 933.727 0.004 21.715
> > 49.149 106.68%
> > ppoll 32 0 838.035 0.002 26.189
> > 331.222 252.10%
> > epoll_pwait 5 0 499.578 0.001 99.916
> > 499.565 199.99%
> > nanosleep 1 0 10.149 10.149 10.149
> > 10.149 0.00%
> > epoll_wait 10 0 3.449 0.003 0.345
> > 0.815 88.02%
> > readlinkat 25 3 1.424 0.006 0.057
> > 0.080 41.76%
> > recvmsg 61 0 1.326 0.016 0.022
> > 0.052 21.71%
> > execve 6 5 1.100 0.002 0.183
> > 1.078 218.21%
> >
> > I would say stddev here is a little off. The reason is:
> >
> > On Mon, Mar 17, 2025 at 11:08 AM Namhyung Kim <namhyung@...nel.org> wrote:
> > >
> > > When -s/--summary option is used, it doesn't need (augmented) arguments
> > > of syscalls. Let's skip the augmentation and load another small BPF
> > > program to collect the statistics in the kernel instead of copying the
> > > data to the ring-buffer to calculate the stats in userspace. This will
> > > be much more light-weight than the existing approach and remove any lost
> > > events.
> > >
> > > Let's add a new option --bpf-summary to control this behavior. I cannot
> > > make it default because there's no way to get e_machine in the BPF which
> > > is needed for detecting different ABIs like 32-bit compat mode.
> > >
> > > No functional changes intended except for no more LOST events. :)
> > >
> > > $ sudo perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> > >
> > > Summary of events:
> > >
> > > total, 2824 events
> > >
> > > syscall calls errors total min avg max stddev
> > > (msec) (msec) (msec) (msec) (%)
> > > --------------- -------- ------ -------- --------- --------- --------- ------
> > > futex 372 18 4373.773 0.000 11.757 997.715 660.42%
> > > poll 241 0 2757.963 0.000 11.444 997.758 580.34%
> > > epoll_wait 161 0 2460.854 0.000 15.285 325.189 260.73%
> > > ppoll 19 0 1298.652 0.000 68.350 667.172 281.46%
> > > clock_nanosleep 1 0 1000.093 0.000 1000.093 1000.093 0.00%
> > > epoll_pwait 16 0 192.787 0.000 12.049 173.994 348.73%
> > > nanosleep 6 0 50.926 0.000 8.488 10.210 43.96%
> > > ...
> > >
> > > Cc: Howard Chu <howardchu95@...il.com>
> > > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > > ---
> > > v2)
> > > * rebased on top of Ian's e_machine changes
> > > * add --bpf-summary option
> > > * support per-thread summary
> > > * add stddev calculation (Howard)
> > <SNIP>
> > > +static double rel_stddev(struct syscall_stats *stat)
> > > +{
> > > + double variance, average;
> > > +
> > > + if (stat->count < 2)
> > > + return 0;
> > > +
> > > + average = (double)stat->total_time / stat->count;
> > > +
> > > + variance = stat->squared_sum;
> > > + variance -= (stat->total_time * stat->total_time) / stat->count;
> > > + variance /= stat->count;
> >
> > isn't it 'variance /= stat->count - 1' because we used Bessel's
> > correction? (Link:
> > https://en.wikipedia.org/wiki/Bessel%27s_correction), that is to use n
> > - 1 instead of n, this is what's done in stat.c.
> >
> > * (\Sum n_i^2) - ((\Sum n_i)^2)/n
> > * s^2 = -------------------------------
> > * n - 1
> >
> > and the lines down here are unfortunately incorrect
> > + variance = stat->squared_sum;
> > + variance -= (stat->total_time * stat->total_time) / stat->count;
> > + variance /= stat->count;
> > +
> > + return 100 * sqrt(variance) / average;
> >
> > variance /= stat->count - 1; will get you variance, but I think we
> > need variance mean.
> > Link: https://en.wikipedia.org/wiki/Standard_deviation#Relationship_between_standard_deviation_and_mean
> >
> > it holds that:
> > variance(mean) = variance / N
> >
> > so you are losing a '/ stat->count'
>
> You're right, thanks for pointing that out.
>
> >
> > And with all due respect, although it makes total sense in
> > engineering, mathematically, I find variance = stat->squared_sum,
> > variance -= ... these accumulated calculations on variable 'variance'
> > a little weird... because readers may find difficult to determine at
> > which point it becomes the actual 'variance'
> >
> > with clarity in mind:
> >
> > diff --git a/tools/perf/util/bpf-trace-summary.c
> > b/tools/perf/util/bpf-trace-summary.c
> > index 5ae9feca244d..a435b4037082 100644
> > --- a/tools/perf/util/bpf-trace-summary.c
> > +++ b/tools/perf/util/bpf-trace-summary.c
> > @@ -62,18 +62,18 @@ struct syscall_node {
> >
> > static double rel_stddev(struct syscall_stats *stat)
> > {
> > - double variance, average;
> > + double variance, average, squared_total;
> >
> > if (stat->count < 2)
> > return 0;
> >
> > average = (double)stat->total_time / stat->count;
> >
> > - variance = stat->squared_sum;
> > - variance -= (stat->total_time * stat->total_time) / stat->count;
> > - variance /= stat->count;
> > + squared_total = stat->total_time * stat->total_time;
> > + variance = (stat->squared_sum - squared_total / stat->count) /
> > (stat->count - 1);
> > + stddev_mean = sqrt(variance / stat->count);
> >
> > - return 100 * sqrt(variance) / average;
> > + return 100 * stddev_mean / average;
> > }
>
> Can it be like this?
>
> diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c
> index a91d42447e850a59..c897fb017914960c 100644
> --- a/tools/perf/util/bpf-trace-summary.c
> +++ b/tools/perf/util/bpf-trace-summary.c
> @@ -71,9 +71,9 @@ static double rel_stddev(struct syscall_stats *stat)
>
> variance = stat->squared_sum;
> variance -= (stat->total_time * stat->total_time) / stat->count;
> - variance /= stat->count;
> + variance /= stat->count - 1;
>
> - return 100 * sqrt(variance) / average;
> + return 100 * sqrt(variance / stat->count) / average;
> }
Of course. Then the variable 'variance' would mean 'variance of mean'.
>
> struct syscall_data {
>
> >
> > btw I haven't checked the legal range for stddev_mean, so I can be wrong.
> > <SNIP>
> > > +static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key,
> > > + struct syscall_stats *map_data)
> > > +{
> > > + struct syscall_data *data;
> > > + struct syscall_stats *stat;
> > > +
> > > + if (!hashmap__find(hash, map_key, &data)) {
> > > + data = zalloc(sizeof(*data));
> > > + if (data == NULL)
> > > + return -ENOMEM;
> > > +
> > > + data->nodes = zalloc(sizeof(*data->nodes));
> > > + if (data->nodes == NULL) {
> > > + free(data);
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + data->nr_nodes = 1;
> > > + data->key = map_key->nr;
> > > + data->nodes->syscall_nr = data->key;
> > Wow, aggressive. I guess you want it to behave like a single value
> > when it is SYSCALL_AGGR_CPU, and an array when it is
> > SYSCALL_AGGR_THREAD. Do you mind adding a comment about it?
> >
> > so it's
> >
> > (cpu, syscall_nr) -> data -> {node}
> > (tid, syscall_nr) -> data -> [node1, node2, node3]
>
> Right, will add comments.
>
> >
> >
> > > +
> > > + if (hashmap__add(hash, data->key, data) < 0) {
> > > + free(data->nodes);
> > > + free(data);
> > > + return -ENOMEM;
> > > + }
> > > + }
> > > +
> > > + /* update total stats for this syscall */
> > > + data->nr_events += map_data->count;
> > > + data->total_time += map_data->total_time;
> > > +
> > > + /* This is sum of the same syscall from different CPUs */
> > > + stat = &data->nodes->stats;
> > > +
> > > + stat->total_time += map_data->total_time;
> > > + stat->squared_sum += map_data->squared_sum;
> > > + stat->count += map_data->count;
> > > + stat->error += map_data->error;
> > > +
> > > + if (stat->max_time < map_data->max_time)
> > > + stat->max_time = map_data->max_time;
> > > + if (stat->min_time > map_data->min_time)
> > > + stat->min_time = map_data->min_time;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int print_total_stats(struct syscall_data **data, int nr_data, FILE *fp)
> > > +{
> > > + int printed = 0;
> > > + int nr_events = 0;
> > > +
> > > + for (int i = 0; i < nr_data; i++)
> > > + nr_events += data[i]->nr_events;
> > > +
> > > + printed += fprintf(fp, " total, %d events\n\n", nr_events);
> > > +
> > > + printed += fprintf(fp, " syscall calls errors total min avg max stddev\n");
> > > + printed += fprintf(fp, " (msec) (msec) (msec) (msec) (%%)\n");
> > > + printed += fprintf(fp, " --------------- -------- ------ -------- --------- --------- --------- ------\n");
> > > +
> > > + for (int i = 0; i < nr_data; i++)
> > > + printed += print_common_stats(data[i], fp);
> > > +
> > > + printed += fprintf(fp, "\n\n");
> > > + return printed;
> > > +}
> > > +
> > > +int trace_print_bpf_summary(FILE *fp)
> > > +{
> > > + struct bpf_map *map = skel->maps.syscall_stats_map;
> > > + struct syscall_key *prev_key, key;
> > > + struct syscall_data **data = NULL;
> > > + struct hashmap schash;
> > > + struct hashmap_entry *entry;
> > > + int nr_data = 0;
> > > + int printed = 0;
> > > + int i;
> > > + size_t bkt;
> > > +
> > > + hashmap__init(&schash, sc_node_hash, sc_node_equal, /*ctx=*/NULL);
> > > +
> > > + printed = fprintf(fp, "\n Summary of events:\n\n");
> > > +
> > > + /* get stats from the bpf map */
> > > + prev_key = NULL;
> > > + while (!bpf_map__get_next_key(map, prev_key, &key, sizeof(key))) {
> > > + struct syscall_stats stat;
> > > +
> > > + if (!bpf_map__lookup_elem(map, &key, sizeof(key), &stat, sizeof(stat), 0)) {
> > > + if (skel->rodata->aggr_mode == SYSCALL_AGGR_THREAD)
> > > + update_thread_stats(&schash, &key, &stat);
> > > + else
> > > + update_total_stats(&schash, &key, &stat);
> > > + }
> > > +
> > > + prev_key = &key;
> > > + }
> > > +
> > > + nr_data = hashmap__size(&schash);
> > > + data = calloc(nr_data, sizeof(*data));
> > > + if (data == NULL)
> > > + goto out;
> > > +
> > > + i = 0;
> > > + hashmap__for_each_entry(&schash, entry, bkt)
> > > + data[i++] = entry->pvalue;
> > > +
> > > + qsort(data, nr_data, sizeof(*data), datacmp);
> >
> > Here syscall_data is sorted for AGGR_THREAD and AGGR_CPU, meaning the
> > thread who has the higher total syscall period will be printed first.
> > This is an awesome side effect but it is not the behavior of 'sudo
> > ./perf trace -as -- sleep 1' without the --bpf-summary option. If it
> > is not too trivial, maybe consider documenting this behavior? But it
> > may be too verbose so Idk.
>
> The original behavior is random ordering and now it's ordered by total
> time. But still we can think it's randomly ordered. I believe we need
> to maintain strict ordering by total time and then fixed the existing
> code. Once that happen I can add the documentation.
ofc.
>
> Thanks for your careful review!
You are welcome.
> Namhyung
Thanks,
Howard
Powered by blists - more mailing lists