[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z9z-_ftY23KcZ6c1@google.com>
Date: Thu, 20 Mar 2025 22:54:05 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Howard Chu <howardchu95@...il.com>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
Ian Rogers <irogers@...gle.com>,
Kan Liang <kan.liang@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org, bpf@...r.kernel.org,
Song Liu <song@...nel.org>
Subject: Re: [PATCH v2] perf trace: Implement syscall summary in BPF
On Thu, Mar 20, 2025 at 07:35:01PM -0700, Howard Chu wrote:
> Hello again Namhyung,
>
> As funny as it sounds, I have too much homework this week. I had to
> break the review into two parts. Sorry.
Thanks for taking your time!
>
> 1) Maybe just '--bpf-summary' instead?
>
> First of all, is '-s --bpf-summary' is it ergonomic? Why not drop the
> -s and just --bpf-summary since the option has 'summary' in its name.
> Another reason being,
> sudo ./perf trace -S --bpf-summary --summary-mode=total -- sleep 1
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> are the same (-S will emit no output to stdout).
Hmm.. it looks like a bug, will take a look. Maybe --bpf-summary is
redundant, but I think we can make it default later so that -s/-S can
use BPF without the option excplicitly. Then this option can be used
to disable it (--no-bpf-summary) for whatever reasons.
>
> 2) Anomaly observed when playing around
>
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> this gave me 10000 events
>
> sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> while this gave me 1000 events
>
> I guess it's something to do with the lost events?
No, as you said in the previous message it didn't support process
targets yet. I plan to disable it without -a for now.
>
> 3) Wrong stddev values
> Please compare these two outputs
>
> perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1
>
> Summary of events:
>
> total, 11290 events
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> mq_open 214 71 16073.976 0.000 75.112
> 250.120 9.91%
> futex 1296 195 11592.060 0.000 8.944
> 907.590 13.59%
> epoll_wait 479 0 4262.456 0.000 8.899
> 496.568 20.34%
> poll 241 0 2545.090 0.000 10.561
> 607.894 33.33%
> ppoll 330 0 1713.676 0.000 5.193
> 410.143 26.45%
> migrate_pages 45 0 1031.915 0.000 22.931
> 147.830 20.70%
> clock_nanosleep 2 0 1000.106 0.000 500.053
> 1000.106 100.00%
> swapoff 340 0 909.827 0.000 2.676
> 50.117 22.76%
> pselect6 5 0 604.816 0.000 120.963
> 604.808 100.00%
> readlinkat 26 3 501.205 0.000 19.277
> 499.998 99.75%
>
> perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
>
> Summary of events:
>
> total, 880 events
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> futex 219 46 2326.400 0.001 10.623
> 243.028 337.77%
> mq_open 19 8 2001.347 0.003 105.334
> 250.356 117.26%
> poll 6 1 1002.512 0.002 167.085
> 1002.496 223.60%
> clock_nanosleep 1 0 1000.147 1000.147 1000.147
> 1000.147 0.00%
> swapoff 43 0 953.251 0.001 22.169
> 50.390 112.37%
> migrate_pages 43 0 933.727 0.004 21.715
> 49.149 106.68%
> ppoll 32 0 838.035 0.002 26.189
> 331.222 252.10%
> epoll_pwait 5 0 499.578 0.001 99.916
> 499.565 199.99%
> nanosleep 1 0 10.149 10.149 10.149
> 10.149 0.00%
> epoll_wait 10 0 3.449 0.003 0.345
> 0.815 88.02%
> readlinkat 25 3 1.424 0.006 0.057
> 0.080 41.76%
> recvmsg 61 0 1.326 0.016 0.022
> 0.052 21.71%
> execve 6 5 1.100 0.002 0.183
> 1.078 218.21%
>
> I would say stddev here is a little off. The reason is:
>
> On Mon, Mar 17, 2025 at 11:08 AM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > When -s/--summary option is used, it doesn't need (augmented) arguments
> > of syscalls. Let's skip the augmentation and load another small BPF
> > program to collect the statistics in the kernel instead of copying the
> > data to the ring-buffer to calculate the stats in userspace. This will
> > be much more light-weight than the existing approach and remove any lost
> > events.
> >
> > Let's add a new option --bpf-summary to control this behavior. I cannot
> > make it default because there's no way to get e_machine in the BPF which
> > is needed for detecting different ABIs like 32-bit compat mode.
> >
> > No functional changes intended except for no more LOST events. :)
> >
> > $ sudo perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> >
> > Summary of events:
> >
> > total, 2824 events
> >
> > syscall calls errors total min avg max stddev
> > (msec) (msec) (msec) (msec) (%)
> > --------------- -------- ------ -------- --------- --------- --------- ------
> > futex 372 18 4373.773 0.000 11.757 997.715 660.42%
> > poll 241 0 2757.963 0.000 11.444 997.758 580.34%
> > epoll_wait 161 0 2460.854 0.000 15.285 325.189 260.73%
> > ppoll 19 0 1298.652 0.000 68.350 667.172 281.46%
> > clock_nanosleep 1 0 1000.093 0.000 1000.093 1000.093 0.00%
> > epoll_pwait 16 0 192.787 0.000 12.049 173.994 348.73%
> > nanosleep 6 0 50.926 0.000 8.488 10.210 43.96%
> > ...
> >
> > Cc: Howard Chu <howardchu95@...il.com>
> > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > ---
> > v2)
> > * rebased on top of Ian's e_machine changes
> > * add --bpf-summary option
> > * support per-thread summary
> > * add stddev calculation (Howard)
> <SNIP>
> > +static double rel_stddev(struct syscall_stats *stat)
> > +{
> > + double variance, average;
> > +
> > + if (stat->count < 2)
> > + return 0;
> > +
> > + average = (double)stat->total_time / stat->count;
> > +
> > + variance = stat->squared_sum;
> > + variance -= (stat->total_time * stat->total_time) / stat->count;
> > + variance /= stat->count;
>
> isn't it 'variance /= stat->count - 1' because we used Bessel's
> correction? (Link:
> https://en.wikipedia.org/wiki/Bessel%27s_correction), that is to use n
> - 1 instead of n, this is what's done in stat.c.
>
> * (\Sum n_i^2) - ((\Sum n_i)^2)/n
> * s^2 = -------------------------------
> * n - 1
>
> and the lines down here are unfortunately incorrect
> + variance = stat->squared_sum;
> + variance -= (stat->total_time * stat->total_time) / stat->count;
> + variance /= stat->count;
> +
> + return 100 * sqrt(variance) / average;
>
> variance /= stat->count - 1; will get you variance, but I think we
> need variance mean.
> Link: https://en.wikipedia.org/wiki/Standard_deviation#Relationship_between_standard_deviation_and_mean
>
> it holds that:
> variance(mean) = variance / N
>
> so you are losing a '/ stat->count'
You're right, thanks for pointing that out.
>
> And with all due respect, although it makes total sense in
> engineering, mathematically, I find variance = stat->squared_sum,
> variance -= ... these accumulated calculations on variable 'variance'
> a little weird... because readers may find difficult to determine at
> which point it becomes the actual 'variance'
>
> with clarity in mind:
>
> diff --git a/tools/perf/util/bpf-trace-summary.c
> b/tools/perf/util/bpf-trace-summary.c
> index 5ae9feca244d..a435b4037082 100644
> --- a/tools/perf/util/bpf-trace-summary.c
> +++ b/tools/perf/util/bpf-trace-summary.c
> @@ -62,18 +62,18 @@ struct syscall_node {
>
> static double rel_stddev(struct syscall_stats *stat)
> {
> - double variance, average;
> + double variance, average, squared_total;
>
> if (stat->count < 2)
> return 0;
>
> average = (double)stat->total_time / stat->count;
>
> - variance = stat->squared_sum;
> - variance -= (stat->total_time * stat->total_time) / stat->count;
> - variance /= stat->count;
> + squared_total = stat->total_time * stat->total_time;
> + variance = (stat->squared_sum - squared_total / stat->count) /
> (stat->count - 1);
> + stddev_mean = sqrt(variance / stat->count);
>
> - return 100 * sqrt(variance) / average;
> + return 100 * stddev_mean / average;
> }
Can it be like this?
diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c
index a91d42447e850a59..c897fb017914960c 100644
--- a/tools/perf/util/bpf-trace-summary.c
+++ b/tools/perf/util/bpf-trace-summary.c
@@ -71,9 +71,9 @@ static double rel_stddev(struct syscall_stats *stat)
variance = stat->squared_sum;
variance -= (stat->total_time * stat->total_time) / stat->count;
- variance /= stat->count;
+ variance /= stat->count - 1;
- return 100 * sqrt(variance) / average;
+ return 100 * sqrt(variance / stat->count) / average;
}
struct syscall_data {
>
> btw I haven't checked the legal range for stddev_mean, so I can be wrong.
> <SNIP>
> > +static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key,
> > + struct syscall_stats *map_data)
> > +{
> > + struct syscall_data *data;
> > + struct syscall_stats *stat;
> > +
> > + if (!hashmap__find(hash, map_key, &data)) {
> > + data = zalloc(sizeof(*data));
> > + if (data == NULL)
> > + return -ENOMEM;
> > +
> > + data->nodes = zalloc(sizeof(*data->nodes));
> > + if (data->nodes == NULL) {
> > + free(data);
> > + return -ENOMEM;
> > + }
> > +
> > + data->nr_nodes = 1;
> > + data->key = map_key->nr;
> > + data->nodes->syscall_nr = data->key;
> Wow, aggressive. I guess you want it to behave like a single value
> when it is SYSCALL_AGGR_CPU, and an array when it is
> SYSCALL_AGGR_THREAD. Do you mind adding a comment about it?
>
> so it's
>
> (cpu, syscall_nr) -> data -> {node}
> (tid, syscall_nr) -> data -> [node1, node2, node3]
Right, will add comments.
>
>
> > +
> > + if (hashmap__add(hash, data->key, data) < 0) {
> > + free(data->nodes);
> > + free(data);
> > + return -ENOMEM;
> > + }
> > + }
> > +
> > + /* update total stats for this syscall */
> > + data->nr_events += map_data->count;
> > + data->total_time += map_data->total_time;
> > +
> > + /* This is sum of the same syscall from different CPUs */
> > + stat = &data->nodes->stats;
> > +
> > + stat->total_time += map_data->total_time;
> > + stat->squared_sum += map_data->squared_sum;
> > + stat->count += map_data->count;
> > + stat->error += map_data->error;
> > +
> > + if (stat->max_time < map_data->max_time)
> > + stat->max_time = map_data->max_time;
> > + if (stat->min_time > map_data->min_time)
> > + stat->min_time = map_data->min_time;
> > +
> > + return 0;
> > +}
> > +
> > +static int print_total_stats(struct syscall_data **data, int nr_data, FILE *fp)
> > +{
> > + int printed = 0;
> > + int nr_events = 0;
> > +
> > + for (int i = 0; i < nr_data; i++)
> > + nr_events += data[i]->nr_events;
> > +
> > + printed += fprintf(fp, " total, %d events\n\n", nr_events);
> > +
> > + printed += fprintf(fp, " syscall calls errors total min avg max stddev\n");
> > + printed += fprintf(fp, " (msec) (msec) (msec) (msec) (%%)\n");
> > + printed += fprintf(fp, " --------------- -------- ------ -------- --------- --------- --------- ------\n");
> > +
> > + for (int i = 0; i < nr_data; i++)
> > + printed += print_common_stats(data[i], fp);
> > +
> > + printed += fprintf(fp, "\n\n");
> > + return printed;
> > +}
> > +
> > +int trace_print_bpf_summary(FILE *fp)
> > +{
> > + struct bpf_map *map = skel->maps.syscall_stats_map;
> > + struct syscall_key *prev_key, key;
> > + struct syscall_data **data = NULL;
> > + struct hashmap schash;
> > + struct hashmap_entry *entry;
> > + int nr_data = 0;
> > + int printed = 0;
> > + int i;
> > + size_t bkt;
> > +
> > + hashmap__init(&schash, sc_node_hash, sc_node_equal, /*ctx=*/NULL);
> > +
> > + printed = fprintf(fp, "\n Summary of events:\n\n");
> > +
> > + /* get stats from the bpf map */
> > + prev_key = NULL;
> > + while (!bpf_map__get_next_key(map, prev_key, &key, sizeof(key))) {
> > + struct syscall_stats stat;
> > +
> > + if (!bpf_map__lookup_elem(map, &key, sizeof(key), &stat, sizeof(stat), 0)) {
> > + if (skel->rodata->aggr_mode == SYSCALL_AGGR_THREAD)
> > + update_thread_stats(&schash, &key, &stat);
> > + else
> > + update_total_stats(&schash, &key, &stat);
> > + }
> > +
> > + prev_key = &key;
> > + }
> > +
> > + nr_data = hashmap__size(&schash);
> > + data = calloc(nr_data, sizeof(*data));
> > + if (data == NULL)
> > + goto out;
> > +
> > + i = 0;
> > + hashmap__for_each_entry(&schash, entry, bkt)
> > + data[i++] = entry->pvalue;
> > +
> > + qsort(data, nr_data, sizeof(*data), datacmp);
>
> Here syscall_data is sorted for AGGR_THREAD and AGGR_CPU, meaning the
> thread who has the higher total syscall period will be printed first.
> This is an awesome side effect but it is not the behavior of 'sudo
> ./perf trace -as -- sleep 1' without the --bpf-summary option. If it
> is not too trivial, maybe consider documenting this behavior? But it
> may be too verbose so Idk.
The original behavior is random ordering and now it's ordered by total
time. But still we can think it's randomly ordered. I believe we need
to maintain strict ordering by total time and then fixed the existing
code. Once that happen I can add the documentation.
Thanks for your careful review!
Namhyung
>
> sudo ./perf trace -as -- sleep 1
>
> ClientModuleMan (4956), 16 events, 0.1%
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> futex 8 4 750.234 0.000 93.779
> 250.105 48.79%
>
>
> CHTTPClientThre (15720), 16 events, 0.1%
>
> syscall calls errors total min avg
> max stddev
> (msec) (msec) (msec)
> (msec) (%)
> --------------- -------- ------ -------- --------- ---------
> --------- ------
> futex 8 4 1000.425 0.000 125.053
> 750.317 75.59%
>
> The order is random for the command above.
>
> Thanks,
> Howard
Powered by blists - more mailing lists