linux-kernel - Re: [PATCH v2] perf trace: Implement syscall summary in BPF

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z9z-_ftY23KcZ6c1@google.com>
Date: Thu, 20 Mar 2025 22:54:05 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Howard Chu <howardchu95@...il.com>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
	Ian Rogers <irogers@...gle.com>,
	Kan Liang <kan.liang@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
	Adrian Hunter <adrian.hunter@...el.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	linux-perf-users@...r.kernel.org, bpf@...r.kernel.org,
	Song Liu <song@...nel.org>
Subject: Re: [PATCH v2] perf trace: Implement syscall summary in BPF

On Thu, Mar 20, 2025 at 07:35:01PM -0700, Howard Chu wrote:
> Hello again Namhyung,
> 
> As funny as it sounds, I have too much homework this week. I had to
> break the review into two parts. Sorry.

Thanks for taking your time!

> 
> 1) Maybe just '--bpf-summary' instead?
> 
> First of all, is '-s --bpf-summary' is it ergonomic? Why not drop the
> -s and just --bpf-summary since the option has 'summary' in its name.
> Another reason being,
> sudo ./perf trace -S --bpf-summary --summary-mode=total -- sleep 1
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> are the same (-S will emit no output to stdout).

Hmm.. it looks like a bug, will take a look.  Maybe --bpf-summary is
redundant, but I think we can make it default later so that -s/-S can
use BPF without the option excplicitly.  Then this option can be used
to disable it (--no-bpf-summary) for whatever reasons.

> 
> 2) Anomaly observed when playing around
> 
> sudo ./perf trace -s --bpf-summary --summary-mode=total -- sleep 1
> this gave me 10000 events
> 
> sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> while this gave me 1000 events
> 
> I guess it's something to do with the lost events?

No, as you said in the previous message it didn't support process
targets yet.  I plan to disable it without -a for now.

> 
> 3) Wrong stddev values
> Please compare these two outputs
> 
> perf $ sudo ./perf trace -as --summary-mode=total -- sleep 1
> 
>  Summary of events:
> 
>  total, 11290 events
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    mq_open              214     71 16073.976     0.000    75.112
> 250.120      9.91%
>    futex               1296    195 11592.060     0.000     8.944
> 907.590     13.59%
>    epoll_wait           479      0  4262.456     0.000     8.899
> 496.568     20.34%
>    poll                 241      0  2545.090     0.000    10.561
> 607.894     33.33%
>    ppoll                330      0  1713.676     0.000     5.193
> 410.143     26.45%
>    migrate_pages         45      0  1031.915     0.000    22.931
> 147.830     20.70%
>    clock_nanosleep        2      0  1000.106     0.000   500.053
> 1000.106    100.00%
>    swapoff              340      0   909.827     0.000     2.676
> 50.117     22.76%
>    pselect6               5      0   604.816     0.000   120.963
> 604.808    100.00%
>    readlinkat            26      3   501.205     0.000    19.277
> 499.998     99.75%
> 
> perf $ sudo ./perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> 
>  Summary of events:
> 
>  total, 880 events
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    futex                219     46  2326.400     0.001    10.623
> 243.028    337.77%
>    mq_open               19      8  2001.347     0.003   105.334
> 250.356    117.26%
>    poll                   6      1  1002.512     0.002   167.085
> 1002.496    223.60%
>    clock_nanosleep        1      0  1000.147  1000.147  1000.147
> 1000.147      0.00%
>    swapoff               43      0   953.251     0.001    22.169
> 50.390    112.37%
>    migrate_pages         43      0   933.727     0.004    21.715
> 49.149    106.68%
>    ppoll                 32      0   838.035     0.002    26.189
> 331.222    252.10%
>    epoll_pwait            5      0   499.578     0.001    99.916
> 499.565    199.99%
>    nanosleep              1      0    10.149    10.149    10.149
> 10.149      0.00%
>    epoll_wait            10      0     3.449     0.003     0.345
> 0.815     88.02%
>    readlinkat            25      3     1.424     0.006     0.057
> 0.080     41.76%
>    recvmsg               61      0     1.326     0.016     0.022
> 0.052     21.71%
>    execve                 6      5     1.100     0.002     0.183
> 1.078    218.21%
> 
> I would say stddev here is a little off. The reason is:
> 
> On Mon, Mar 17, 2025 at 11:08 AM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > When -s/--summary option is used, it doesn't need (augmented) arguments
> > of syscalls.  Let's skip the augmentation and load another small BPF
> > program to collect the statistics in the kernel instead of copying the
> > data to the ring-buffer to calculate the stats in userspace.  This will
> > be much more light-weight than the existing approach and remove any lost
> > events.
> >
> > Let's add a new option --bpf-summary to control this behavior.  I cannot
> > make it default because there's no way to get e_machine in the BPF which
> > is needed for detecting different ABIs like 32-bit compat mode.
> >
> > No functional changes intended except for no more LOST events. :)
> >
> >   $ sudo perf trace -as --bpf-summary --summary-mode=total -- sleep 1
> >
> >    Summary of events:
> >
> >    total, 2824 events
> >
> >      syscall            calls  errors  total       min       avg       max       stddev
> >                                        (msec)    (msec)    (msec)    (msec)        (%)
> >      --------------- --------  ------ -------- --------- --------- ---------     ------
> >      futex                372     18  4373.773     0.000    11.757   997.715    660.42%
> >      poll                 241      0  2757.963     0.000    11.444   997.758    580.34%
> >      epoll_wait           161      0  2460.854     0.000    15.285   325.189    260.73%
> >      ppoll                 19      0  1298.652     0.000    68.350   667.172    281.46%
> >      clock_nanosleep        1      0  1000.093     0.000  1000.093  1000.093      0.00%
> >      epoll_pwait           16      0   192.787     0.000    12.049   173.994    348.73%
> >      nanosleep              6      0    50.926     0.000     8.488    10.210     43.96%
> >      ...
> >
> > Cc: Howard Chu <howardchu95@...il.com>
> > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > ---
> > v2)
> >  * rebased on top of Ian's e_machine changes
> >  * add --bpf-summary option
> >  * support per-thread summary
> >  * add stddev calculation  (Howard)
> <SNIP>
> > +static double rel_stddev(struct syscall_stats *stat)
> > +{
> > +       double variance, average;
> > +
> > +       if (stat->count < 2)
> > +               return 0;
> > +
> > +       average = (double)stat->total_time / stat->count;
> > +
> > +       variance = stat->squared_sum;
> > +       variance -= (stat->total_time * stat->total_time) / stat->count;
> > +       variance /= stat->count;
> 
> isn't it 'variance /= stat->count - 1' because we used Bessel's
> correction? (Link:
> https://en.wikipedia.org/wiki/Bessel%27s_correction), that is to use n
> - 1 instead of n, this is what's done in stat.c.
> 
>  *       (\Sum n_i^2) - ((\Sum n_i)^2)/n
>  * s^2 = -------------------------------
>  *                  n - 1
> 
> and the lines down here are unfortunately incorrect
> + variance = stat->squared_sum;
> + variance -= (stat->total_time * stat->total_time) / stat->count;
> + variance /= stat->count;
> +
> + return 100 * sqrt(variance) / average;
> 
> variance /= stat->count - 1; will get you variance, but I think we
> need variance mean.
> Link: https://en.wikipedia.org/wiki/Standard_deviation#Relationship_between_standard_deviation_and_mean
> 
> it holds that:
> variance(mean) = variance / N
> 
> so you are losing a '/ stat->count'

You're right, thanks for pointing that out.

> 
> And with all due respect, although it makes total sense in
> engineering, mathematically, I find variance = stat->squared_sum,
> variance -= ... these accumulated calculations on variable 'variance'
> a little weird... because readers may find difficult to determine at
> which point it becomes the actual 'variance'
> 
> with clarity in mind:
> 
> diff --git a/tools/perf/util/bpf-trace-summary.c
> b/tools/perf/util/bpf-trace-summary.c
> index 5ae9feca244d..a435b4037082 100644
> --- a/tools/perf/util/bpf-trace-summary.c
> +++ b/tools/perf/util/bpf-trace-summary.c
> @@ -62,18 +62,18 @@ struct syscall_node {
> 
>  static double rel_stddev(struct syscall_stats *stat)
>  {
> -       double variance, average;
> +       double variance, average, squared_total;
> 
>         if (stat->count < 2)
>                 return 0;
> 
>         average = (double)stat->total_time / stat->count;
> 
> -       variance = stat->squared_sum;
> -       variance -= (stat->total_time * stat->total_time) / stat->count;
> -       variance /= stat->count;
> +       squared_total = stat->total_time * stat->total_time;
> +       variance = (stat->squared_sum - squared_total / stat->count) /
> (stat->count - 1);
> +       stddev_mean = sqrt(variance / stat->count);
> 
> -       return 100 * sqrt(variance) / average;
> +       return 100 * stddev_mean / average;
>  }

Can it be like this?

diff --git a/tools/perf/util/bpf-trace-summary.c b/tools/perf/util/bpf-trace-summary.c
index a91d42447e850a59..c897fb017914960c 100644
--- a/tools/perf/util/bpf-trace-summary.c
+++ b/tools/perf/util/bpf-trace-summary.c
@@ -71,9 +71,9 @@ static double rel_stddev(struct syscall_stats *stat)
 
        variance = stat->squared_sum;
        variance -= (stat->total_time * stat->total_time) / stat->count;
-       variance /= stat->count;
+       variance /= stat->count - 1;
 
-       return 100 * sqrt(variance) / average;
+       return 100 * sqrt(variance / stat->count) / average;
 }
 
 struct syscall_data {

> 
> btw I haven't checked the legal range for stddev_mean, so I can be wrong.
> <SNIP>
> > +static int update_total_stats(struct hashmap *hash, struct syscall_key *map_key,
> > +                             struct syscall_stats *map_data)
> > +{
> > +       struct syscall_data *data;
> > +       struct syscall_stats *stat;
> > +
> > +       if (!hashmap__find(hash, map_key, &data)) {
> > +               data = zalloc(sizeof(*data));
> > +               if (data == NULL)
> > +                       return -ENOMEM;
> > +
> > +               data->nodes = zalloc(sizeof(*data->nodes));
> > +               if (data->nodes == NULL) {
> > +                       free(data);
> > +                       return -ENOMEM;
> > +               }
> > +
> > +               data->nr_nodes = 1;
> > +               data->key = map_key->nr;
> > +               data->nodes->syscall_nr = data->key;
> Wow, aggressive. I guess you want it to behave like a single value
> when it is SYSCALL_AGGR_CPU, and an array when it is
> SYSCALL_AGGR_THREAD. Do you mind adding a comment about it?
> 
> so it's
> 
> (cpu, syscall_nr) -> data -> {node}
> (tid, syscall_nr) -> data -> [node1, node2, node3]

Right, will add comments.

> 
> 
> > +
> > +               if (hashmap__add(hash, data->key, data) < 0) {
> > +                       free(data->nodes);
> > +                       free(data);
> > +                       return -ENOMEM;
> > +               }
> > +       }
> > +
> > +       /* update total stats for this syscall */
> > +       data->nr_events += map_data->count;
> > +       data->total_time += map_data->total_time;
> > +
> > +       /* This is sum of the same syscall from different CPUs */
> > +       stat = &data->nodes->stats;
> > +
> > +       stat->total_time += map_data->total_time;
> > +       stat->squared_sum += map_data->squared_sum;
> > +       stat->count += map_data->count;
> > +       stat->error += map_data->error;
> > +
> > +       if (stat->max_time < map_data->max_time)
> > +               stat->max_time = map_data->max_time;
> > +       if (stat->min_time > map_data->min_time)
> > +               stat->min_time = map_data->min_time;
> > +
> > +       return 0;
> > +}
> > +
> > +static int print_total_stats(struct syscall_data **data, int nr_data, FILE *fp)
> > +{
> > +       int printed = 0;
> > +       int nr_events = 0;
> > +
> > +       for (int i = 0; i < nr_data; i++)
> > +               nr_events += data[i]->nr_events;
> > +
> > +       printed += fprintf(fp, " total, %d events\n\n", nr_events);
> > +
> > +       printed += fprintf(fp, "   syscall            calls  errors  total       min       avg       max       stddev\n");
> > +       printed += fprintf(fp, "                                     (msec)    (msec)    (msec)    (msec)        (%%)\n");
> > +       printed += fprintf(fp, "   --------------- --------  ------ -------- --------- --------- ---------     ------\n");
> > +
> > +       for (int i = 0; i < nr_data; i++)
> > +               printed += print_common_stats(data[i], fp);
> > +
> > +       printed += fprintf(fp, "\n\n");
> > +       return printed;
> > +}
> > +
> > +int trace_print_bpf_summary(FILE *fp)
> > +{
> > +       struct bpf_map *map = skel->maps.syscall_stats_map;
> > +       struct syscall_key *prev_key, key;
> > +       struct syscall_data **data = NULL;
> > +       struct hashmap schash;
> > +       struct hashmap_entry *entry;
> > +       int nr_data = 0;
> > +       int printed = 0;
> > +       int i;
> > +       size_t bkt;
> > +
> > +       hashmap__init(&schash, sc_node_hash, sc_node_equal, /*ctx=*/NULL);
> > +
> > +       printed = fprintf(fp, "\n Summary of events:\n\n");
> > +
> > +       /* get stats from the bpf map */
> > +       prev_key = NULL;
> > +       while (!bpf_map__get_next_key(map, prev_key, &key, sizeof(key))) {
> > +               struct syscall_stats stat;
> > +
> > +               if (!bpf_map__lookup_elem(map, &key, sizeof(key), &stat, sizeof(stat), 0)) {
> > +                       if (skel->rodata->aggr_mode == SYSCALL_AGGR_THREAD)
> > +                               update_thread_stats(&schash, &key, &stat);
> > +                       else
> > +                               update_total_stats(&schash, &key, &stat);
> > +               }
> > +
> > +               prev_key = &key;
> > +       }
> > +
> > +       nr_data = hashmap__size(&schash);
> > +       data = calloc(nr_data, sizeof(*data));
> > +       if (data == NULL)
> > +               goto out;
> > +
> > +       i = 0;
> > +       hashmap__for_each_entry(&schash, entry, bkt)
> > +               data[i++] = entry->pvalue;
> > +
> > +       qsort(data, nr_data, sizeof(*data), datacmp);
> 
> Here syscall_data is sorted for AGGR_THREAD and AGGR_CPU, meaning the
> thread who has the higher total syscall period will be printed first.
> This is an awesome side effect but it is not the behavior of 'sudo
> ./perf trace -as -- sleep 1' without the --bpf-summary option. If it
> is not too trivial, maybe consider documenting this behavior? But it
> may be too verbose so Idk.

The original behavior is random ordering and now it's ordered by total
time.  But still we can think it's randomly ordered.  I believe we need
to maintain strict ordering by total time and then fixed the existing
code.  Once that happen I can add the documentation.

Thanks for your careful review!
Namhyung

> 
> sudo ./perf trace -as -- sleep 1
> 
>  ClientModuleMan (4956), 16 events, 0.1%
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    futex                  8      4   750.234     0.000    93.779
> 250.105     48.79%
> 
> 
>  CHTTPClientThre (15720), 16 events, 0.1%
> 
>    syscall            calls  errors  total       min       avg
> max       stddev
>                                      (msec)    (msec)    (msec)
> (msec)        (%)
>    --------------- --------  ------ -------- --------- ---------
> ---------     ------
>    futex                  8      4  1000.425     0.000   125.053
> 750.317     75.59%
> 
> The order is random for the command above.
> 
> Thanks,
> Howard