[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM9d7ch7GidVscUxQh+CZRuo1U1KiDiFiuOZN3FnqbrPDbaVNw@mail.gmail.com>
Date: Thu, 30 May 2024 09:44:02 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Arnaldo Carvalho de Melo <acme@...hat.com>, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org
Subject: Re: [PATCH v1] tools api io: Move filling the io buffer to its own function
On Thu, May 23, 2024 at 9:47 PM Ian Rogers <irogers@...gle.com> wrote:
>
> On Thu, May 23, 2024 at 4:25 PM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > On Sun, May 19, 2024 at 11:17 AM Ian Rogers <irogers@...gle.com> wrote:
> > >
> > > In general a read fills 4kb so filling the buffer is a 1 in 4096
> > > operation, move it out of the io__get_char function to avoid some
> > > checking overhead and to better hint the function is good to inline.
> > >
> > > For perf's IO intensive internal (non-rigorous) benchmarks there's a
> > > near 8% improvement to kallsyms-parsing with a default build.
> >
> > Oh, is it just from removing the io->eof check? Otherwise I don't
> > see any difference.
>
> I was hoping that by moving the code out-of-line then the hot part of
> the function could be inlined into things like reading the hex
> character. I didn't see that, presumably there are too many callers
> and so that made the inliner think sharing would be best even though
> the hot code is a compare, pointer dereference and an increment. I
> tried forcing inlining but it didn't seem to win over just having the
> code out-of-line. The eof check should be very well predicted. The
> out-of-line code was branched over forward, which should be 1
> mispredict but again not a huge deal. I didn't do a more thorough
> analysis as I still prefer to have the cold code out-of-line.
Ok, I don't see much difference with this change. But the change itself
looks fine.
Thanks,
Namhyung
Before:
# Running internals/synthesize benchmark...
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 237.274 usec (+- 0.066 usec)
Average num. events: 24.000 (+- 0.000)
Average time per event 9.886 usec
Average data synthesis took: 241.126 usec (+- 0.087 usec)
Average num. events: 128.000 (+- 0.000)
Average time per event 1.884 usec
# Running internals/kallsyms-parse benchmark...
Average kallsyms__parse took: 184.374 ms (+- 0.022 ms)
# Running internals/inject-build-id benchmark...
Average build-id injection took: 20.096 msec (+- 0.115 msec)
Average time per event: 1.970 usec (+- 0.011 usec)
Average memory usage: 11574 KB (+- 29 KB)
Average build-id-all injection took: 13.477 msec (+- 0.100 msec)
Average time per event: 1.321 usec (+- 0.010 usec)
Average memory usage: 11160 KB (+- 0 KB)
# Running internals/evlist-open-close benchmark...
Number of cpus: 64
Number of threads: 1
Number of events: 1 (64 fds)
Number of iterations: 100
evlist__open: Permission denied
# Running internals/pmu-scan benchmark...
Computing performance of sysfs PMU event scan for 100 times
Average core PMU scanning took: 135.880 usec (+- 0.249 usec)
Average PMU scanning took: 816.745 usec (+- 48.293 usec)
After:
# Running internals/synthesize benchmark...
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 235.711 usec (+- 0.067 usec)
Average num. events: 24.000 (+- 0.000)
Average time per event 9.821 usec
Average data synthesis took: 240.992 usec (+- 0.058 usec)
Average num. events: 128.000 (+- 0.000)
Average time per event 1.883 usec
# Running internals/kallsyms-parse benchmark...
Average kallsyms__parse took: 179.664 ms (+- 0.043 ms)
# Running internals/inject-build-id benchmark...
Average build-id injection took: 19.901 msec (+- 0.117 msec)
Average time per event: 1.951 usec (+- 0.011 usec)
Average memory usage: 12163 KB (+- 10 KB)
Average build-id-all injection took: 13.627 msec (+- 0.086 msec)
Average time per event: 1.336 usec (+- 0.008 usec)
Average memory usage: 11160 KB (+- 0 KB)
# Running internals/evlist-open-close benchmark...
Number of cpus: 64
Number of threads: 1
Number of events: 1 (64 fds)
Number of iterations: 100
evlist__open: Permission denied
# Running internals/pmu-scan benchmark...
Computing performance of sysfs PMU event scan for 100 times
Average core PMU scanning took: 136.540 usec (+- 0.294 usec)
Average PMU scanning took: 819.415 usec (+- 48.437 usec)
Powered by blists - more mailing lists