Date: Thu, 30 May 2024 10:01:47 -0700
From: Ian Rogers <irogers@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Arnaldo Carvalho de Melo <acme@...hat.com>, linux-kernel@...r.kernel.org, 
	linux-perf-users@...r.kernel.org
Subject: Re: [PATCH v1] tools api io: Move filling the io buffer to its own function

On Thu, May 30, 2024 at 9:44 AM Namhyung Kim <namhyung@...nel.org> wrote:
>
> On Thu, May 23, 2024 at 9:47 PM Ian Rogers <irogers@...gle.com> wrote:
> >
> > On Thu, May 23, 2024 at 4:25 PM Namhyung Kim <namhyung@...nel.org> wrote:
> > >
> > > On Sun, May 19, 2024 at 11:17 AM Ian Rogers <irogers@...gle.com> wrote:
> > > >
> > > > In general a read fills 4kb, so filling the buffer is a 1-in-4096
> > > > operation; move it out of the io__get_char function to avoid some
> > > > checking overhead and to better hint that the function is good to
> > > > inline.
> > > >
> > > > For perf's IO-intensive internal (non-rigorous) benchmarks there's a
> > > > near 8% improvement to kallsyms-parsing with a default build.
> > >
> > > Oh, is it just from removing the io->eof check?  Otherwise I don't
> > > see any difference.
> >
> > I was hoping that by moving the code out-of-line, the hot part of
> > the function could be inlined into things like reading the hex
> > character. I didn't see that; presumably there are too many callers,
> > and so the inliner decided that sharing would be best even though
> > the hot code is a compare, a pointer dereference and an increment. I
> > tried forcing inlining, but it didn't seem to win over just having
> > the code out-of-line. The eof check should be very well predicted.
> > The out-of-line code was branched over forward, which should be one
> > mispredict but again not a huge deal. I didn't do a more thorough
> > analysis as I still prefer to have the cold code out-of-line.
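
The "forcing inlining" experiment mentioned above can be expressed with
plain GCC/Clang attributes; the snippet below is a made-up, self-contained
toy (cold_refill, hot_get and their placeholder bodies are hypothetical,
not taken from the patch) that just shows the two annotations one would
flip while measuring:

  #include <stdio.h>

  /* Cold path: deliberately kept out of line even under optimization. */
  __attribute__((noinline))
  static int cold_refill(int *remaining)
  {
          *remaining = 4096;   /* pretend a fresh 4kb buffer was read */
          return 0;
  }

  /* Hot path: forced inline at every call site for the measurement. */
  __attribute__((always_inline))
  static inline int hot_get(int *remaining)
  {
          if (*remaining == 0 && cold_refill(remaining) < 0)
                  return -1;
          (*remaining)--;
          return 'x';          /* placeholder character */
  }

  int main(void)
  {
          int remaining = 0;

          printf("%c\n", hot_get(&remaining));
          return 0;
  }

As noted above, forcing the hot path inline did not beat simply leaving
the decision to the compiler, so annotations like these are mainly useful
as a measurement knob.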
>
> Ok, I don't see much difference with this change.  But the change itself
> looks fine.
>
> Thanks,
> Namhyung
>
>
> Before:
>
> # Running internals/synthesize benchmark...
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 237.274 usec (+- 0.066 usec)
>   Average num. events: 24.000 (+- 0.000)
>   Average time per event 9.886 usec
>   Average data synthesis took: 241.126 usec (+- 0.087 usec)
>   Average num. events: 128.000 (+- 0.000)
>   Average time per event 1.884 usec
>
> # Running internals/kallsyms-parse benchmark...
>   Average kallsyms__parse took: 184.374 ms (+- 0.022 ms)
>
> # Running internals/inject-build-id benchmark...
>   Average build-id injection took: 20.096 msec (+- 0.115 msec)
>   Average time per event: 1.970 usec (+- 0.011 usec)
>   Average memory usage: 11574 KB (+- 29 KB)
>   Average build-id-all injection took: 13.477 msec (+- 0.100 msec)
>   Average time per event: 1.321 usec (+- 0.010 usec)
>   Average memory usage: 11160 KB (+- 0 KB)
>
> # Running internals/evlist-open-close benchmark...
>   Number of cpus:    64
>   Number of threads:    1
>   Number of events:    1 (64 fds)
>   Number of iterations:    100
> evlist__open: Permission denied
>
> # Running internals/pmu-scan benchmark...
> Computing performance of sysfs PMU event scan for 100 times
>   Average core PMU scanning took: 135.880 usec (+- 0.249 usec)
>   Average PMU scanning took: 816.745 usec (+- 48.293 usec)
>
>
> After:
>
> # Running internals/synthesize benchmark...
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 235.711 usec (+- 0.067 usec)
>   Average num. events: 24.000 (+- 0.000)
>   Average time per event 9.821 usec
>   Average data synthesis took: 240.992 usec (+- 0.058 usec)
>   Average num. events: 128.000 (+- 0.000)
>   Average time per event 1.883 usec
>
> # Running internals/kallsyms-parse benchmark...
>   Average kallsyms__parse took: 179.664 ms (+- 0.043 ms)

So this is still about a 2% improvement. I was building without options
like DEBUG=1 enabled, so perhaps that explains the difference. Anyway, if
you're more comfortable with a commit message saying a 2% performance win,
I don't mind it being updated, or I can upload a v2. It's likely this is
being over-thought given the size of the change :-)

Thanks,
Ian

> # Running internals/inject-build-id benchmark...
>   Average build-id injection took: 19.901 msec (+- 0.117 msec)
>   Average time per event: 1.951 usec (+- 0.011 usec)
>   Average memory usage: 12163 KB (+- 10 KB)
>   Average build-id-all injection took: 13.627 msec (+- 0.086 msec)
>   Average time per event: 1.336 usec (+- 0.008 usec)
>   Average memory usage: 11160 KB (+- 0 KB)
>
> # Running internals/evlist-open-close benchmark...
>   Number of cpus:    64
>   Number of threads:    1
>   Number of events:    1 (64 fds)
>   Number of iterations:    100
> evlist__open: Permission denied
>
> # Running internals/pmu-scan benchmark...
> Computing performance of sysfs PMU event scan for 100 times
>   Average core PMU scanning took: 136.540 usec (+- 0.294 usec)
>   Average PMU scanning took: 819.415 usec (+- 48.437 usec)
