linux-kernel - Re: [PATCH v1] tools api io: Move filling the io buffer to its own function

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAM9d7ch7GidVscUxQh+CZRuo1U1KiDiFiuOZN3FnqbrPDbaVNw@mail.gmail.com>
Date: Thu, 30 May 2024 09:44:02 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Ian Rogers <irogers@...gle.com>
Cc: Arnaldo Carvalho de Melo <acme@...hat.com>, linux-kernel@...r.kernel.org, 
	linux-perf-users@...r.kernel.org
Subject: Re: [PATCH v1] tools api io: Move filling the io buffer to its own function

On Thu, May 23, 2024 at 9:47 PM Ian Rogers <irogers@...gle.com> wrote:
>
> On Thu, May 23, 2024 at 4:25 PM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > On Sun, May 19, 2024 at 11:17 AM Ian Rogers <irogers@...gle.com> wrote:
> > >
> > > In general a read fills 4kb so filling the buffer is a 1 in 4096
> > > operation, move it out of the io__get_char function to avoid some
> > > checking overhead and to better hint the function is good to inline.
> > >
> > > For perf's IO intensive internal (non-rigorous) benchmarks there's a
> > > near 8% improvement to kallsyms-parsing with a default build.
> >
> > Oh, is it just from removing the io->eof check?  Otherwise I don't
> > see any difference.
>
> I was hoping that by moving the code out-of-line then the hot part of
> the function could be inlined into things like reading the hex
> character. I didn't see that, presumably there are too many callers
> and so that made the inliner think sharing would be best even though
> the hot code is a compare, pointer dereference and an increment. I
> tried forcing inlining but it didn't seem to win over just having the
> code out-of-line. The eof check should be very well predicted. The
> out-of-line code was branched over forward, which should be 1
> mispredict but again not a huge deal. I didn't do a more thorough
> analysis as I still prefer to have the cold code out-of-line.

Ok, I don't see much difference with this change.  But the change itself
looks fine.

Thanks,
Namhyung

Before:

# Running internals/synthesize benchmark...
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 237.274 usec (+- 0.066 usec)
  Average num. events: 24.000 (+- 0.000)
  Average time per event 9.886 usec
  Average data synthesis took: 241.126 usec (+- 0.087 usec)
  Average num. events: 128.000 (+- 0.000)
  Average time per event 1.884 usec

# Running internals/kallsyms-parse benchmark...
  Average kallsyms__parse took: 184.374 ms (+- 0.022 ms)

# Running internals/inject-build-id benchmark...
  Average build-id injection took: 20.096 msec (+- 0.115 msec)
  Average time per event: 1.970 usec (+- 0.011 usec)
  Average memory usage: 11574 KB (+- 29 KB)
  Average build-id-all injection took: 13.477 msec (+- 0.100 msec)
  Average time per event: 1.321 usec (+- 0.010 usec)
  Average memory usage: 11160 KB (+- 0 KB)

# Running internals/evlist-open-close benchmark...
  Number of cpus:    64
  Number of threads:    1
  Number of events:    1 (64 fds)
  Number of iterations:    100
evlist__open: Permission denied

# Running internals/pmu-scan benchmark...
Computing performance of sysfs PMU event scan for 100 times
  Average core PMU scanning took: 135.880 usec (+- 0.249 usec)
  Average PMU scanning took: 816.745 usec (+- 48.293 usec)

After:

# Running internals/synthesize benchmark...
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 235.711 usec (+- 0.067 usec)
  Average num. events: 24.000 (+- 0.000)
  Average time per event 9.821 usec
  Average data synthesis took: 240.992 usec (+- 0.058 usec)
  Average num. events: 128.000 (+- 0.000)
  Average time per event 1.883 usec

# Running internals/kallsyms-parse benchmark...
  Average kallsyms__parse took: 179.664 ms (+- 0.043 ms)

# Running internals/inject-build-id benchmark...
  Average build-id injection took: 19.901 msec (+- 0.117 msec)
  Average time per event: 1.951 usec (+- 0.011 usec)
  Average memory usage: 12163 KB (+- 10 KB)
  Average build-id-all injection took: 13.627 msec (+- 0.086 msec)
  Average time per event: 1.336 usec (+- 0.008 usec)
  Average memory usage: 11160 KB (+- 0 KB)

# Running internals/evlist-open-close benchmark...
  Number of cpus:    64
  Number of threads:    1
  Number of events:    1 (64 fds)
  Number of iterations:    100
evlist__open: Permission denied

# Running internals/pmu-scan benchmark...
Computing performance of sysfs PMU event scan for 100 times
  Average core PMU scanning took: 136.540 usec (+- 0.294 usec)
  Average PMU scanning took: 819.415 usec (+- 48.437 usec)