Message-ID: <20170602104048.jkkzssljsompjdwy@suse.de>
Date: Fri, 2 Jun 2017 11:40:48 +0100
From: Mel Gorman <mgorman@...e.de>
To: Jiri Slaby <jslaby@...e.cz>
Cc: Ingo Molnar <mingo@...nel.org>,
Josh Poimboeuf <jpoimboe@...hat.com>, x86@...nel.org,
linux-kernel@...r.kernel.org, live-patching@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andy Lutomirski <luto@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC PATCH 00/10] x86: undwarf unwinder
On Thu, Jun 01, 2017 at 04:08:25PM +0200, Jiri Slaby wrote:
> Ccing Mel who did proper measurements and can hopefully comment on his
> results.
>
> On 06/01/2017, 03:50 PM, Ingo Molnar wrote:
> > That's not what I meant! The speedup comes from (hopefully) being able to disable
> > CONFIG_FRAME_POINTER, which:
> >
> > - creates simpler/faster function prologues and epilogues - no managing of RBP
> > needed
> >
> > - gives one more generic purpose register to work from. This matters less on
> > 64-bit kernels but it's a small effect.
> >
> > I've seen numbers of 1-2% of instruction count reduction in common kernel
> > workloads, which would be pretty significant on well cached workloads.
>
I didn't preserve the data involved, but in a variety of workloads including
netperf, a page allocator microbenchmark, pgbench and sqlite, enabling
framepointer introduced overhead of around the 5-10% mark. According
to an internal report I gave at the time, hackbench-thread-sockets was
around the 5% mark and a perf run showed "3.49% more cache misses with
framepointer enabled and 6.59% more cycles". Additional notes I made at
the time, again without the original data, are:
---8<---
It looks like a small amount of overhead added everywhere and the size of
the vmlinux files supports that
   text    data      bss      dec     hex filename
8143072 6480614 11153408 25777094 18953c6 vmlinux/decker/vmlinux-4.8.0-disable-fp
8396698 6480614 11153408 26030720 18d3280 vmlinux/decker/vmlinux-4.8.0-enable-fp
I also took a closer look at the pagealloc microbenchmarks because they
rely on so few functions. Profiles were not always captured due to the
short-lived nature of some of the tests so I looked at batches of 16384
allocations/frees of order-0 pages. Overall it showed a 4.46% decline with
framepointer enabled and profiling, along with 3.89% more cycles and 24.94%
more cache misses.
As before, the framepointer cache miss overhead is not that obvious as
the bulk of samples take place elsewhere -- in this case, in checking
whether pages are buddies when merging. It's slightly clearer in
__rmqueue where 17.9% of cache misses are in the function entry point
with framepointer enabled vs 4.04% with framepointer disabled.
---8<---
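As a rough illustration only (not part of the original measurements), the
size difference between the two vmlinux builds quoted in the notes above can
be worked out with a few lines of Python. The byte counts are copied from the
size output; the dictionary layout and the growth() helper are hypothetical
names chosen for this sketch.

```python
# Section sizes in bytes, copied from the size output quoted in the notes.
# The keys and the helper below are illustrative names, not kernel tooling.
SIZES = {
    "disable-fp": {"text": 8143072, "data": 6480614, "bss": 11153408},
    "enable-fp": {"text": 8396698, "data": 6480614, "bss": 11153408},
}


def growth(section):
    """Return (extra bytes, percent increase) from disable-fp to enable-fp."""
    base = SIZES["disable-fp"][section]
    with_fp = SIZES["enable-fp"][section]
    delta = with_fp - base
    return delta, 100.0 * delta / base


for section in ("text", "data", "bss"):
    delta, pct = growth(section)
    print(f"{section:>4}: +{delta} bytes ({pct:.2f}%)")
```

With these figures the text section grows by 253626 bytes, about 3.1%, while
data and bss are unchanged, which fits the description of a small amount of
overhead added everywhere rather than in one hot spot.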
Granted, the check was done back in 4.8, but I've no reason to believe
that 4.12 is any different, and enabling framepointer does impose a quite
substantial hit on workloads that spend a lot of time in the kernel.
--
Mel Gorman
SUSE Labs