Message-ID: <73c1f2160812270916i1f43cbeave955a434b17491fb@mail.gmail.com>
Date:	Sat, 27 Dec 2008 12:16:29 -0500
From:	"Brian Gerst" <brgerst@...il.com>
To:	"Ingo Molnar" <mingo@...e.hu>
Cc:	"Christoph Lameter" <cl@...ux-foundation.org>,
	"Thomas Gleixner" <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	"Jeremy Fitzhardinge" <jeremy@...p.org>,
	"Alexander van Heukelum" <heukelum@...lshack.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] x86-64: Convert the PDA to percpu.

On Sat, Dec 27, 2008 at 10:53 AM, Ingo Molnar <mingo@...e.hu> wrote:
>
> * Brian Gerst <brgerst@...il.com> wrote:
>
>> On Sat, Dec 27, 2008 at 5:41 AM, Ingo Molnar <mingo@...e.hu> wrote:
>> >
>> > (Cc:-ed a few more people who might be interested in this)
>> >
>> > * Brian Gerst <brgerst@...il.com> wrote:
>> >
>> >> This patch makes the PDA a normal per-cpu variable, allowing the
>> >> removal of the special allocator code.  %gs still points to the
>> >> base of the PDA.
>> >>
>> >> Tested on a dual-core AMD64 system.
>> >>
>> >> Signed-off-by: Brian Gerst <brgerst@...il.com>
>> >> ---
>> >>  arch/x86/include/asm/pda.h     |    3 --
>> >>  arch/x86/include/asm/percpu.h  |    3 --
>> >>  arch/x86/include/asm/setup.h   |    1 -
>> >>  arch/x86/kernel/cpu/common.c   |    6 ++--
>> >>  arch/x86/kernel/dumpstack_64.c |    8 ++--
>> >>  arch/x86/kernel/head64.c       |   23 +------------
>> >>  arch/x86/kernel/irq.c          |    2 +-
>> >>  arch/x86/kernel/nmi.c          |    2 +-
>> >>  arch/x86/kernel/setup_percpu.c |   70 ++++++++--------------------------------
>> >>  arch/x86/kernel/smpboot.c      |   58 +--------------------------------
>> >>  arch/x86/xen/enlighten.c       |    2 +-
>> >>  arch/x86/xen/smp.c             |   12 +------
>> >>  12 files changed, 27 insertions(+), 163 deletions(-)
>> >
>> > the simplification factor is significant. I'm wondering, have you measured
>> > the code size impact of this on say the defconfig x86 kernel? That will
>> > generally tell us how much worse optimizations the compiler does under
>> > this scheme.
>> >
>> >        Ingo
>> >
>>
>> Patch #1 by itself doesn't change how the PDA is accessed, only how it
>> is allocated.  The text size goes down significantly with patch #1,
>> but data goes up.  Changing the PDA to cacheline-aligned (1a) brings
>> it back in line.
>>
>>    text          data     bss     dec     hex filename
>> 7033648       1754476  758508 9546632  91ab88 vmlinux.0   (vanilla 2.6.28)
>> 7029563       1758428  758508 9546499  91ab03 vmlinux.1   (with patch #1)
>> 7029563       1754460  758508 9542531  919b83 vmlinux.1a  (with patch #1 cache align)
>> 7036694       1758428  758508 9553630  91c6de vmlinux.3   (with all three patches)
>>
>> I think the first patch (with the alignment fix) is a clear win.  As for
>> the other patches, they add about 8 bytes per use of a PDA variable.
>> cpu_number is used 903 times in this compile, so this is likely the most
>> extreme example.  I have an idea to optimize this particular case
>> further that I'd like to look at which would lessen the impact.
>
> curious, what idea is that?
>
>        Ingo
>

Something like this:
+#define raw_smp_processor_id()                                         \
+({                                                                     \
+       extern int gsoff__cpu_number;                                   \
+       int cpu;                                                        \
+       __asm__("movl %%gs:%1, %0" : "=r" (cpu)                         \
+                                  : "m" (gsoff__cpu_number));          \
+       cpu;                                                            \
+})

And add this to vmlinux_64.lds.S:
+#define GSOFF(x) gsoff__##x = per_cpu__##x - per_cpu__pda
+  GSOFF(cpu_number);

The trick is that the linker can evaluate an expression involving
multiple symbols, but only at the final link.  The problem with this
approach is that it only works for a limited set of symbols, each
needing its own GSOFF() entry in the linker script.  There isn't a
simple solution that covers all per-cpu variables.  Some
post-processing would have to be done, similar to kallsyms.
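
To make the offset arithmetic concrete, here is a rough user-space
sketch of what GSOFF() computes.  It is only an illustration, not the
kernel code: the struct layout, the demo_* names and the array of
per-cpu areas are all made up for the example, and a plain pointer
stands in for the %gs base.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NR_CPUS 4  /* hypothetical CPU count for the demo */

/* Hypothetical per-cpu area layout; not the real kernel layout. */
struct demo_percpu_area {
	char pad[64];     /* unrelated per-cpu data ahead of the PDA */
	char pda[128];    /* stand-in for per_cpu__pda (the %gs base) */
	char other[40];   /* more unrelated per-cpu data */
	int  cpu_number;  /* stand-in for per_cpu__cpu_number */
};

static struct demo_percpu_area demo_areas[DEMO_NR_CPUS];

/*
 * Analogue of GSOFF(cpu_number): address of the variable minus the
 * address of the PDA.  Here the compiler can fold it because both live
 * in one struct; in the kernel only the final link knows both addresses.
 */
static const ptrdiff_t demo_gsoff_cpu_number =
	(ptrdiff_t)offsetof(struct demo_percpu_area, cpu_number) -
	(ptrdiff_t)offsetof(struct demo_percpu_area, pda);

/* Fill in each area's cpu_number, as CPU bring-up would. */
static void demo_init(void)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_areas[cpu].cpu_number = cpu;
}

/* Stand-in for the %gs base of a given CPU: points at that CPU's PDA. */
static char *demo_gs_base(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return (char *)&demo_areas[cpu] +
	       offsetof(struct demo_percpu_area, pda);
}

/* Analogue of "movl %gs:gsoff__cpu_number, %0": one base-relative load. */
static int demo_read_cpu_number(int cpu)
{
	int value;

	memcpy(&value, demo_gs_base(cpu) + demo_gsoff_cpu_number,
	       sizeof(value));
	return value;
}
```

After demo_init(), demo_read_cpu_number(2) reads the value stored in
the third area through the base-plus-offset path.  The point is that
the access needs only the PDA base and one constant, which is what
lets it stay a single %gs-relative instruction.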

Looking some more at the usage statistics of the PDA members, there
are four heavy hitters:
pda->pcurrent (2719)
pda->kernelstack (1055)
pda->cpunumber (933)
pda->data_offset (327)

The rest of the PDA members have an insignificant number of accesses.
I think for now I'll avoid converting the above four fields until an
optimal solution can be agreed on, but the others (primarily the TLB
and irqstat fields) can be converted without bloating the kernel code
a lot.
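
For a rough sense of scale, the sketch below multiplies the usage
counts above by the "about 8 bytes per use" figure quoted earlier in
this message.  It is back-of-the-envelope only: the 8-byte cost is an
approximation, the demo_* names are made up for the example, and the
real growth depends on how the compiler handles each access.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BYTES_PER_USE 8UL  /* approximate cost per access, per the estimate above */

/* One PDA member and the number of accesses counted for it. */
struct demo_field_usage {
	const char *name;
	unsigned long uses;
};

/* The four heavy hitters and their counts, as listed in this message. */
static const struct demo_field_usage demo_heavy_hitters[] = {
	{ "pcurrent",    2719 },
	{ "kernelstack", 1055 },
	{ "cpunumber",    933 },
	{ "data_offset",  327 },
};

/* Estimated text growth in bytes if every listed field were converted. */
static unsigned long demo_estimated_growth(const struct demo_field_usage *fields,
					   size_t count)
{
	unsigned long total = 0;

	assert(fields != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		total += fields[i].uses * DEMO_BYTES_PER_USE;
	return total;
}
```

With these counts the four fields account for 5034 accesses, so the
estimate comes to roughly 40 KB of extra text, which is why they are
left alone for now.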

--
Brian Gerst