Date:	Tue, 21 Feb 2012 18:53:35 -0800
From:	Jason Garrett-Glaser <jason@...4.com>
To:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc:	rostedt@...dmis.org, tglx@...utronix.de, mingo@...hat.com,
	hpa@...or.com, x86@...nel.org, linux-kernel@...r.kernel.org,
	xen-devel@...ts.xensource.com
Subject: Re: [PATCH] x86 fixes for 3.3 impacting distros (v1).

On Fri, Feb 10, 2012 at 7:34 AM, Konrad Rzeszutek Wilk
<konrad.wilk@...cle.com> wrote:
> The attached patch fixes RH BZ #742032, #787403, and #745574
> and touches the x86 subsystem.
>
> The patch description gives a very good overview of the problem and
> one solution. The one solution it chooses is not the most architecturally
> sound but it does not cause performance degradation. If this is your
> first time reading this, please read the patch first and then come back to
> this cover letter, as I have some perf numbers and a more detailed explanation here.
>
> A bit of overview of the __page_change_attr_set_clr:
>
> Its purpose is to change page attributes from one type to another.
> It is important to understand that entry into that code
> (__page_change_attr_set_clr) is guarded by the cpa_lock spinlock, which makes
> that whole code single-threaded.
>
> Albeit it seems that if debug mode is turned on, it can run in parallel. The
> effect of using the posted patch is that __page_change_attr_set_clr() will be
> affected when we change caching attributes on 4KB pages and/or the NX flag.
>
> The execution of __page_change_attr_set_clr is concentrated in
> (looked for ioremap_* and set_pages_*):
>  - during bootup ("Write protecting the ..")
>  - suspend/resume and graphic adapters evicting their buffers from the card
>   to RAM (which is usually done during suspend but can be done via the
>   'evict' attribute in debugfs)
>  - when setting the memory for the cursor (AGP cards using i8xx chipset) -
>   done during bootup and startup of Xserver.
>  - setting up memory for Intel GTT scratch (i9xx) page (done during bootup)
>  - payload (purgatory code) for kexec (done during kexec -l).
>  - ioremap_* during PCI devices load - InfiniBand and video cards like to use
>   ioremap_wc.
>  - Intel, radeon, nouveau running into memory pressure and evicting pages from
>   their GEM/TTM pool (once an hour or so if compiling a lot with only 4GB).
>
> These are the cases I found when running on baremetal (and Xen) using a normal
> Fedora Core 16 distro.
>
> The alternate solution to the problem I am trying to solve, which is much
> more architecturally sound (but has some perf disadvantages), is to wrap
> the pte_flags with a paravirt call everywhere. For that, these two patches:
> http://darnok.org/results/baseline_pte_flags_pte_attrs/0001-x86-paravirt-xen-Introduce-pte_flags.patch
> http://darnok.org/results/baseline_pte_flags_pte_attrs/0002-x86-paravirt-xen-Optimize-pte_flags-by-marking-it-as.patch
>
> make the pte_flags function (after bootup and patching with alternative asm)
> look as so:
>
>   48 89 f8                     mov    %rdi,%rax
>   66 66 66 90                  data32 data32 xchg %ax,%ax
>
> [the 66 66 .. is 'nop']. Looks good, right? Well, it does work very well on Intel
> (used an i3 2100), but on AMD A8-3850 it hits a performance wall - that I found out
> is a result of CONFIG_FUNCTION_TRACER (too many nops??) being compiled in (but the tracer
> is set to the default 'nop'). If I disable that specific config option the numbers
> are the same as the baseline (with CONFIG_FUNCTION_TRACER disabled) on the AMD box.
> Interestingly enough I only see these on AMD machines - not on the Intel ones.

The AMD software optimization manual says that, at least on some
chips, too many prefixes force the instruction decoder into a slow
mode (basically microcoded) where it takes literally dozens of cycles
to decode a single instruction.  I believe more than 2 prefixes will do
this; check the manual itself for specifics.

Jason
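
As a rough illustration of the prefix point above, here is a minimal C sketch
that counts the legacy prefix bytes at the start of an instruction encoding.
It is not kernel code and not from the thread: the helper names and the
ASSUMED_FAST_PREFIX_LIMIT constant are hypothetical, and the limit of 2 simply
mirrors the "more than 2 prefixes" guess in the reply, which has not been
checked against the AMD manual. The two byte sequences are the ones quoted in
the cover letter.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte sequences quoted in the cover letter (pte_flags after patching). */
static const uint8_t insn_mov[]  = { 0x48, 0x89, 0xf8 };       /* mov %rdi,%rax */
static const uint8_t insn_nop4[] = { 0x66, 0x66, 0x66, 0x90 }; /* data32 data32 xchg %ax,%ax */

/*
 * ASSUMPTION: taken from "more than 2 prefixes" in the reply; the real
 * threshold is chip-specific and should be read from the AMD manual.
 */
#define ASSUMED_FAST_PREFIX_LIMIT 2

/* Return 1 if b is an x86 legacy prefix byte, 0 otherwise. */
static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0x66:                                  /* operand-size override */
    case 0x67:                                  /* address-size override */
    case 0xF0:                                  /* lock */
    case 0xF2:                                  /* repne */
    case 0xF3:                                  /* rep / repe */
    case 0x26: case 0x2E: case 0x36: case 0x3E: /* es, cs, ss, ds overrides */
    case 0x64: case 0x65:                       /* fs, gs overrides */
        return 1;
    default:
        return 0;                               /* REX (0x40-0x4f) is not a legacy prefix */
    }
}

/* Count the legacy prefix bytes that lead a single instruction encoding. */
static size_t count_leading_prefixes(const uint8_t *insn, size_t len)
{
    size_t n = 0;

    while (n < len && is_legacy_prefix(insn[n]))
        n++;
    return n;
}

/* Return 1 if the instruction carries more prefixes than the assumed limit. */
static int exceeds_assumed_limit(const uint8_t *insn, size_t len)
{
    return count_leading_prefixes(insn, len) > ASSUMED_FAST_PREFIX_LIMIT;
}
```

Calling count_leading_prefixes(insn_nop4, sizeof insn_nop4) counts the three
0x66 bytes in front of the 0x90 opcode, so the 4-byte nop is over the assumed
limit, while the mov carries only a REX byte and no legacy prefixes.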