linux-kernel - Re: [PATCHv4 14/14] x86/mm: Offset boot-time paging mode switching cost

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170817092150.imggg3pjguagvudd@gmail.com>
Date:   Thu, 17 Aug 2017 11:21:50 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>, x86@...nel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>, Andi Kleen <ak@...ux.intel.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Andy Lutomirski <luto@...capital.net>,
        Michal Hocko <mhocko@...nel.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCHv4 14/14] x86/mm: Offset boot-time paging mode switching
 cost


* Kirill A. Shutemov <kirill.shutemov@...ux.intel.com> wrote:

> By this point we have functioning boot-time switching between 4- and
> 5-level paging mode. But naive approach comes with cost.
> 
> Numbers below are for kernel build, allmodconfig, 5 times.
> 
> CONFIG_X86_5LEVEL=n:
> 
>  Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
> 
>    17308719.892691      task-clock:u (msec)       #   26.772 CPUs utilized            ( +-  0.11% )
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>        331,993,164      page-faults:u             #    0.019 M/sec                    ( +-  0.01% )
> 43,614,978,867,455      cycles:u                  #    2.520 GHz                      ( +-  0.01% )
> 39,371,534,575,126      stalled-cycles-frontend:u #   90.27% frontend cycles idle     ( +-  0.09% )
> 28,363,350,152,428      instructions:u            #    0.65  insn per cycle
>                                                   #    1.39  stalled cycles per insn  ( +-  0.00% )
>  6,316,784,066,413      branches:u                #  364.948 M/sec                    ( +-  0.00% )
>    250,808,144,781      branch-misses:u           #    3.97% of all branches          ( +-  0.01% )
> 
>      646.531974142 seconds time elapsed                                          ( +-  1.15% )
> 
> CONFIG_X86_5LEVEL=y:
> 
>  Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
> 
>    17411536.780625      task-clock:u (msec)       #   26.426 CPUs utilized            ( +-  0.10% )
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>        331,868,663      page-faults:u             #    0.019 M/sec                    ( +-  0.01% )
> 43,865,909,056,301      cycles:u                  #    2.519 GHz                      ( +-  0.01% )
> 39,740,130,365,581      stalled-cycles-frontend:u #   90.59% frontend cycles idle     ( +-  0.05% )
> 28,363,358,997,959      instructions:u            #    0.65  insn per cycle
>                                                   #    1.40  stalled cycles per insn  ( +-  0.00% )
>  6,316,784,937,460      branches:u                #  362.793 M/sec                    ( +-  0.00% )
>    251,531,919,485      branch-misses:u           #    3.98% of all branches          ( +-  0.00% )
> 
>      658.886307752 seconds time elapsed                                          ( +-  0.92% )
> The patch tries to fix the performance regression by using
> 
> !cpu_feature_enabled(X86_FEATURE_LA57) instead of p4d_folded in all hot
> code paths. These will statically patch the target code for additional
> performance.
> 
> Also, I had to re-write number of static inline helpers as macros.
> It was needed to break header dependency loop between cpufeature.h and
> pgtable_types.h.
> 
> CONFIG_X86_5LEVEL=y + the patch:
> 
>  Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
> 
>    17381990.268506      task-clock:u (msec)       #   26.907 CPUs utilized            ( +-  0.19% )
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>        331,862,625      page-faults:u             #    0.019 M/sec                    ( +-  0.01% )
> 43,697,726,320,051      cycles:u                  #    2.514 GHz                      ( +-  0.03% )
> 39,480,408,690,401      stalled-cycles-frontend:u #   90.35% frontend cycles idle     ( +-  0.05% )
> 28,363,394,221,388      instructions:u            #    0.65  insn per cycle
>                                                   #    1.39  stalled cycles per insn  ( +-  0.00% )
>  6,316,794,985,573      branches:u                #  363.410 M/sec                    ( +-  0.00% )
>    251,013,232,547      branch-misses:u           #    3.97% of all branches          ( +-  0.01% )
> 
>      645.991174661 seconds time elapsed                                          ( +-  1.19% )

Ok - these measurements are very nice and address many of my worries about earlier 
parts of the series.

Anyway, please split this patch up some more as well (as any of the optimizations 
could regress by themselves), and my renaming suggestions still stand as well.

> @@ -11,6 +11,11 @@
>  #undef CONFIG_PARAVIRT_SPINLOCKS
>  #undef CONFIG_KASAN
>  
> +#ifdef CONFIG_X86_5LEVEL
> +/* cpu_feature_enabled() cannot be used that early */
> +#define p4d_folded __p4d_folded
> +#endif
> +
>  #include <linux/linkage.h>
>  #include <linux/screen_info.h>
>  #include <linux/elf.h>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 077e8b45784c..702a1feb4991 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -274,15 +274,8 @@ return_from_SYSCALL_64:
>  	 * depending on paging mode) in the address.
>  	 */
>  #ifdef CONFIG_X86_5LEVEL
> -	testl	$1, p4d_folded(%rip)
> -	jnz	1f
> -	shl	$(64 - 57), %rcx
> -	sar	$(64 - 57), %rcx
> -	jmp	2f
> -1:
> -	shl	$(64 - 48), %rcx
> -	sar	$(64 - 48), %rcx
> -2:
> +	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> +		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

Ignore my earlier suggestion to use alternatives, you already implemented it!
This is what I get for replying to a patch series in chronological order. ;-)

I suspect the syscall overhead was the main reason for the performance regression.

Thanks,

	Ingo