Message-ID: <551EC5CD.3050901@redhat.com>
Date:	Fri, 03 Apr 2015 18:54:37 +0200
From:	Denys Vlasenko <dvlasenk@...hat.com>
To:	Ingo Molnar <mingo@...nel.org>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Borislav Petkov <bp@...en8.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Andy Lutomirski <luto@...capital.net>,
	Oleg Nesterov <oleg@...hat.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Alexei Starovoitov <ast@...mgrid.com>,
	Will Drewry <wad@...omium.org>,
	Kees Cook <keescook@...omium.org>, x86@...nel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter

On 04/03/2015 04:03 PM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <dvlasenk@...hat.com> wrote:
> 
>> Interrupt entry points are handled with the following code, where each
>> 32-byte code block contains seven entry points:
>>
>> 		...
>> 		[push][jump 22] // 4 bytes
>> 		[push][jump 18] // 4 bytes
>> 		[push][jump 14] // 4 bytes
>> 		[push][jump 10] // 4 bytes
>> 		[push][jump  6] // 4 bytes
>> 		[push][jump  2] // 4 bytes
>> 		[push][jump common_interrupt][padding] // 8 bytes
>>
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump]
>> 		[push][jump common_interrupt][padding]
>>
>> 		[padding_2]
>> 	common_interrupt:
>>
>> The first six jumps are short (2-byte) jumps to the last jump in the block.
>> The last jump usually has to be the longer, 5-byte form.
>>
>> There are about 30 such blocks.
>>
>> This change uses the fact that the last few of these blocks are close to the
>> "common_interrupt" label, so the last jump there can also be a short one.
>>
>> This allows the last several 32-byte code blocks to contain eight entry points
>> instead of seven, and to have all of these entry points jump directly to
>> common_interrupt, eliminating one branch. They look like this:
>>
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>> 		[push][jump common_interrupt] // 4 bytes
>>
>> This change replaces the ".p2align CONFIG_X86_L1_CACHE_SHIFT" alignment before the
>> dispatch table and before common_interrupt with 32-byte alignment.
>>
>> The first alignment was completely unnecessary:
>> the dispatch table will not benefit from any alignment bigger than 32 bytes
>> (any entry point needs at most 32 bytes of the table to be read by the CPU).
>>
>> The alignment before the common_interrupt label affects the size of the padding
>> (the "[padding_2]" above), and excessive padding reduces the number of blocks
>> which can be optimized. In the .config I tested with, this alignment was 64 bytes,
>> which means two fewer blocks could be optimized. I believe 32-byte alignment
>> should do just fine.
>>
>> The condition which controls the choice between a direct jump to common_interrupt
>> and a jump to the local label,
>>
>>     /* Can we reach common_interrupt with a short jump? */
>>     .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>>
>> was tested to be a pessimistic one: the generated code does not erroneously
>> contain a longer-than-necessary jump even with the maximum possible padding,
>> and even with -1 subtracted from the right-hand side above.
>> This was verified by checking the disassembly.
>>
>> However, with the current value of FIRST_SYSTEM_VECTOR it so happens
>> that the padding before "common_interrupt" is zero bytes.
>>
>>  Disassembly before                           After
>>  ==========================================   ==================================
>> ...                                           ...
>> 840 6a9b           push $0xffffffffffffff9b   82c 6a9d  push $0xffffffffffffff9d
>> 842 eb16           jmp  85a                   82e eb30  jmp  860
>> 844 6a9a           push $0xffffffffffffff9a   830 6a9c  push $0xffffffffffffff9c
>> 846 eb12           jmp  85a                   832 eb2c  jmp  860
>> 848 6a99           push $0xffffffffffffff99   834 6a9b  push $0xffffffffffffff9b
>> 84a eb0e           jmp  85a                   836 eb28  jmp  860
>> 84c 6a98           push $0xffffffffffffff98   838 6a9a  push $0xffffffffffffff9a
>> 84e eb0a           jmp  85a                   83a eb24  jmp  860
>> 850 6a97           push $0xffffffffffffff97   83c 6a99  push $0xffffffffffffff99
>> 852 eb06           jmp  85a                   83e eb20  jmp  860
>> 854 6a96           push $0xffffffffffffff96   840 6a98  push $0xffffffffffffff98
>> 856 eb02           jmp  85a                   842 eb1c  jmp  860
>> 858 6a95           push $0xffffffffffffff95   844 6a97  push $0xffffffffffffff97
>> 85a eb24           jmp  880                   846 eb18  jmp  860
>> 85c 0f1f4000       nopl 0x0(%rax)             848 6a96  push $0xffffffffffffff96
>> 860 6a94           push $0xffffffffffffff94   84a eb14  jmp  860
>> 862 eb0c           jmp  870                   84c 6a95  push $0xffffffffffffff95
>> 864 6a93           push $0xffffffffffffff93   84e eb10  jmp  860
>> 866 eb08           jmp  870                   850 6a94  push $0xffffffffffffff94
>> 868 6a92           push $0xffffffffffffff92   852 eb0c  jmp  860
>> 86a eb04           jmp  870                   854 6a93  push $0xffffffffffffff93
>> 86c 6a91           push $0xffffffffffffff91   856 eb08  jmp  860
>> 86e eb00           jmp  870                   858 6a92  push $0xffffffffffffff92
>> 870 eb0e           jmp  880                   85a eb04  jmp  860
>> 872 66666666662e0f data32 data32 data32 data32  85c 6a91  push $0xffffffffffffff91
>> 879 1f840000000000 nopw %cs:0x0(%rax,%rax,1)  85e eb00  jmp  860
>> 880 <common_interrupt>:                       860 <common_interrupt>:
>>
>>  Sizes:
>>    text	   data	    bss	    dec	    hex	filename
>>   12578	      0	      0	  12578	   3122	entry_64.o.before
>>   12546	      0	      0	  12546	   3102	entry_64.o
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@...hat.com>
>> CC: Linus Torvalds <torvalds@...ux-foundation.org>
>> CC: Steven Rostedt <rostedt@...dmis.org>
>> CC: Ingo Molnar <mingo@...nel.org>
>> CC: Borislav Petkov <bp@...en8.de>
>> CC: "H. Peter Anvin" <hpa@...or.com>
>> CC: Andy Lutomirski <luto@...capital.net>
>> CC: Oleg Nesterov <oleg@...hat.com>
>> CC: Frederic Weisbecker <fweisbec@...il.com>
>> CC: Alexei Starovoitov <ast@...mgrid.com>
>> CC: Will Drewry <wad@...omium.org>
>> CC: Kees Cook <keescook@...omium.org>
>> CC: x86@...nel.org
>> CC: linux-kernel@...r.kernel.org
>> ---
>>  arch/x86/kernel/entry_64.S | 61 +++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 49 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index da137f9..2003417 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -537,33 +537,70 @@ END(ret_from_fork)
>>   * Build the entry stubs and pointer table with some assembler magic.
>>   * We pack 7 stubs into a single 32-byte chunk, which will fit in a
>>   * single cache line on all modern x86 implementations.
>> + * The last few cachelines pack 8 stubs each.
>>   */
>> +ALIGN_common_interrupt=32
>>  	.section .init.rodata,"a"
>>  ENTRY(interrupt)
>>  	.section .entry.text
>> -	.p2align 5
>> -	.p2align CONFIG_X86_L1_CACHE_SHIFT
>> +	.align 32
>>  ENTRY(irq_entries_start)
>>  	INTR_FRAME
>>  vector=FIRST_EXTERNAL_VECTOR
>> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7
>> -	.balign 32
>> +.rept 256 /* this number does not need to be exact, just big enough */
>> +
>> +  /*
>> +   * Block of six "push + short_jump" pairs, 4 bytes each,
>> +   * and a 2-byte seventh "push", without its jump yet.
>> +   */
>> +need_near_jump=0
>> +push_count=0
>>    .rept	7
>>      .if vector < FIRST_SYSTEM_VECTOR
>> -      .if vector <> FIRST_EXTERNAL_VECTOR
>> +1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> +	.previous
>> +	.quad 1b
>> +	.section .entry.text
>> +vector=vector+1
>> +push_count=push_count+1
>> +      .if push_count < 7
>> +	/* Can we reach common_interrupt with a short jump? */
>> +	.if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
>> +	  jmp common_interrupt  /* yes */
>> +	.else
>> +	  jmp 2f
>> +need_near_jump=1
>> +	.endif
>>  	CFI_ADJUST_CFA_OFFSET -8
>>        .endif
>> +    .endif
>> +  .endr
>> +
>> +  /* If we are close to the end, we can pack in yet another pair */
>> +  .if need_near_jump == 0
>> +    .if push_count == 7
>> +	/* The "short jump" for the seventh "push" */
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +    .endif
>> +    .if vector < FIRST_SYSTEM_VECTOR
>> +	/* "push + short_jump" pair #8 */
>>  1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
>> -      .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
>> -	jmp 2f
>> -      .endif
>> -      .previous
>> +	.previous
>>  	.quad 1b
>> -      .section .entry.text
>> +	.section .entry.text
>>  vector=vector+1
>> +push_count=push_count+1
>> +	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>>      .endif
>> -  .endr
>> +  .else
>> +	/* Use remaining space for "near jump" for the seventh "push" */
>>  2:	jmp common_interrupt
>> +	CFI_ADJUST_CFA_OFFSET -8
>> +	.align 2
>> +  .endif
>> +
>>  .endr
>>  	CFI_ENDPROC
>>  END(irq_entries_start)
> 
> So I think this might be easier to read if we went low tech and used
> an open-coded 240+ line assembly file for this, with a few
> obvious-purpose helper macros? It's not like it will change all that
> often so it only has to be generated once.


How about this version?
It still isn't a star of readability,
but the structure of the 32-byte code block is more visible now...


/*
 * Build the entry stubs and pointer table with some assembler magic.
 * We pack 7 stubs into a single 32-byte chunk, which will fit in a
 * single cache line on all modern x86 implementations.
 * The last few cachelines pack 8 stubs each.
 */
ALIGN_common_interrupt=32
	.section .init.rodata,"a"
ENTRY(interrupt)
	.section .entry.text
	.align 32
ENTRY(irq_entries_start)
	INTR_FRAME

	.macro push_vector
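	/*
	 * Emit the 2-byte push of the encoded vector number and record this
	 * stub's address in the 'interrupt' pointer table; does nothing once
	 * FIRST_SYSTEM_VECTOR has been reached.
	 */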
    .if vector < FIRST_SYSTEM_VECTOR
1:	pushq_cfi $(~vector+0x80)	/* Note: always in signed byte range */
	.previous
	.quad 1b
	.section .entry.text
need_jump=1
vector=vector+1
    .endif
	.endm

	.macro short_jump label
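	/*
	 * Emit the 2-byte jump that follows a push: directly to
	 * common_interrupt when it is already within short-jump range,
	 * otherwise to the given local label (the shared trampoline at the
	 * end of the block).
	 */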
    .if need_jump == 1
	/* Can we reach common_interrupt with a short jump? */
    .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4)
	jmp	common_interrupt  /* yes */
    .else
	jmp	\label
need_near_jump=1
    .endif
	CFI_ADJUST_CFA_OFFSET -8
need_jump=0
    .endif
	.endm

	.macro near_jump label
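	/*
	 * The jump shared by a block's first seven stubs (the one at the
	 * "2:" label); usually needs the longer, 5-byte form.
	 */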
	jmp	\label
	CFI_ADJUST_CFA_OFFSET -8
	.endm

vector=FIRST_EXTERNAL_VECTOR
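/*
 * Each .rept iteration emits one 32-byte block: six push/short-jump pairs,
 * then either a seventh push followed by the shared near jump at "2:", or,
 * when common_interrupt is already within short-jump range, a seventh and
 * an eighth pair that jump to it directly.
 */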
.rept 256 /* this number does not need to be exact, just big enough */
need_near_jump=0
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
	push_vector; short_jump 2f
    .if need_near_jump == 1
	push_vector; 2:	near_jump common_interrupt; .align 2
    .else
	push_vector; short_jump common_interrupt
	push_vector; short_jump common_interrupt
    .endif
.endr
        CFI_ENDPROC
END(irq_entries_start)

.previous
END(interrupt)
.previous
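
As a side note, the "(128-ALIGN_common_interrupt)/4" reachability condition can
also be sanity-checked outside the assembler. The C sketch below is illustrative
only (not part of the patch) and relies on stated assumptions: every stub emitted
once the condition holds occupies 4 bytes (2-byte push + 2-byte short jmp), the
padding before common_interrupt is at most ALIGN_common_interrupt-1 bytes, and a
short (rel8) jmp reaches at most 127 bytes forward:

/*
 * Worst-case bound check for the "can we reach common_interrupt with a
 * short jump?" condition.  Illustrative sketch only; the constants below
 * mirror ALIGN_common_interrupt=32 and the rel8 range, not kernel headers.
 */
#include <stdio.h>

#define ALIGN_COMMON_INTERRUPT	32
#define SHORT_JMP_MAX_FWD	127	/* maximum forward reach of a rel8 jmp */

int main(void)
{
	/* Number of vectors the condition allows to short-jump directly. */
	int threshold = (128 - ALIGN_COMMON_INTERRUPT) / 4;	/* 24 */
	int ok = 1;

	for (int left = 1; left <= threshold; left++) {
		/*
		 * "left" stands for FIRST_SYSTEM_VECTOR - vector at the point
		 * where the short jmp is emitted, i.e. how many 4-byte stubs
		 * still sit between the end of this jmp and the padding that
		 * precedes common_interrupt.
		 */
		int worst_case = left * 4 + (ALIGN_COMMON_INTERRUPT - 1);

		printf("%2d stubs remaining: worst-case distance %3d bytes%s\n",
		       left, worst_case,
		       worst_case <= SHORT_JMP_MAX_FWD ? "" : "  <-- out of range");
		if (worst_case > SHORT_JMP_MAX_FWD)
			ok = 0;
	}
	printf(ok ? "condition is safe under these assumptions\n"
		  : "condition can exceed the rel8 range\n");
	return 0;
}

With these assumptions the worst case is 24*4 + 31 = 127 bytes, exactly at the
edge of the short-jump range; the extra slack observed in the changelog (even
with -1 subtracted) comes from the actual padding being smaller than the 31-byte
worst case (zero bytes with the current FIRST_SYSTEM_VECTOR).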
