Date:	Mon, 29 Jun 2009 13:55:05 -0700
From:	Randy Dunlap <randy.dunlap@...cle.com>
To:	Amerigo Wang <amwang@...hat.com>
CC:	linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
	jaswinder@...nel.org, mingo@...e.hu
Subject: Re: [RESEND Patch 2/2] Doc: move Documentation/exception.txt into
 x86 subdir

Amerigo Wang wrote:
> exception.txt only explains the code on x86, so it's better to
> move it into the Documentation/x86 directory.
> 
> Also rename it to exception-tables.txt, which is a more
> descriptive name.
> 
> This patch is on top of the previous one.
> 
> Signed-off-by: WANG Cong <amwang@...hat.com>
> Cc: Randy Dunlap <randy.dunlap@...cle.com>

Acked-by: Randy Dunlap <randy.dunlap@...cle.com>


> Cc: Ingo Molnar <mingo@...e.hu>
> Cc: jaswinder@...nel.org
> 
> ---
> Index: linux-2.6/Documentation/exception.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/exception.txt
> +++ /dev/null
> @@ -1,292 +0,0 @@
> -     Kernel level exception handling in Linux
> -  Commentary by Joerg Pommnitz <joerg@...eigh.ibm.com>
> -
> -When a process runs in kernel mode, it often has to access user
> -mode memory whose address has been passed by an untrusted program.
> -To protect itself the kernel has to verify this address.
> -
> -In older versions of Linux this was done with the
> -int verify_area(int type, const void * addr, unsigned long size)
> -function (which has since been replaced by access_ok()).
> -
> -This function verified that the memory area starting at address
> -'addr' and of size 'size' was accessible for the operation specified
> -in type (read or write). To do this, verify_area had to look up the
> -virtual memory area (vma) that contained the address addr. In the
> -normal case (correctly working program), this test was successful.
> -It only failed for a few buggy programs. In some kernel profiling
> -tests, this normally unneeded verification used up a considerable
> -amount of time.
> -
> -To overcome this situation, Linus decided to let the virtual memory
> -hardware present in every Linux-capable CPU handle this test.
> -
> -How does this work?
> -
> -Whenever the kernel tries to access an address that is currently not
> -accessible, the CPU generates a page fault exception and calls the
> -page fault handler
> -
> -void do_page_fault(struct pt_regs *regs, unsigned long error_code)
> -
> -in arch/x86/mm/fault.c. The parameters on the stack are set up by
> -the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
> -regs is a pointer to the saved registers on the stack, error_code
> -contains a reason code for the exception.
> -
> -do_page_fault first obtains the inaccessible address from the CPU
> -control register CR2. If the address is within the virtual address
> -space of the process, the fault probably occurred because the page
> -was not swapped in, was write protected, or something similar. However,
> -we are interested in the other case: the address is not valid, there
> -is no vma that contains this address. In this case, the kernel jumps
> -to the bad_area label.
> -
> -There it uses the address of the instruction that caused the exception
> -(i.e. regs->eip) to find an address where the execution can continue
> -(fixup). If this search is successful, the fault handler modifies the
> -return address (again regs->eip) and returns. The execution will
> -continue at the address in fixup.
> -
> -Where does fixup point to?
> -
> -Since we jump to the contents of fixup, fixup obviously points
> -to executable code. This code is hidden inside the user access macros.
> -I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
> -as an example. The definition is somewhat hard to follow, so let's peek at
> -the code generated by the preprocessor and the compiler. I selected
> -the get_user call in drivers/char/sysrq.c for a detailed examination.
> -
> -The original code in sysrq.c line 587:
> -        get_user(c, buf);
> -
> -The preprocessor output (edited to become somewhat readable):
> -
> -(
> -  {
> -    long __gu_err = - 14 , __gu_val = 0;
> -    const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
> -    if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
> -       (((sizeof(*(buf))) <= 0xC0000000UL) &&
> -       ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
> -      do {
> -        __gu_err  = 0;
> -        switch ((sizeof(*(buf)))) {
> -          case 1:
> -            __asm__ __volatile__(
> -              "1:      mov" "b" " %2,%" "b" "1\n"
> -              "2:\n"
> -              ".section .fixup,\"ax\"\n"
> -              "3:      movl %3,%0\n"
> -              "        xor" "b" " %" "b" "1,%" "b" "1\n"
> -              "        jmp 2b\n"
> -              ".section __ex_table,\"a\"\n"
> -              "        .align 4\n"
> -              "        .long 1b,3b\n"
> -              ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
> -                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
> -              break;
> -          case 2:
> -            __asm__ __volatile__(
> -              "1:      mov" "w" " %2,%" "w" "1\n"
> -              "2:\n"
> -              ".section .fixup,\"ax\"\n"
> -              "3:      movl %3,%0\n"
> -              "        xor" "w" " %" "w" "1,%" "w" "1\n"
> -              "        jmp 2b\n"
> -              ".section __ex_table,\"a\"\n"
> -              "        .align 4\n"
> -              "        .long 1b,3b\n"
> -              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
> -                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
> -              break;
> -          case 4:
> -            __asm__ __volatile__(
> -              "1:      mov" "l" " %2,%" "" "1\n"
> -              "2:\n"
> -              ".section .fixup,\"ax\"\n"
> -              "3:      movl %3,%0\n"
> -              "        xor" "l" " %" "" "1,%" "" "1\n"
> -              "        jmp 2b\n"
> -              ".section __ex_table,\"a\"\n"
> -              "        .align 4\n"        "        .long 1b,3b\n"
> -              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
> -                            (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
> -              break;
> -          default:
> -            (__gu_val) = __get_user_bad();
> -        }
> -      } while (0) ;
> -    ((c)) = (__typeof__(*((buf))))__gu_val;
> -    __gu_err;
> -  }
> -);
> -
> -WOW! Black GCC/assembly magic. This is impossible to follow, so let's
> -see what code gcc generates:
> -
> - >         xorl %edx,%edx
> - >         movl current_set,%eax
> - >         cmpl $24,788(%eax)
> - >         je .L1424
> - >         cmpl $-1073741825,64(%esp)
> - >         ja .L1423
> - > .L1424:
> - >         movl %edx,%eax
> - >         movl 64(%esp),%ebx
> - > #APP
> - > 1:      movb (%ebx),%dl                /* this is the actual user access */
> - > 2:
> - > .section .fixup,"ax"
> - > 3:      movl $-14,%eax
> - >         xorb %dl,%dl
> - >         jmp 2b
> - > .section __ex_table,"a"
> - >         .align 4
> - >         .long 1b,3b
> - > .text
> - > #NO_APP
> - > .L1423:
> - >         movzbl %dl,%esi
> -
> -The optimizer does a good job and gives us something we can actually
> -understand. Can we? The actual user access is quite obvious. Thanks
> -to the unified address space we can just access the address in user
> -memory. But what does the .section stuff do?????
> -
> -To understand this we have to look at the final kernel:
> -
> - > objdump --section-headers vmlinux
> - >
> - > vmlinux:     file format elf32-i386
> - >
> - > Sections:
> - > Idx Name          Size      VMA       LMA       File off  Algn
> - >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
> - >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
> - >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
> - >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
> - >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
> - >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
> - >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
> - >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
> - >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
> - >                   CONTENTS, ALLOC, LOAD, DATA
> - >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
> - >                   ALLOC
> - >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
> - >                   CONTENTS, READONLY
> - >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
> - >                   CONTENTS, READONLY
> -
> -There are obviously two non-standard ELF sections in the generated object
> -file. But first we want to find out what happened to our code in the
> -final kernel executable:
> -
> - > objdump --disassemble --section=.text vmlinux
> - >
> - > c017e785 <do_con_write+c1> xorl   %edx,%edx
> - > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
> - > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
> - > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
> - > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
> - > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
> - > c017e79f <do_con_write+db> movl   %edx,%eax
> - > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
> - > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
> - > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
> -
> -The whole user memory access is reduced to 10 x86 machine instructions.
> -The instructions bracketed in the .section directives are no longer
> -in the normal execution path. They are located in a different section
> -of the executable file:
> -
> - > objdump --disassemble --section=.fixup vmlinux
> - >
> - > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
> - > c0199ffa <.fixup+10ba> xorb   %dl,%dl
> - > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
> -
> -And finally:
> - > objdump --full-contents --section=__ex_table vmlinux
> - >
> - >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
> - >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
> - >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................
> -
> -or in human readable byte order:
> -
> - >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
> - >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
> -                               ^^^^^^^^^^^^^^^^^
> -                               this is the interesting part!
> - >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
> -
> -What happened? The assembly directives
> -
> -.section .fixup,"ax"
> -.section __ex_table,"a"
> -
> -told the assembler to move the following code to the specified
> -sections in the ELF object file. So the instructions
> -3:      movl $-14,%eax
> -        xorb %dl,%dl
> -        jmp 2b
> -ended up in the .fixup section of the object file and the addresses
> -        .long 1b,3b
> -ended up in the __ex_table section of the object file. 1b and 3b
> -are local labels. The local label 1b (1b stands for next label 1
> -backward) is the address of the instruction that might fault, i.e.
> -in our case the address of the label 1 is c017e7a5:
> -the original assembly code: > 1:      movb (%ebx),%dl
> -and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
> -
> -The local label 3 (backwards again) is the address of the code to handle
> -the fault, in our case the actual value is c0199ff5:
> -the original assembly code: > 3:      movl $-14,%eax
> -and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
> -
> -The assembly code
> - > .section __ex_table,"a"
> - >         .align 4
> - >         .long 1b,3b
> -
> -becomes the value pair
> - >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
> -                               ^this is ^this is
> -                               1b       3b
> -c017e7a5,c0199ff5 in the exception table of the kernel.
> -
> -So, what actually happens if a fault from kernel mode with no suitable
> -vma occurs?
> -
> -1.) access to invalid address:
> - > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
> -2.) MMU generates exception
> -3.) CPU calls do_page_fault
> -4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
> -5.) search_exception_table looks up the address c017e7a5 in the
> -    exception table (i.e. the contents of the ELF section __ex_table)
> -    and returns the address of the associated fault handler code c0199ff5.
> -6.) do_page_fault modifies the return address (regs->eip) to point to the
> -    fault handler code and returns.
> -7.) execution continues in the fault handling code.
> -8.) 8a) EAX becomes -EFAULT (== -14)
> -    8b) DL  becomes zero (the value we "read" from user space)
> -    8c) execution continues at local label 2 (address of the
> -        instruction immediately after the faulting user access).
> -
> -In a certain way, steps 8a to 8c emulate the faulting instruction.
> -
> -That's it, mostly. If you look at our example, you might ask why
> -we set EAX to -EFAULT in the exception handler code. Well, the
> -get_user macro actually returns a value: 0 if the user access was
> -successful, -EFAULT on failure. Our original code did not test this
> -return value; however, the inline assembly code in get_user tries to
> -return -EFAULT. GCC selected EAX to return this value.
> -
> -NOTE:
> -Due to the way that the exception table is built and needs to be ordered,
> -only use exceptions for code in the .text section.  Any other section
> -will cause the exception table to not be sorted correctly, and the
> -exceptions will fail.
> Index: linux-2.6/Documentation/x86/00-INDEX
> ===================================================================
> --- linux-2.6.orig/Documentation/x86/00-INDEX
> +++ linux-2.6/Documentation/x86/00-INDEX
> @@ -2,3 +2,5 @@
>  	- this file
>  mtrr.txt
>  	- how to use x86 Memory Type Range Registers to increase performance
> +exception-tables.txt
> +	- why and how Linux kernel uses exception tables on x86
> Index: linux-2.6/Documentation/x86/exception-tables.txt
> ===================================================================
> --- /dev/null
> +++ linux-2.6/Documentation/x86/exception-tables.txt
> @@ -0,0 +1,292 @@
> +     Kernel level exception handling in Linux
> +  Commentary by Joerg Pommnitz <joerg@...eigh.ibm.com>
> +
> +When a process runs in kernel mode, it often has to access user
> +mode memory whose address has been passed by an untrusted program.
> +To protect itself the kernel has to verify this address.
> +
> +In older versions of Linux this was done with the
> +int verify_area(int type, const void * addr, unsigned long size)
> +function (which has since been replaced by access_ok()).
> +
> +This function verified that the memory area starting at address
> +'addr' and of size 'size' was accessible for the operation specified
> +in type (read or write). To do this, verify_area had to look up the
> +virtual memory area (vma) that contained the address addr. In the
> +normal case (correctly working program), this test was successful.
> +It only failed for a few buggy programs. In some kernel profiling
> +tests, this normally unneeded verification used up a considerable
> +amount of time.
> +
> +To overcome this situation, Linus decided to let the virtual memory
> +hardware present in every Linux-capable CPU handle this test.
> +
> +How does this work?
> +
> +Whenever the kernel tries to access an address that is currently not
> +accessible, the CPU generates a page fault exception and calls the
> +page fault handler
> +
> +void do_page_fault(struct pt_regs *regs, unsigned long error_code)
> +
> +in arch/x86/mm/fault.c. The parameters on the stack are set up by
> +the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
> +regs is a pointer to the saved registers on the stack, error_code
> +contains a reason code for the exception.
> +
> +do_page_fault first obtains the inaccessible address from the CPU
> +control register CR2. If the address is within the virtual address
> +space of the process, the fault probably occurred because the page
> +was not swapped in, was write protected, or something similar. However,
> +we are interested in the other case: the address is not valid, there
> +is no vma that contains this address. In this case, the kernel jumps
> +to the bad_area label.
> +
> +There it uses the address of the instruction that caused the exception
> +(i.e. regs->eip) to find an address where the execution can continue
> +(fixup). If this search is successful, the fault handler modifies the
> +return address (again regs->eip) and returns. The execution will
> +continue at the address in fixup.
> +
> +Where does fixup point to?
> +
> +Since we jump to the contents of fixup, fixup obviously points
> +to executable code. This code is hidden inside the user access macros.
> +I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
> +as an example. The definition is somewhat hard to follow, so let's peek at
> +the code generated by the preprocessor and the compiler. I selected
> +the get_user call in drivers/char/sysrq.c for a detailed examination.
> +
> +The original code in sysrq.c line 587:
> +        get_user(c, buf);
> +
> +The preprocessor output (edited to become somewhat readable):
> +
> +(
> +  {
> +    long __gu_err = - 14 , __gu_val = 0;
> +    const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
> +    if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
> +       (((sizeof(*(buf))) <= 0xC0000000UL) &&
> +       ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
> +      do {
> +        __gu_err  = 0;
> +        switch ((sizeof(*(buf)))) {
> +          case 1:
> +            __asm__ __volatile__(
> +              "1:      mov" "b" " %2,%" "b" "1\n"
> +              "2:\n"
> +              ".section .fixup,\"ax\"\n"
> +              "3:      movl %3,%0\n"
> +              "        xor" "b" " %" "b" "1,%" "b" "1\n"
> +              "        jmp 2b\n"
> +              ".section __ex_table,\"a\"\n"
> +              "        .align 4\n"
> +              "        .long 1b,3b\n"
> +              ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
> +                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
> +              break;
> +          case 2:
> +            __asm__ __volatile__(
> +              "1:      mov" "w" " %2,%" "w" "1\n"
> +              "2:\n"
> +              ".section .fixup,\"ax\"\n"
> +              "3:      movl %3,%0\n"
> +              "        xor" "w" " %" "w" "1,%" "w" "1\n"
> +              "        jmp 2b\n"
> +              ".section __ex_table,\"a\"\n"
> +              "        .align 4\n"
> +              "        .long 1b,3b\n"
> +              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
> +                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
> +              break;
> +          case 4:
> +            __asm__ __volatile__(
> +              "1:      mov" "l" " %2,%" "" "1\n"
> +              "2:\n"
> +              ".section .fixup,\"ax\"\n"
> +              "3:      movl %3,%0\n"
> +              "        xor" "l" " %" "" "1,%" "" "1\n"
> +              "        jmp 2b\n"
> +              ".section __ex_table,\"a\"\n"
> +              "        .align 4\n"        "        .long 1b,3b\n"
> +              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
> +                            (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
> +              break;
> +          default:
> +            (__gu_val) = __get_user_bad();
> +        }
> +      } while (0) ;
> +    ((c)) = (__typeof__(*((buf))))__gu_val;
> +    __gu_err;
> +  }
> +);
> +
> +WOW! Black GCC/assembly magic. This is impossible to follow, so let's
> +see what code gcc generates:
> +
> + >         xorl %edx,%edx
> + >         movl current_set,%eax
> + >         cmpl $24,788(%eax)
> + >         je .L1424
> + >         cmpl $-1073741825,64(%esp)
> + >         ja .L1423
> + > .L1424:
> + >         movl %edx,%eax
> + >         movl 64(%esp),%ebx
> + > #APP
> + > 1:      movb (%ebx),%dl                /* this is the actual user access */
> + > 2:
> + > .section .fixup,"ax"
> + > 3:      movl $-14,%eax
> + >         xorb %dl,%dl
> + >         jmp 2b
> + > .section __ex_table,"a"
> + >         .align 4
> + >         .long 1b,3b
> + > .text
> + > #NO_APP
> + > .L1423:
> + >         movzbl %dl,%esi
> +
> +The optimizer does a good job and gives us something we can actually
> +understand. Can we? The actual user access is quite obvious. Thanks
> +to the unified address space we can just access the address in user
> +memory. But what does the .section stuff do?????
> +
> +To understand this we have to look at the final kernel:
> +
> + > objdump --section-headers vmlinux
> + >
> + > vmlinux:     file format elf32-i386
> + >
> + > Sections:
> + > Idx Name          Size      VMA       LMA       File off  Algn
> + >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
> + >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
> + >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
> + >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
> + >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
> + >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
> + >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
> + >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
> + >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
> + >                   CONTENTS, ALLOC, LOAD, DATA
> + >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
> + >                   ALLOC
> + >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
> + >                   CONTENTS, READONLY
> + >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
> + >                   CONTENTS, READONLY
> +
> +There are obviously two non-standard ELF sections in the generated object
> +file. But first we want to find out what happened to our code in the
> +final kernel executable:
> +
> + > objdump --disassemble --section=.text vmlinux
> + >
> + > c017e785 <do_con_write+c1> xorl   %edx,%edx
> + > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
> + > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
> + > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
> + > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
> + > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
> + > c017e79f <do_con_write+db> movl   %edx,%eax
> + > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
> + > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
> + > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
> +
> +The whole user memory access is reduced to 10 x86 machine instructions.
> +The instructions bracketed in the .section directives are no longer
> +in the normal execution path. They are located in a different section
> +of the executable file:
> +
> + > objdump --disassemble --section=.fixup vmlinux
> + >
> + > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
> + > c0199ffa <.fixup+10ba> xorb   %dl,%dl
> + > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
> +
> +And finally:
> + > objdump --full-contents --section=__ex_table vmlinux
> + >
> + >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
> + >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
> + >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................
> +
> +or in human readable byte order:
> +
> + >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
> + >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
> +                               ^^^^^^^^^^^^^^^^^
> +                               this is the interesting part!
> + >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
> +
> +What happened? The assembly directives
> +
> +.section .fixup,"ax"
> +.section __ex_table,"a"
> +
> +told the assembler to move the following code to the specified
> +sections in the ELF object file. So the instructions
> +3:      movl $-14,%eax
> +        xorb %dl,%dl
> +        jmp 2b
> +ended up in the .fixup section of the object file and the addresses
> +        .long 1b,3b
> +ended up in the __ex_table section of the object file. 1b and 3b
> +are local labels. The local label 1b (1b stands for next label 1
> +backward) is the address of the instruction that might fault, i.e.
> +in our case the address of the label 1 is c017e7a5:
> +the original assembly code: > 1:      movb (%ebx),%dl
> +and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
> +
> +The local label 3 (backwards again) is the address of the code to handle
> +the fault, in our case the actual value is c0199ff5:
> +the original assembly code: > 3:      movl $-14,%eax
> +and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
> +
> +The assembly code
> + > .section __ex_table,"a"
> + >         .align 4
> + >         .long 1b,3b
> +
> +becomes the value pair
> + >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
> +                               ^this is ^this is
> +                               1b       3b
> +c017e7a5,c0199ff5 in the exception table of the kernel.
> +
> +So, what actually happens if a fault from kernel mode with no suitable
> +vma occurs?
> +
> +1.) access to invalid address:
> + > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
> +2.) MMU generates exception
> +3.) CPU calls do_page_fault
> +4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
> +5.) search_exception_table looks up the address c017e7a5 in the
> +    exception table (i.e. the contents of the ELF section __ex_table)
> +    and returns the address of the associated fault handler code c0199ff5.
> +6.) do_page_fault modifies the return address (regs->eip) to point to the
> +    fault handler code and returns.
> +7.) execution continues in the fault handling code.
> +8.) 8a) EAX becomes -EFAULT (== -14)
> +    8b) DL  becomes zero (the value we "read" from user space)
> +    8c) execution continues at local label 2 (address of the
> +        instruction immediately after the faulting user access).
> +
> +In a certain way, steps 8a to 8c emulate the faulting instruction.
> +
> +That's it, mostly. If you look at our example, you might ask why
> +we set EAX to -EFAULT in the exception handler code. Well, the
> +get_user macro actually returns a value: 0 if the user access was
> +successful, -EFAULT on failure. Our original code did not test this
> +return value; however, the inline assembly code in get_user tries to
> +return -EFAULT. GCC selected EAX to return this value.
> +
> +NOTE:
> +Due to the way that the exception table is built and needs to be ordered,
> +only use exceptions for code in the .text section.  Any other section
> +will cause the exception table to not be sorted correctly, and the
> +exceptions will fail.
