lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGXu5jK4w+2WOEeNqv2CU5wqTY3oDY3pWPgJkM7Lp2AmEZuj2g@mail.gmail.com>
Date:	Wed, 29 Jun 2016 07:48:03 -0700
From:	Kees Cook <keescook@...omium.org>
To:	Logan Gunthorpe <logang@...tatee.com>
Cc:	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	Borislav Petkov <bp@...en8.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Rafael J. Wysocki" <rafael@...nel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	lkml <linux-kernel@...r.kernel.org>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Andy Lutomirski <luto@...nel.org>,
	Brian Gerst <brgerst@...il.com>,
	Denys Vlasenko <dvlasenk@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Linux PM list <linux-pm@...r.kernel.org>,
	Stephen Smalley <sds@...ho.nsa.gov>
Subject: Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption
 during image restoration

On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <logang@...tatee.com> wrote:
> Hey,
>
> This version is working for me as well. Thanks.
>
> Logan
>
> On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
>>
>> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>>>
>>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@...tatee.com>
>>> wrote:
>>>>
>>>> Hey Rafael,
>>>>
>>>> This patch appears to be working on my laptop. Thanks.
>>>
>>>
>>> Same for me: resume still works with KASLR in my tests too.
>>
>>
>> Unfortunately, Boris still sees post-resume memory corruption with the
>> patch
>> you have tested, but that turns out to be a result of some super-weird
>> corruption of a pointer on the stack which leads to a store at a wrong
>> address
>> (and there's no way it can be related to the rest of the patch).
>>
>> We have verified that it can be avoided by rearranging
>> set_up_temporary_text_mapping()
>> to use fewer local variables and the appended patch contains this
>> modification.
>>
>> I also went on and changed relocate_restore_code() slightly in a similar
>> fashion,
>> but all of those changes should not affect the behavior (unless there's
>> something
>> insane going on behind the curtains, that is).
>>
>> Kees, Logan, Boris, please try this one and let me know if it works for
>> you.

Tested-by: Kees Cook <keescook@...omium.org>

I've done a few KASLR boots, and everything continues to look good to
me. Thanks!

-Kees

>>
>> Thanks,
>> Rafael
>>
>>
>> ---
>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption
>> during image restoration
>>
>> Logan Gunthorpe reports that hibernation stopped working reliably for
>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>> and rodata).
>>
>> That turns out to be a consequence of a long-standing issue with the
>> 64-bit image restoration code on x86, which is that the temporary
>> page tables set up by it to avoid page tables corruption when the
>> last bits of the image kernel's memory contents are copied into
>> their original page frames re-use the boot kernel's text mapping,
>> but that mapping may very well get corrupted just like any other
>> part of the page tables.  Of course, if that happens, the final
>> jump to the image kernel's entry point will go to nowhere.
>>
>> The exact reason why commit ab76f7b4ab23 matters here is that it
>> sometimes causes a PMD of a large page to be split into PTEs
>> that are allocated dynamically and get corrupted during image
>> restoration as described above.
>>
>> To fix that issue note that the code copying the last bits of the
>> image kernel's memory contents to the page frames occupied by them
>> previoulsy doesn't use the kernel text mapping, because it runs from
>> a special page covered by the identity mapping set up for that code
>> from scratch.  Hence, the kernel text mapping is only needed before
>> that code starts to run and then it will only be used just for the
>> final jump to the image kernel's entry point.
>>
>> Accordingly, the temporary page tables set up in swsusp_arch_resume()
>> on x86-64 need to contain the kernel text mapping too.  That mapping
>> is only going to be used for the final jump to the image kernel, so
>> it only needs to cover the image kernel's entry point, because the
>> first thing the image kernel does after getting control back is to
>> switch over to its own original page tables.  Moreover, the virtual
>> address of the image kernel's entry point in that mapping has to be
>> the same as the one mapped by the image kernel's page tables.
>>
>> With that in mind, modify the x86-64's arch_hibernation_header_save()
>> and arch_hibernation_header_restore() routines to pass the physical
>> address of the image kernel's entry point (in addition to its virtual
>> address) to the boot kernel (a small piece of assembly code involved
>> in passing the entry point's virtual address to the image kernel is
>> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
>> too to reflect the image header format change.
>>
>> Next, in set_up_temporary_mappings(), use the physical and virtual
>> addresses of the image kernel's entry point passed in the image
>> header to set up a minimum kernel text mapping (using memory pages
>> that won't be overwritten by the image kernel's memory contents) that
>> will map those addresses to each other as appropriate.
>>
>> This makes the concern about the possible corruption of the original
>> boot kernel text mapping go away and if the the minimum kernel text
>> mapping used for the final jump marks the image kernel's entry point
>> memory as executable, the jump to it is guaraneed to succeed.
>>
>> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
>> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
>> Reported-by: Logan Gunthorpe <logang@...tatee.com>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> ---
>>  arch/x86/power/hibernate_64.c     |   90
>> ++++++++++++++++++++++++++++++++------
>>  arch/x86/power/hibernate_asm_64.S |   55 ++++++++++-------------
>>  2 files changed, 102 insertions(+), 43 deletions(-)
>>
>> Index: linux-pm/arch/x86/power/hibernate_64.c
>> ===================================================================
>> --- linux-pm.orig/arch/x86/power/hibernate_64.c
>> +++ linux-pm/arch/x86/power/hibernate_64.c
>> @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
>>   * kernel's text (this value is passed in the image header).
>>   */
>>  unsigned long restore_jump_address __visible;
>> +unsigned long jump_address_phys;
>>
>>  /*
>>   * Value of the cr3 register from before the hibernation (this value is
>> passed
>> @@ -37,7 +38,43 @@ unsigned long restore_cr3 __visible;
>>
>>  pgd_t *temp_level4_pgt __visible;
>>
>> -void *relocated_restore_code __visible;
>> +unsigned long relocated_restore_code __visible;
>> +
>> +static int set_up_temporary_text_mapping(void)
>> +{
>> +       pmd_t *pmd;
>> +       pud_t *pud;
>> +
>> +       /*
>> +        * The new mapping only has to cover the page containing the image
>> +        * kernel's entry point (jump_address_phys), because the switch
>> over to
>> +        * it is carried out by relocated code running from a page
>> allocated
>> +        * specifically for this purpose and covered by the identity
>> mapping, so
>> +        * the temporary kernel text mapping is only needed for the final
>> jump.
>> +        * Moreover, in that mapping the virtual address of the image
>> kernel's
>> +        * entry point must be the same as its virtual address in the
>> image
>> +        * kernel (restore_jump_address), so the image kernel's
>> +        * restore_registers() code doesn't find itself in a different
>> area of
>> +        * the virtual address space after switching over to the original
>> page
>> +        * tables used by the image kernel.
>> +        */
>> +       pud = (pud_t *)get_safe_page(GFP_ATOMIC);
>> +       if (!pud)
>> +               return -ENOMEM;
>> +
>> +       pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
>> +       if (!pmd)
>> +               return -ENOMEM;
>> +
>> +       set_pmd(pmd + pmd_index(restore_jump_address),
>> +               __pmd((jump_address_phys & PMD_MASK) |
>> __PAGE_KERNEL_LARGE_EXEC));
>> +       set_pud(pud + pud_index(restore_jump_address),
>> +               __pud(__pa(pmd) | _KERNPG_TABLE));
>> +       set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
>> +               __pgd(__pa(pud) | _KERNPG_TABLE));
>> +
>> +       return 0;
>> +}
>>
>>  static void *alloc_pgt_page(void *context)
>>  {
>> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
>>         if (!temp_level4_pgt)
>>                 return -ENOMEM;
>>
>> -       /* It is safe to reuse the original kernel mapping */
>> -       set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
>> -               init_level4_pgt[pgd_index(__START_KERNEL_map)]);
>> +       /* Prepare a temporary mapping for the kernel text */
>> +       result = set_up_temporary_text_mapping();
>> +       if (result)
>> +               return result;
>>
>>         /* Set up the direct mapping from scratch */
>>         for (i = 0; i < nr_pfn_mapped; i++) {
>> @@ -78,19 +116,44 @@ static int set_up_temporary_mappings(voi
>>         return 0;
>>  }
>>
>> +static int relocate_restore_code(void)
>> +{
>> +       pgd_t *pgd;
>> +       pmd_t *pmd;
>> +
>> +       relocated_restore_code = get_safe_page(GFP_ATOMIC);
>> +       if (!relocated_restore_code)
>> +               return -ENOMEM;
>> +
>> +       memcpy((void *)relocated_restore_code, &core_restore_code,
>> PAGE_SIZE);
>> +
>> +       /* Make the page containing the relocated code executable */
>> +       pgd = (pgd_t *)__va(read_cr3()) +
>> pgd_index(relocated_restore_code);
>> +       pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
>> +                        relocated_restore_code);
>> +       if (pmd_large(*pmd)) {
>> +               set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>> +       } else {
>> +               pte_t *pte = pte_offset_kernel(pmd,
>> relocated_restore_code);
>> +
>> +               set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>  int swsusp_arch_resume(void)
>>  {
>>         int error;
>>
>>         /* We have got enough memory and from now on we cannot recover */
>> -       if ((error = set_up_temporary_mappings()))
>> +       error = set_up_temporary_mappings();
>> +       if (error)
>>                 return error;
>>
>> -       relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
>> -       if (!relocated_restore_code)
>> -               return -ENOMEM;
>> -       memcpy(relocated_restore_code, &core_restore_code,
>> -              &restore_registers - &core_restore_code);
>> +       error = relocate_restore_code();
>> +       if (error)
>> +               return error;
>>
>>         restore_image();
>>         return 0;
>> @@ -109,11 +172,12 @@ int pfn_is_nosave(unsigned long pfn)
>>
>>  struct restore_data_record {
>>         unsigned long jump_address;
>> +       unsigned long jump_address_phys;
>>         unsigned long cr3;
>>         unsigned long magic;
>>  };
>>
>> -#define RESTORE_MAGIC  0x0123456789ABCDEFUL
>> +#define RESTORE_MAGIC  0x123456789ABCDEF0UL
>>
>>  /**
>>   *     arch_hibernation_header_save - populate the architecture specific
>> part
>> @@ -126,7 +190,8 @@ int arch_hibernation_header_save(void *a
>>
>>         if (max_size < sizeof(struct restore_data_record))
>>                 return -EOVERFLOW;
>> -       rdr->jump_address = restore_jump_address;
>> +       rdr->jump_address = (unsigned long)&restore_registers;
>> +       rdr->jump_address_phys = __pa_symbol(&restore_registers);
>>         rdr->cr3 = restore_cr3;
>>         rdr->magic = RESTORE_MAGIC;
>>         return 0;
>> @@ -142,6 +207,7 @@ int arch_hibernation_header_restore(void
>>         struct restore_data_record *rdr = addr;
>>
>>         restore_jump_address = rdr->jump_address;
>> +       jump_address_phys = rdr->jump_address_phys;
>>         restore_cr3 = rdr->cr3;
>>         return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
>>  }
>> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
>> ===================================================================
>> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
>> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
>> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
>>         pushfq
>>         popq    pt_regs_flags(%rax)
>>
>> -       /* save the address of restore_registers */
>> -       movq    $restore_registers, %rax
>> -       movq    %rax, restore_jump_address(%rip)
>>         /* save cr3 */
>>         movq    %cr3, %rax
>>         movq    %rax, restore_cr3(%rip)
>> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
>>  ENDPROC(swsusp_arch_suspend)
>>
>>  ENTRY(restore_image)
>> -       /* switch to temporary page tables */
>> -       movq    $__PAGE_OFFSET, %rdx
>> -       movq    temp_level4_pgt(%rip), %rax
>> -       subq    %rdx, %rax
>> -       movq    %rax, %cr3
>> -       /* Flush TLB */
>> -       movq    mmu_cr4_features(%rip), %rax
>> -       movq    %rax, %rdx
>> -       andq    $~(X86_CR4_PGE), %rdx
>> -       movq    %rdx, %cr4;  # turn off PGE
>> -       movq    %cr3, %rcx;  # flush TLB
>> -       movq    %rcx, %cr3;
>> -       movq    %rax, %cr4;  # turn PGE back on
>> -
>>         /* prepare to jump to the image kernel */
>> -       movq    restore_jump_address(%rip), %rax
>> -       movq    restore_cr3(%rip), %rbx
>> +       movq    restore_jump_address(%rip), %r8
>> +       movq    restore_cr3(%rip), %r9
>> +
>> +       /* prepare to switch to temporary page tables */
>> +       movq    temp_level4_pgt(%rip), %rax
>> +       movq    mmu_cr4_features(%rip), %rbx
>>
>>         /* prepare to copy image data to their original locations */
>>         movq    restore_pblist(%rip), %rdx
>> +
>> +       /* jump to relocated restore code */
>>         movq    relocated_restore_code(%rip), %rcx
>>         jmpq    *%rcx
>>
>>         /* code below has been relocated to a safe page */
>>  ENTRY(core_restore_code)
>> +       /* switch to temporary page tables */
>> +       movq    $__PAGE_OFFSET, %rcx
>> +       subq    %rcx, %rax
>> +       movq    %rax, %cr3
>> +       /* flush TLB */
>> +       movq    %rbx, %rcx
>> +       andq    $~(X86_CR4_PGE), %rcx
>> +       movq    %rcx, %cr4;  # turn off PGE
>> +       movq    %cr3, %rcx;  # flush TLB
>> +       movq    %rcx, %cr3;
>> +       movq    %rbx, %cr4;  # turn PGE back on
>>  .Lloop:
>>         testq   %rdx, %rdx
>>         jz      .Ldone
>> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
>>         /* progress to the next pbe */
>>         movq    pbe_next(%rdx), %rdx
>>         jmp     .Lloop
>> +
>>  .Ldone:
>>         /* jump to the restore_registers address from the image header */
>> -       jmpq    *%rax
>> -       /*
>> -        * NOTE: This assumes that the boot kernel's text mapping covers
>> the
>> -        * image kernel's page containing restore_registers and the
>> address of
>> -        * this page is the same as in the image kernel's text mapping (it
>> -        * should always be true, because the text mapping is linear,
>> starting
>> -        * from 0, and is supposed to cover the entire kernel text for
>> every
>> -        * kernel).
>> -        *
>> -        * code below belongs to the image kernel
>> -        */
>> +       jmpq    *%r8
>>
>> +        /* code below belongs to the image kernel */
>> +       .align PAGE_SIZE
>>  ENTRY(restore_registers)
>>         FRAME_BEGIN
>>         /* go back to the original page tables */
>> -       movq    %rbx, %cr3
>> +       movq    %r9, %cr3
>>
>>         /* Flush TLB, including "global" things (vmalloc) */
>>         movq    mmu_cr4_features(%rip), %rax
>>
>



-- 
Kees Cook
Chrome OS & Brillo Security

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ