Message-ID: <96a692a0-08eb-6d64-d396-82bee5d7a0a1@deltatee.com>
Date:	Mon, 20 Jun 2016 22:35:16 -0600
From:	Logan Gunthorpe <logang@...tatee.com>
To:	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Kees Cook <keescook@...omium.org>
Cc:	"Rafael J. Wysocki" <rafael@...nel.org>,
	Borislav Petkov <bp@...en8.de>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	lkml <linux-kernel@...r.kernel.org>,
	John Stultz <john.stultz@...aro.org>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Stable <stable@...r.kernel.org>,
	Andy Lutomirski <luto@...nel.org>,
	Brian Gerst <brgerst@...il.com>,
	Denys Vlasenko <dvlasenk@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Linux PM list <linux-pm@...r.kernel.org>,
	Stephen Smalley <sds@...ho.nsa.gov>
Subject: Re: ktime_get_ts64() splat during resume

Hey Rafael,

This patch appears to be working on my laptop. Thanks.

Logan

On 20/06/16 07:22 PM, Rafael J. Wysocki wrote:
> On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
>> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
>>> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
>>> <torvalds@...ux-foundation.org> wrote:
>>>> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@...ysocki.net> wrote:
>>>>>
>>>>> Overall, we seem to be heading towards the "really weird" territory here.
>>>>
>>>> So the whole commit that Boris bisected down to is weird to me.
>>>>
>>>> Why isn't the temporary text mapping just set up unconditionally in
>>>> the temp_level4_pgt?
>>>>
>>>> Why does it have that insane "let's leave the temp_level4_pgt alone
>>>> until we actually switch to it, and save away restore_pgd_addr and the
>>>> restore_pgd, to then be set up at restore time"?
>>>>
>>>> All the other temporary mappings are set up statically in the
>>>> temp_level4_pgt, why not that one?
>>>
>>> The text mapping in temp_level4_pgt has to map the image kernel's
>>> physical entry address to the same virtual address that the image
>>> kernel had for it, because the image kernel will switch over to its
>>> own page tables first and it will use its own kernel text mapping from
>>> that point on.  That may not be the same as the text mapping of the
>>> (currently running) restore (or "boot") kernel.
>>>
>>> So if we set up the text mapping in temp_level4_pgt upfront, we
>>> basically can't reference the original kernel text (or do any
>>> addressing relative to it) any more after switching over to
>>> temp_level4_pgt.
>>>
>>> For some reason I thought that was not doable, but now that I look at
>>> the code it looks like it can be done.  I'll try doing that.
>
> I recalled what the problem was. :-)
>
> In principle, the kernel text mapping in the image kernel may be different
> from the kernel text mapping in the restore ("boot") kernel, but the patch
> I posted a couple of hours ago actually assumed them to be the same, because
> it switched to temp_level4_pgt before jumping to the relocated code.
>
> To get rid of that implicit assumption it has to switch to temp_level4_pgt
> from the relocated code itself, but for that to work, the page containing
> the relocated code must be executable in the original page tables (it isn't
> usually).
>
> So an updated patch is appended.
>
> Again, it works for me, but I'm wondering about everybody else.
>
> Thanks,
> Rafael
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86, which is that the temporary
> page tables set up by it to avoid page tables corruption when the
> last bits of the image kernel's memory contents are copied into
> their original page frames re-use the boot kernel's text mapping,
> but that mapping may very well get corrupted just like any other
> part of the page tables.  Of course, if that happens, the final
> jump to the image kernel's entry point will go to nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames they occupied
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch.  Hence, the kernel text mapping is only needed before
> that code starts to run and then it will only be used just for the
> final jump to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too.  That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables.  Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@...tatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> ---
>  arch/x86/power/hibernate_64.c     |   93 ++++++++++++++++++++++++++++++++------
>  arch/x86/power/hibernate_asm_64.S |   55 +++++++++-------------
>  2 files changed, 104 insertions(+), 44 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
>   * Address to jump to in the last phase of restore in order to get to the image
>   * kernel's text (this value is passed in the image header).
>   */
> -unsigned long restore_jump_address __visible;
> +void *restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
>  /*
>   * Value of the cr3 register from before the hibernation (this value is passed
> @@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
>
>  void *relocated_restore_code __visible;
>
> +static int set_up_temporary_text_mapping(void)
> +{
> +	unsigned long vaddr = (unsigned long)restore_jump_address;
> +	unsigned long paddr = jump_address_phys & PMD_MASK;
> +	pmd_t *pmd;
> +	pud_t *pud;
> +
> +	/*
> +	 * The new mapping only has to cover the page containing the image
> +	 * kernel's entry point (jump_address_phys), because the switch over to
> +	 * it is carried out by relocated code running from a page allocated
> +	 * specifically for this purpose and covered by the identity mapping, so
> +	 * the temporary kernel text mapping is only needed for the final jump.
> +	 * Moreover, in that mapping the virtual address of the image kernel's
> +	 * entry point must be the same as its virtual address in the image
> +	 * kernel (restore_jump_address), so the image kernel's
> +	 * restore_registers() code doesn't find itself in a different area of
> +	 * the virtual address space after switching over to the original page
> +	 * tables used by the image kernel.
> +	 */
> +	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> +	if (!pud)
> +		return -ENOMEM;
> +
> +	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> +	if (!pmd)
> +		return -ENOMEM;
> +
> +	set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
> +	set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
> +	set_pgd(temp_level4_pgt + pgd_index(vaddr),
> +		__pgd(__pa(pud) | _KERNPG_TABLE));
> +
> +	return 0;
> +}
> +
>  static void *alloc_pgt_page(void *context)
>  {
>  	return (void *)get_safe_page(GFP_ATOMIC);
> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
>  	if (!temp_level4_pgt)
>  		return -ENOMEM;
>
> -	/* It is safe to reuse the original kernel mapping */
> -	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> -		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> +	/* Prepare a temporary mapping for the kernel text */
> +	result = set_up_temporary_text_mapping();
> +	if (result)
> +		return result;
>
>  	/* Set up the direct mapping from scratch */
>  	for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +116,45 @@ static int set_up_temporary_mappings(voi
>  	return 0;
>  }
>
> +static int relocate_restore_code(void)
> +{
> +	unsigned long addr;
> +	pgd_t *pgd;
> +	pmd_t *pmd;
> +
> +	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> +	if (!relocated_restore_code)
> +		return -ENOMEM;
> +
> +	memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> +	/* Make the page containing the relocated code executable */
> +	addr = (unsigned long)relocated_restore_code;
> +	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(addr);
> +	pmd = pmd_offset(pud_offset(pgd, addr), addr);
> +	if (pmd_large(*pmd)) {
> +		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> +	} else {
> +		pte_t *pte = pte_offset_kernel(pmd, addr);
> +
> +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> +	}
> +
> +	return 0;
> +}
> +
>  int swsusp_arch_resume(void)
>  {
>  	int error;
>
>  	/* We have got enough memory and from now on we cannot recover */
> -	if ((error = set_up_temporary_mappings()))
> +	error = set_up_temporary_mappings();
> +	if (error)
>  		return error;
>
> -	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> -	if (!relocated_restore_code)
> -		return -ENOMEM;
> -	memcpy(relocated_restore_code, &core_restore_code,
> -	       &restore_registers - &core_restore_code);
> +	error = relocate_restore_code();
> +	if (error)
> +		return error;
>
>  	restore_image();
>  	return 0;
> @@ -108,12 +172,13 @@ int pfn_is_nosave(unsigned long pfn)
>  }
>
>  struct restore_data_record {
> -	unsigned long jump_address;
> +	void *jump_address;
> +	unsigned long jump_address_phys;
>  	unsigned long cr3;
>  	unsigned long magic;
>  };
>
> -#define RESTORE_MAGIC	0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC	0x123456789ABCDEF0UL
>
>  /**
>   *	arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +191,8 @@ int arch_hibernation_header_save(void *a
>
>  	if (max_size < sizeof(struct restore_data_record))
>  		return -EOVERFLOW;
> -	rdr->jump_address = restore_jump_address;
> +	rdr->jump_address = &restore_registers;
> +	rdr->jump_address_phys = __pa_symbol(&restore_registers);
>  	rdr->cr3 = restore_cr3;
>  	rdr->magic = RESTORE_MAGIC;
>  	return 0;
> @@ -142,6 +208,7 @@ int arch_hibernation_header_restore(void
>  	struct restore_data_record *rdr = addr;
>
>  	restore_jump_address = rdr->jump_address;
> +	jump_address_phys = rdr->jump_address_phys;
>  	restore_cr3 = rdr->cr3;
>  	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
>  }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
>  	pushfq
>  	popq	pt_regs_flags(%rax)
>
> -	/* save the address of restore_registers */
> -	movq	$restore_registers, %rax
> -	movq	%rax, restore_jump_address(%rip)
>  	/* save cr3 */
>  	movq	%cr3, %rax
>  	movq	%rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
>  ENDPROC(swsusp_arch_suspend)
>
>  ENTRY(restore_image)
> -	/* switch to temporary page tables */
> -	movq	$__PAGE_OFFSET, %rdx
> -	movq	temp_level4_pgt(%rip), %rax
> -	subq	%rdx, %rax
> -	movq	%rax, %cr3
> -	/* Flush TLB */
> -	movq	mmu_cr4_features(%rip), %rax
> -	movq	%rax, %rdx
> -	andq	$~(X86_CR4_PGE), %rdx
> -	movq	%rdx, %cr4;  # turn off PGE
> -	movq	%cr3, %rcx;  # flush TLB
> -	movq	%rcx, %cr3;
> -	movq	%rax, %cr4;  # turn PGE back on
> -
>  	/* prepare to jump to the image kernel */
> -	movq	restore_jump_address(%rip), %rax
> -	movq	restore_cr3(%rip), %rbx
> +	movq	restore_jump_address(%rip), %r8
> +	movq	restore_cr3(%rip), %r9
> +
> +	/* prepare to switch to temporary page tables */
> +	movq	temp_level4_pgt(%rip), %rax
> +	movq	mmu_cr4_features(%rip), %rbx
>
>  	/* prepare to copy image data to their original locations */
>  	movq	restore_pblist(%rip), %rdx
> +
> +	/* jump to relocated restore code */
>  	movq	relocated_restore_code(%rip), %rcx
>  	jmpq	*%rcx
>
>  	/* code below has been relocated to a safe page */
>  ENTRY(core_restore_code)
> +	/* switch to temporary page tables */
> +	movq	$__PAGE_OFFSET, %rcx
> +	subq	%rcx, %rax
> +	movq	%rax, %cr3
> +	/* flush TLB */
> +	movq	%rbx, %rcx
> +	andq	$~(X86_CR4_PGE), %rcx
> +	movq	%rcx, %cr4;  # turn off PGE
> +	movq	%cr3, %rcx;  # flush TLB
> +	movq	%rcx, %cr3;
> +	movq	%rbx, %cr4;  # turn PGE back on
>  .Lloop:
>  	testq	%rdx, %rdx
>  	jz	.Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
>  	/* progress to the next pbe */
>  	movq	pbe_next(%rdx), %rdx
>  	jmp	.Lloop
> +
>  .Ldone:
>  	/* jump to the restore_registers address from the image header */
> -	jmpq	*%rax
> -	/*
> -	 * NOTE: This assumes that the boot kernel's text mapping covers the
> -	 * image kernel's page containing restore_registers and the address of
> -	 * this page is the same as in the image kernel's text mapping (it
> -	 * should always be true, because the text mapping is linear, starting
> -	 * from 0, and is supposed to cover the entire kernel text for every
> -	 * kernel).
> -	 *
> -	 * code below belongs to the image kernel
> -	 */
> +	jmpq	*%r8
>
> +	 /* code below belongs to the image kernel */
> +	.align PAGE_SIZE
>  ENTRY(restore_registers)
>  	FRAME_BEGIN
>  	/* go back to the original page tables */
> -	movq    %rbx, %cr3
> +	movq    %r9, %cr3
>
>  	/* Flush TLB, including "global" things (vmalloc) */
>  	movq	mmu_cr4_features(%rip), %rax
>
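
The address arithmetic behind set_up_temporary_text_mapping() in the quoted patch can be tried out in user space. The following is only a sketch under stated assumptions: it hard-codes the usual x86-64 4-level paging layout (4 KiB pages, 2 MiB large pages, 512 entries per table), and every sketch_* name is a hypothetical stand-in for the kernel's pgd_index(), pud_index(), pmd_index() and PMD_MASK, not kernel code. The example virtual address 0xffffffff81000000 is an assumed typical location of kernel text, not a value taken from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 4-level paging layout; these mirror, but are not, the kernel macros. */
#define SKETCH_PMD_SHIFT      21   /* one PMD entry maps a 2 MiB large page */
#define SKETCH_PUD_SHIFT      30   /* one PUD entry maps 1 GiB */
#define SKETCH_PGDIR_SHIFT    39   /* one PGD entry maps 512 GiB */
#define SKETCH_PTRS_PER_TABLE 512ULL
#define SKETCH_PMD_MASK       (~((1ULL << SKETCH_PMD_SHIFT) - 1))

/* Index of the PGD slot that covers vaddr. */
static unsigned int sketch_pgd_index(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> SKETCH_PGDIR_SHIFT) & (SKETCH_PTRS_PER_TABLE - 1));
}

/* Index of the PUD slot that covers vaddr. */
static unsigned int sketch_pud_index(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> SKETCH_PUD_SHIFT) & (SKETCH_PTRS_PER_TABLE - 1));
}

/* Index of the PMD slot that covers vaddr. */
static unsigned int sketch_pmd_index(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> SKETCH_PMD_SHIFT) & (SKETCH_PTRS_PER_TABLE - 1));
}

/* Base of the 2 MiB large page containing paddr (what "paddr & PMD_MASK" yields). */
static uint64_t sketch_large_page_base(uint64_t paddr)
{
	return paddr & SKETCH_PMD_MASK;
}

/* Usage example: where an assumed kernel text address lands in each table. */
static void sketch_example(void)
{
	uint64_t vaddr = 0xffffffff81000000ULL; /* assumed entry point address */

	assert(sketch_pgd_index(vaddr) == 511); /* last PGD slot */
	assert(sketch_pud_index(vaddr) == 510);
	assert(sketch_pmd_index(vaddr) == 8);   /* 16 MiB into that 1 GiB region */
	assert(sketch_large_page_base(0x1234567ULL) == 0x1200000ULL);
}
```

Because only one slot per level is filled in, a mapping built this way covers exactly one 2 MiB large page, which matches the patch's stated goal of mapping no more than the image kernel's entry point.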

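The image header change described in the quoted patch (an extra physical-address field and a new RESTORE_MAGIC value) can be illustrated the same way. This is a hedged user-space sketch, not the kernel implementation: the sketch_* names and the explicit function parameters are assumptions made for illustration, and fixed-width integers replace the kernel's void * and unsigned long fields. Only the two magic constants are taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-in for struct restore_data_record after the patch. */
struct sketch_restore_record {
	uint64_t jump_address;      /* virtual address of the entry point */
	uint64_t jump_address_phys; /* physical address of the entry point */
	uint64_t cr3;
	uint64_t magic;
};

#define SKETCH_RESTORE_MAGIC     0x123456789ABCDEF0ULL /* value after the patch */
#define SKETCH_OLD_RESTORE_MAGIC 0x0123456789ABCDEFULL /* value before the patch */

/* Fill the header buffer, refusing buffers that are too small. */
static int sketch_header_save(void *buf, size_t max_size, uint64_t vaddr,
			      uint64_t paddr, uint64_t cr3)
{
	struct sketch_restore_record rec = {
		.jump_address = vaddr,
		.jump_address_phys = paddr,
		.cr3 = cr3,
		.magic = SKETCH_RESTORE_MAGIC,
	};

	assert(buf != NULL);
	if (max_size < sizeof(rec))
		return -EOVERFLOW;
	memcpy(buf, &rec, sizeof(rec));
	return 0;
}

/* Read the header back and reject it unless the magic matches the new format. */
static int sketch_header_restore(const void *buf, struct sketch_restore_record *out)
{
	memcpy(out, buf, sizeof(*out));
	return out->magic == SKETCH_RESTORE_MAGIC ? 0 : -EINVAL;
}
```

Changing the magic value means that a header written in the old layout fails the check on restore instead of being misread with the fields shifted by one slot.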