Message-ID: <20160617102612.GA18236@gmail.com>
Date:	Fri, 17 Jun 2016 12:26:12 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Kees Cook <keescook@...omium.org>
Cc:	Thomas Garnier <thgarnie@...gle.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
	Borislav Petkov <bp@...e.de>, Juergen Gross <jgross@...e.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Toshi Kani <toshi.kani@....com>, Baoquan He <bhe@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Dan Williams <dan.j.williams@...el.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Andy Lutomirski <luto@...nel.org>,
	Alexander Kuleshov <kuleshovmail@...il.com>,
	Alexander Popov <alpopov@...ecurity.com>,
	Joerg Roedel <jroedel@...e.de>, Dave Young <dyoung@...hat.com>,
	Lv Zheng <lv.zheng@...el.com>,
	Mark Salter <msalter@...hat.com>,
	Stephen Smalley <sds@...ho.nsa.gov>,
	Dmitry Vyukov <dvyukov@...gle.com>,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	David Rientjes <rientjes@...gle.com>,
	Christian Borntraeger <borntraeger@...ibm.com>,
	Jan Beulich <JBeulich@...e.com>,
	Kefeng Wang <wangkefeng.wang@...wei.com>,
	Seth Jennings <sjennings@...iantweb.net>,
	Yinghai Lu <yinghai@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 2/3] x86/mm: Implement ASLR for kernel memory sections
 (x86_64)


* Kees Cook <keescook@...omium.org> wrote:

> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1993,6 +1993,23 @@ config PHYSICAL_ALIGN
>  
>  	  Don't change this unless you know what you are doing.
>  
> +config RANDOMIZE_MEMORY
> +	bool "Randomize the kernel memory sections"
> +	depends on X86_64
> +	depends on RANDOMIZE_BASE
> +	default RANDOMIZE_BASE
> +	---help---
> +	   Randomizes the base virtual address of kernel memory sections
> +	   (physical memory mapping, vmalloc & vmemmap). This security feature
> +	   makes exploits relying on predictable memory locations less reliable.
> +
> +	   The order of allocations remains unchanged. Entropy is generated in
> +	   the same way as RANDOMIZE_BASE. Current implementation in the optimal
> +	   configuration have in average 30,000 different possible virtual
> +	   addresses for each memory section.
> +
> +	   If unsure, say N.

So this is really poor naming!

The user, to a first approximation, will see something like this:

  Randomize the kernel memory sections (RANDOMIZE_MEMORY) [Y/n/?] (NEW) 

... and could naively conclude that this feature will randomize memory contents.

Furthermore, there's no apparent connection to another, related kernel feature:

  Randomize the address of the kernel image (KASLR) (RANDOMIZE_BASE) [Y/n/?]

A better naming would be something like:

     Randomize other static kernel memory addresses (RANDOMIZE_ALL) [Y/n/?] (NEW) 

(assuming it's truly 'all')

But ... I don't see it explained anywhere why a user, once he has expressed interest 
in randomization, would even want to _not_ randomize as much as possible. I.e. why 
does this new option exist at all - shouldn't we just gradually extend 
randomization to more memory areas and control it via CONFIG_RANDOMIZE_BASE?

Btw., CONFIG_RANDOMIZE_BASE is probably a misnomer already, and after these patches 
it will be even more of a misnomer - so we should rename it to something better. For 
example CONFIG_KASLR would have the advantage of being obviously named at a 
glance, to anyone who knows what 'ASLR' means. To those who don't, the short 
description will explain it:

  Kernel address space layout randomization (KASLR) [Y/n/?]

This would also fit the kernel internal naming of kaslr.c/etc.
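Roughly something like this (just a sketch, the exact wording is up for debate):

  config KASLR
	bool "Kernel address space layout randomization (KASLR)"
	---help---
	   Randomizes the virtual addresses at which the kernel image and
	   other kernel memory areas are mapped, making exploits that rely
	   on predictable kernel addresses less reliable.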

What do you think?

> +++ b/arch/x86/mm/kaslr.c
> @@ -0,0 +1,167 @@
> +/*
> + * This file implements KASLR memory randomization for x86_64. It randomizes
> + * the virtual address space of kernel memory sections (physical memory
> + * mapping, vmalloc & vmemmap) for x86_64. This security feature mitigates
> + * exploits relying on predictable kernel addresses.
> + *
> + * Entropy is generated using the KASLR early boot functions now shared in
> + * the lib directory (originally written by Kees Cook). Randomization is
> + * done on PGD & PUD page table levels to increase possible addresses. The
> + * physical memory mapping code was adapted to support PUD level virtual
> + * addresses. This implementation on the best configuration provides 30,000
> + * possible virtual addresses in average for each memory section.  An
> + * additional low memory page is used to ensure each CPU can start with a
> + * PGD aligned virtual address (for realmode).
> + *
> + * The order of each memory section is not changed. The feature looks at
> + * the available space for the sections based on different configuration
> + * options and randomizes the base and space between each. The size of the
> + * physical memory mapping is the available physical memory.
> + *
> + */

(Stray newline.)

> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/mm.h>
> +#include <linux/smp.h>
> +#include <linux/init.h>
> +#include <linux/memory.h>
> +#include <linux/random.h>
> +
> +#include <asm/processor.h>
> +#include <asm/pgtable.h>
> +#include <asm/pgalloc.h>
> +#include <asm/e820.h>
> +#include <asm/init.h>
> +#include <asm/setup.h>
> +#include <asm/kaslr.h>
> +#include <asm/kasan.h>
> +
> +#include "mm_internal.h"

How many of those include lines are truly necessary, or did this just get copied 
from another file?

> +/*
> + * Memory base and end randomization is based on different configurations.
> + * We want as much space as possible to increase entropy available.
> + */
> +static const unsigned long memory_rand_start = __PAGE_OFFSET_BASE;
> +
> +#if defined(CONFIG_KASAN)
> +static const unsigned long memory_rand_end = KASAN_SHADOW_START;
> +#elif defined(CONFIG_X86_ESPFIX64)
> +static const unsigned long memory_rand_end = ESPFIX_BASE_ADDR;
> +#elif defined(CONFIG_EFI)
> +static const unsigned long memory_rand_end = EFI_VA_START;
> +#else
> +static const unsigned long memory_rand_end = __START_KERNEL_map;
> +#endif

It's unclear to me from the naming and the description whether these are virtual 
or physical memory ranges. Nor is it explained why this laundry list of kernel 
options influences the randomization range - and it's not explained how new kernel 
features modifying the virtual memory layout should plug into this mechanism.

Could we please get a bit more verbose introduction into why all of this is done?
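Even a short comment block would help - something along these lines (my guess at 
the intent, purely from reading the #ifdef chain above):

  /*
   * These are virtual addresses. The randomized range starts at the
   * bottom of the physical memory mapping (__PAGE_OFFSET_BASE) and ends
   * at the first fixed virtual memory area above it: the KASAN shadow,
   * the ESPFIX area, the EFI mapping or the kernel image mapping -
   * whichever of these is configured in and thus lowest.
   */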

> +/* Default values */
> +unsigned long page_offset_base = __PAGE_OFFSET_BASE;
> +EXPORT_SYMBOL(page_offset_base);
> +unsigned long vmalloc_base = __VMALLOC_BASE;
> +EXPORT_SYMBOL(vmalloc_base);
> +unsigned long vmemmap_base = __VMEMMAP_BASE;
> +EXPORT_SYMBOL(vmemmap_base);
> +
> +/* Describe each randomized memory sections in sequential order */
> +static __initdata struct kaslr_memory_region {
> +	unsigned long *base;
> +	unsigned short size_tb;
> +} kaslr_regions[] = {
> +	{ &page_offset_base, 64/* Maximum */ },
> +	{ &vmalloc_base, VMALLOC_SIZE_TB },
> +	{ &vmemmap_base, 1 },
> +};

So this comment is totally misleading. This array does not 'describe' each memory 
section, it actually puts live pointers to virtual memory address offset control 
variables used by other subsystems into this array - allowing us to randomize them 
programmatically.

This patch should be split up into 4 components:

 - core patch adding all the infrastructure but not actually randomizing anything,
   the kaslr_regions[] array is empty.

 - a patch adding page_offset_base randomization.

 - a patch adding vmalloc_base randomization.

 - a patch adding vmemmap_base randomization.

This way if any breakage is introduced by this randomization it can be bisected 
in a fine-grained fashion. It would also have saved 5 minutes of my life trying 
to interpret the code here :-/

In fact I'd suggest a series of 7 patches, with each randomization patch split 
into two further patches:

 - a patch making address offset variables dynamic.

 - a patch actually enabling the randomization of that virtual address range.

This structure allows us to carefully evaluate the code size and runtime impact of 
randomizing a given range.
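
For example the 'make it dynamic' half for the physical memory mapping would be 
little more than this (sketch):

	/* Variable instead of a constant, initialized to the old fixed value: */
	unsigned long page_offset_base = __PAGE_OFFSET_BASE;

	#define __PAGE_OFFSET	page_offset_base

... with the actual randomization of page_offset_base done in the second half.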

> +
> +/* Size in Terabytes + 1 hole */
> +static __init unsigned long get_padding(struct kaslr_memory_region *region)
> +{
> +	return ((unsigned long)region->size_tb + 1) << TB_SHIFT;
> +}

So why is size_tb a short, forcing ugly type casts like this? Also, since this 
appears to be a trivial shift, why isn't this an inline?
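I.e. something like this (untested):

  static inline unsigned long get_padding(struct kaslr_memory_region *region)
  {
	return (region->size_tb + 1UL) << TB_SHIFT;
  }

... with ->size_tb widened to unsigned long, so no cast is needed.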

> +/* Initialize base and padding for each memory section randomized with KASLR */
> +void __init kernel_randomize_memory(void)
> +{
> +	size_t i;
> +	unsigned long addr = memory_rand_start;
> +	unsigned long padding, rand, mem_tb;
> +	struct rnd_state rnd_st;
> +	unsigned long remain_padding = memory_rand_end - memory_rand_start;

Yeah, so these variable names are awful :-/

Look at how postfix and prefix naming is mixed in the same function: 
'memory_rand_start', but 'remain_padding'. Please use postfixes for all of them.

Also why is one variable named 'rand', another 'rnd'? Did we run out of 'a' 
characters?

Similarly, we have 'memory' and 'mem'. Pick one variant and harmonize!
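
I.e. pick one consistent scheme - for example (just an illustration):

	unsigned long vaddr_start, vaddr_end;	/* randomization range */
	unsigned long rand, padding_remain;
	struct rnd_state rand_state;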

This patch is an object lesson in how fragile, insecure code is written.

> +
> +	/*
> +	 * All these BUILD_BUG_ON checks ensures the memory layout is
> +	 * consistent with the current KASLR design.
> +	 */
> +	BUILD_BUG_ON(memory_rand_start >= memory_rand_end);
> +	BUILD_BUG_ON(config_enabled(CONFIG_KASAN) &&
> +		memory_rand_end >= ESPFIX_BASE_ADDR);
> +	BUILD_BUG_ON((config_enabled(CONFIG_KASAN) ||
> +			config_enabled(CONFIG_X86_ESPFIX64)) &&
> +		memory_rand_end >= EFI_VA_START);
> +	BUILD_BUG_ON((config_enabled(CONFIG_KASAN) ||
> +			config_enabled(CONFIG_X86_ESPFIX64) ||
> +			config_enabled(CONFIG_EFI)) &&
> +		memory_rand_end >= __START_KERNEL_map);
> +	BUILD_BUG_ON(memory_rand_end > __START_KERNEL_map);
> +
> +	if (!kaslr_enabled())
> +		return;
> +
> +	BUG_ON(kaslr_regions[0].base != &page_offset_base);
> +	mem_tb = ((max_pfn << PAGE_SHIFT) >> TB_SHIFT);

Why is the granularity of the randomization 1 TB? I don't see this explained 
anywhere. Some of the randomization may have a PUD-granularity limitation - but PUD 
entries have a size of 512 GB.

I think we should extend the kaslr_regions[] table with a minimum required 
alignment field instead of random granularity limitations.
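I.e. something like this (sketch only - the alignment values are placeholders):

  static __initdata struct kaslr_memory_region {
	unsigned long *base;
	unsigned long size_tb;
	unsigned long align;	/* minimum required alignment */
  } kaslr_regions[] = {
	{ &page_offset_base, 64 /* maximum */,	PUD_SIZE },
	{ &vmalloc_base,     VMALLOC_SIZE_TB,	PUD_SIZE },
	{ &vmemmap_base,     1,			PUD_SIZE },
  };

... and the randomization code would round each base up to ->align, instead of 
using a hardcoded 1 TB granularity.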

> +/*
> + * Create PGD aligned trampoline table to allow real mode initialization
> + * of additional CPUs. Consume only 1 low memory page.
> + */
> +void __meminit init_trampoline(void)

This too should probably be a separate patch, for bisectability and reviewability 
reasons.

> +	unsigned long addr, next;
> +	pgd_t *pgd;
> +	pud_t *pud_page, *tr_pud_page;
> +	int i;

I had to look twice to understand that 'tr' stands for 'trampoline'. Please rename 
to 'pud_page_tramp' or so.

> +	if (!kaslr_enabled())
> +		return;
> +
> +	tr_pud_page = alloc_low_page();
> +	set_pgd(&trampoline_pgd_entry, __pgd(_PAGE_TABLE | __pa(tr_pud_page)));

Shouldn't we set it _after_ we've initialized the tr_pud_page page table?

Also, why _PAGE_TABLE? AFAICS using _PAGE_TABLE adds _PAGE_USER.
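
If these are supposed to be kernel-only mappings then presumably _KERNPG_TABLE 
(which is _PAGE_TABLE minus _PAGE_USER) is what we want here:

	set_pgd(&trampoline_pgd_entry,
		__pgd(_KERNPG_TABLE | __pa(tr_pud_page)));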

> +
> +	addr = 0;
> +	pgd = pgd_offset_k((unsigned long)__va(addr));

Why the type cast?

Also, similar complaint to the first patch: why is a physical address named in an 
ambiguous fashion, as 'addr'?

> +	pud_page = (pud_t *) pgd_page_vaddr(*pgd);

Side note: it appears all users of pgd_page_vaddr() are casting it immediately to 
a pointer type. Could we change the return type of this function to 'void *' or 
'pud_t *'? AFAICS it's mostly used in platform code.

> +
> +	for (i = pud_index(addr); i < PTRS_PER_PUD; i++, addr = next) {
> +		pud_t *pud, *tr_pud;
> +
> +		tr_pud = tr_pud_page + pud_index(addr);
> +		pud = pud_page + pud_index((unsigned long)__va(addr));

Why is this type cast needed?

> +		next = (addr & PUD_MASK) + PUD_SIZE;
> +
> +		/* Needed to copy pte or pud alike */
> +		BUILD_BUG_ON(sizeof(pud_t) != sizeof(pte_t));
> +		*tr_pud = *pud;

So I don't understand what brought the size of 'pte' into this code. We are 
copying pud entries around AFAICS, why should the size of the pte entry matter?

Thanks,

	Ingo
