lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sun, 14 Mar 2021 18:23:12 +0100
From:   Uladzislau Rezki <urezki@...il.com>
To:     Topi Miettinen <toiwoton@...il.com>
Cc:     linux-hardening@...r.kernel.org, akpm@...ux-foundation.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Andy Lutomirski <luto@...nel.org>,
        Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        Linux API <linux-api@...r.kernel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mike Rapoport <rppt@...nel.org>, Vlad Rezki <urezki@...il.com>
Subject: Re: [PATCH v4] mm/vmalloc: randomize vmalloc() allocations

> Memory mappings inside kernel allocated with vmalloc() are in
> predictable order and packed tightly toward the low addresses, except
> for per-cpu areas which start from top of the vmalloc area. With
> new kernel boot parameter 'randomize_vmalloc=1', the entire area is
> used randomly to make the allocations less predictable and harder to
> guess for attackers. Also module and BPF code locations get randomized
> (within their dedicated and rather small area though) and if
> CONFIG_VMAP_STACK is enabled, also kernel thread stack locations.
> 
> On 32 bit systems this may cause problems due to increased VM
> fragmentation if the address space gets crowded.
> 
> On all systems, it will reduce performance and increase memory and
> cache usage due to less efficient use of page tables and inability to
> merge adjacent VMAs with compatible attributes. On x86_64 with 5 level
> page tables, in the worst case, additional page table entries of up to
> 4 pages are created for each mapping, so with small mappings there's
> considerable penalty.
> 
> Without randomize_vmalloc=1:
> $ grep -v kernel_clone /proc/vmallocinfo
> 0xffffc90000000000-0xffffc90000009000   36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
> 0xffffc90000009000-0xffffc9000000b000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
> 0xffffc9000000c000-0xffffc9000000f000   12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
> 0xffffc9000000f000-0xffffc90000011000    8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
> 0xffffc90000011000-0xffffc90000013000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000013000-0xffffc90000015000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000015000-0xffffc90000017000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000021000-0xffffc90000023000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffc90000023000-0xffffc90000025000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
> 0xffffc90000025000-0xffffc90000027000    8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
> 0xffffc90000031000-0xffffc90000036000   20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
> 0xffffc90000043000-0xffffc90000047000   16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
> 0xffffc90000211000-0xffffc90000232000  135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
> 0xffffc90000232000-0xffffc90000253000  135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
> 0xffffc900005a9000-0xffffc900005ba000   69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
> 0xffffc900005ba000-0xffffc900005cc000   73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
> 0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
> 
> With randomize_vmalloc=1, the allocations are randomized:
> $ grep -v kernel_clone /proc/vmallocinfo
> 0xffffc9759d443000-0xffffc9759d445000    8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
> 0xffffccf1e9f66000-0xffffccf1e9f68000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffcd2fc02a4000-0xffffcd2fc02a6000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffcdaefb898000-0xffffcdaefb89b000   12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
> 0xffffcef8074c3000-0xffffcef8074cc000   36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
> 0xffffcf725ca2e000-0xffffcf725ca4f000  135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
> 0xffffd0efb25e1000-0xffffd0efb25f2000   69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
> 0xffffd27054678000-0xffffd2705467c000   16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
> 0xffffd2adf716e000-0xffffd2adf7180000   73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
> 0xffffd4ba5fb6b000-0xffffd4ba5fb6d000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
> 0xffffded126192000-0xffffded126194000    8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
> 0xffffe01a4dbcd000-0xffffe01a4dbcf000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe4b649952000-0xffffe4b649954000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
> 0xffffe71ed592a000-0xffffe71ed592c000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
> 0xffffe7dc5824f000-0xffffe7dc58270000  135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
> 0xffffe8f4f9800000-0xffffe8f4f9a00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
> 0xffffe8f4f9a19000-0xffffe8f4f9a1e000   20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
> 
> With CONFIG_VMAP_STACK, also kernel thread stacks are placed in
> vmalloc area and therefore they also get randomized (only one example
> line from /proc/vmallocinfo shown for brevity):
> 
> unrandomized:
> 0xffffc90000018000-0xffffc90000021000   36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
> 
> randomized:
> 0xffffcb57611a8000-0xffffcb57611b1000   36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
> 
> CC: Andrew Morton <akpm@...ux-foundation.org>
> CC: Andy Lutomirski <luto@...nel.org>
> CC: Jann Horn <jannh@...gle.com>
> CC: Kees Cook <keescook@...omium.org>
> CC: Linux API <linux-api@...r.kernel.org>
> CC: Matthew Wilcox <willy@...radead.org>
> CC: Mike Rapoport <rppt@...nel.org>
> CC: Vlad Rezki <urezki@...il.com>
> Signed-off-by: Topi Miettinen <toiwoton@...il.com>
> ---
> v2: retry allocation from other end of vmalloc space in case of
> failure (Matthew Wilcox), improve commit message and documentation
> v3: randomize also percpu allocations (pcpu_get_vm_areas())
> v4: use static branches (Kees Cook) and make the parameter boolean.
> ---
>  .../admin-guide/kernel-parameters.txt         | 24 ++++++++++
>  mm/vmalloc.c                                  | 44 +++++++++++++++++--
>  2 files changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a10b545c2070..726aec542079 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4024,6 +4024,30 @@
>  
>  	ramdisk_start=	[RAM] RAM disk image start address
>  
> +	randomize_vmalloc= [KNL] Boolean option to randomize vmalloc()
> +			allocations. When enabled, the entire
> +			vmalloc() area is used randomly to make the
> +			allocations less predictable and harder to
> +			guess for attackers. Also module and BPF code
> +			locations get randomized (within their
> +			dedicated and rather small area though) and if
> +			CONFIG_VMAP_STACK is enabled, also kernel
> +			thread stack locations.
> +
> +			On 32 bit systems this may cause problems due
> +			to increased VM fragmentation if the address
> +			space gets crowded.
>
What kind of problems? Could you please more cpecific? I guess fail
ratio will be increased.

> +
> +			On all systems, it will reduce performance and
> +			increase memory and cache usage due to less
> +			efficient use of page tables and inability to
> +			merge adjacent VMAs with compatible
> +			attributes. On x86_64 with 5 level page
> +			tables, in the worst case, additional page
> +			table entries of up to 4 pages are created for
> +			each mapping, so with small mappings there's
> +			considerable penalty.
Could you please put test results to the commit message? You can run
"test_vmalloc.sh performance" on you system. It will give us some
figures to understand the performance difference.

> +
>  	random.trust_cpu={on,off}
>  			[KNL] Enable or disable trusting the use of the
>  			CPU's random number generator (if available) to
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e6f352bf0498..b5ecf27dc98e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -34,6 +34,7 @@
>  #include <linux/bitops.h>
>  #include <linux/rbtree_augmented.h>
>  #include <linux/overflow.h>
> +#include <linux/random.h>
>  
>  #include <linux/uaccess.h>
>  #include <asm/tlbflush.h>
> @@ -1089,6 +1090,25 @@ adjust_va_to_fit_type(struct vmap_area *va,
>  	return 0;
>  }
>  
> +static DEFINE_STATIC_KEY_FALSE_RO(randomize_vmalloc);
> +
> +static int __init set_randomize_vmalloc(char *str)
> +{
> +	int ret;
> +	bool bool_result;
> +
> +	ret = kstrtobool(str, &bool_result);
> +	if (ret)
> +		return ret;
> +
> +	if (bool_result)
> +		static_branch_enable(&randomize_vmalloc);
> +	else
> +		static_branch_disable(&randomize_vmalloc);
> +	return 1;
> +}
> +__setup("randomize_vmalloc=", set_randomize_vmalloc);
> +
>  /*
>   * Returns a start address of the newly allocated area, if success.
>   * Otherwise a vend is returned that indicates failure.
> @@ -1162,7 +1182,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  				int node, gfp_t gfp_mask)
>  {
>  	struct vmap_area *va, *pva;
> -	unsigned long addr;
> +	unsigned long addr, voffset;
>  	int purged = 0;
>  	int ret;
>  
> @@ -1217,11 +1237,24 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
>  		kmem_cache_free(vmap_area_cachep, pva);
>  
> +	/* Randomize allocation */
> +	if (static_branch_unlikely(&randomize_vmalloc)) {
> +		voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
> +		voffset = PAGE_ALIGN(voffset);
> +		if (voffset + size > vend - vstart)
> +			voffset = vend - vstart - size;
> +	} else
> +		voffset = 0;
> +
Could you please wrap that change into a separate function? For example
randomize_voffset_with_range(start, end). 

>  	/*
>  	 * If an allocation fails, the "vend" address is
>  	 * returned. Therefore trigger the overflow path.
>  	 */
> -	addr = __alloc_vmap_area(size, align, vstart, vend);
> +	addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
> +
> +	if (unlikely(addr == vend) && voffset)
> +		/* Retry randomization from other end */
> +		addr = __alloc_vmap_area(size, align, vstart, vstart + voffset + size);
>  	spin_unlock(&free_vmap_area_lock);
>  
>  	if (unlikely(addr == vend))
> @@ -3258,7 +3291,12 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  	start = offsets[area];
>  	end = start + sizes[area];
>  
> -	va = pvm_find_va_enclose_addr(vmalloc_end);
> +	if (static_branch_unlikely(&randomize_vmalloc))
> +		va = pvm_find_va_enclose_addr(vmalloc_start +
> +					      (get_random_long() &
> +					       (roundup_pow_of_two(vmalloc_end - vmalloc_start) - 1)));
> +	else
> +		va = pvm_find_va_enclose_addr(vmalloc_end);
>  	base = pvm_determine_end_from_reverse(&va, align) - end;
As for per-cpu embedded alloator. Even though currently it is part of
the vmalloc space, it is not a vmalloc() allocation. Please do not change
its code. It does alloations by "chunks" where an internal structure
represent special memory layout that is used for actual allocations.


Also, using vmaloc test driver i can trigger a kernel BUG:

<snip>
[   24.627577] kernel BUG at mm/vmalloc.c:1272!
[   24.628645] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   24.628684] CPU: 30 PID: 929 Comm: vmalloc_test/0 Tainted: G            E     5.11.0-next-20210225-next #484
[   24.628724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[   24.628757] RIP: 0010:alloc_vmap_area.isra.53+0x593/0xf10
[   24.628784] Code: 41 5d 41 5e 41 5f c3 0f 0b 0f 0b 48 c7 44 24 10 f0 ff ff ff eb c9 48 8d 5a f0 e9 9c fc ff ff 48 c7 44 24 10 f4 ff ff ff eb b5 <0f> 0b 4c 8d 4b 10 48 39 d0 74 12 48 8b 44 24 18 31 ff 48 89 03 48
[   24.628853] RSP: 0018:ffffc4296cf67d38 EFLAGS: 00010206
[   24.628876] RAX: ffffd6db9e61a000 RBX: ffff8ae9c9309440 RCX: 0000000000000001
[   24.628905] RDX: 0000000080000001 RSI: ffff8ae9c0046be8 RDI: 00000000ffffffff
[   24.628933] RBP: 0000000000002000 R08: ffff8ae9c13699e8 R09: ffffb98000000000
[   24.628961] R10: ffffd6db9e61a000 R11: 000000003aa1c801 R12: ffff8ae9c9f0d280
[   24.628989] R13: 0000008000001fff R14: ffffff8000000000 R15: 0000007fffffffff
[   24.629019] FS:  0000000000000000(0000) GS:ffff8af8bf580000(0000) knlGS:0000000000000000
[   24.629051] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   24.629074] CR2: 000055916370aa80 CR3: 00000006bf40a000 CR4: 00000000000006e0
[   24.629103] Call Trace:
[   24.629128]  ? map_kernel_range_noflush+0x27a/0x360
[   24.629150]  ? kmem_cache_alloc_trace+0x340/0x460
[   24.629172]  __get_vm_area_node.isra.54+0xa7/0x150
[   24.629195]  ? fix_size_alloc_test+0x50/0x50 [test_vmalloc]
[   24.629221]  __vmalloc_node_range+0x64/0x230
[   24.629241]  ? test_func+0xdb/0x1f0 [test_vmalloc]
[   24.629263]  ? fix_size_alloc_test+0x50/0x50 [test_vmalloc]
[   24.629288]  __vmalloc_node+0x3b/0x40
[   24.629305]  ? test_func+0xdb/0x1f0 [test_vmalloc]
[   24.629326]  align_shift_alloc_test+0x39/0x50 [test_vmalloc]
[   24.629350]  test_func+0xdb/0x1f0 [test_vmalloc]
[   24.629372]  ? fix_align_alloc_test+0x50/0x50 [test_vmalloc]
[   24.629396]  kthread+0x13d/0x160
[   24.629413]  ? kthread_park+0x80/0x80
[   24.629431]  ret_from_fork+0x22/0x30
<snip>

--
Vlad Rezki

Powered by blists - more mailing lists