Date:   Tue, 9 Mar 2021 15:49:13 +0200
From:   Topi Miettinen <toiwoton@...il.com>
To:     Kees Cook <keescook@...omium.org>
Cc:     linux-hardening@...r.kernel.org, akpm@...ux-foundation.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Andy Lutomirski <luto@...nel.org>,
        Jann Horn <jannh@...gle.com>,
        Linux API <linux-api@...r.kernel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Mike Rapoport <rppt@...nel.org>, Vlad Rezki <urezki@...il.com>
Subject: Re: [PATCH v3] mm/vmalloc: randomize vmalloc() allocations

On 8.3.2021 23.38, Kees Cook wrote:
> On Mon, Feb 15, 2021 at 10:26:34PM +0200, Topi Miettinen wrote:
>> Memory mappings allocated inside the kernel with vmalloc() are placed
>> in a predictable order and packed tightly toward the low addresses,
>> except for the per-cpu areas, which start from the top of the vmalloc
>> area. With the new kernel boot parameter 'randomize_vmalloc=1', the
>> entire area is used randomly to make the allocations less predictable
>> and harder for attackers to guess. Module and BPF code locations are
>> also randomized (within their dedicated and rather small area, though),
>> and if CONFIG_VMAP_STACK is enabled, so are kernel thread stack
>> locations.
>>
>> On 32-bit systems this may cause problems due to increased VM
>> fragmentation if the address space gets crowded.
>>
>> On all systems, it will reduce performance and increase memory and
>> cache usage due to less efficient use of page tables and the inability
>> to merge adjacent VMAs with compatible attributes. On x86_64 with
>> 5-level page tables, in the worst case, additional page table entries
>> of up to 4 pages are created for each mapping, so with small mappings
>> there's a considerable penalty.
>>
>> Without randomize_vmalloc=1:
>> $ grep -v kernel_clone /proc/vmallocinfo
>> 0xffffc90000000000-0xffffc90000009000   36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
>> 0xffffc90000009000-0xffffc9000000b000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
>> 0xffffc9000000c000-0xffffc9000000f000   12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
>> 0xffffc9000000f000-0xffffc90000011000    8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
>> 0xffffc90000011000-0xffffc90000013000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000013000-0xffffc90000015000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000015000-0xffffc90000017000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000021000-0xffffc90000023000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000023000-0xffffc90000025000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
>> 0xffffc90000025000-0xffffc90000027000    8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
>> 0xffffc90000031000-0xffffc90000036000   20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
>> 0xffffc90000043000-0xffffc90000047000   16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
>> 0xffffc90000211000-0xffffc90000232000  135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
>> 0xffffc90000232000-0xffffc90000253000  135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
>> 0xffffc900005a9000-0xffffc900005ba000   69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
>> 0xffffc900005ba000-0xffffc900005cc000   73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
>> 0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
>>
>> With randomize_vmalloc=1, the allocations are randomized:
>> $ grep -v kernel_clone /proc/vmallocinfo
>> 0xffffc9759d443000-0xffffc9759d445000    8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
>> 0xffffccf1e9f66000-0xffffccf1e9f68000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffcd2fc02a4000-0xffffcd2fc02a6000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffcdaefb898000-0xffffcdaefb89b000   12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
>> 0xffffcef8074c3000-0xffffcef8074cc000   36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
>> 0xffffcf725ca2e000-0xffffcf725ca4f000  135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
>> 0xffffd0efb25e1000-0xffffd0efb25f2000   69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
>> 0xffffd27054678000-0xffffd2705467c000   16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
>> 0xffffd2adf716e000-0xffffd2adf7180000   73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
>> 0xffffd4ba5fb6b000-0xffffd4ba5fb6d000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
>> 0xffffded126192000-0xffffded126194000    8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
>> 0xffffe01a4dbcd000-0xffffe01a4dbcf000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffe4b649952000-0xffffe4b649954000    8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
>> 0xffffe71ed592a000-0xffffe71ed592c000    8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffe7dc5824f000-0xffffe7dc58270000  135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
>> 0xffffe8f4f9800000-0xffffe8f4f9a00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
>> 0xffffe8f4f9a19000-0xffffe8f4f9a1e000   20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
>>
>> With CONFIG_VMAP_STACK, also kernel thread stacks are placed in
>> vmalloc area and therefore they also get randomized (only one example
>> line from /proc/vmallocinfo shown for brevity):
>>
>> unrandomized:
>> 0xffffc90000018000-0xffffc90000021000   36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
>>
>> randomized:
>> 0xffffcb57611a8000-0xffffcb57611b1000   36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
>>
>> CC: Andrew Morton <akpm@...ux-foundation.org>
>> CC: Andy Lutomirski <luto@...nel.org>
>> CC: Jann Horn <jannh@...gle.com>
>> CC: Kees Cook <keescook@...omium.org>
>> CC: Linux API <linux-api@...r.kernel.org>
>> CC: Matthew Wilcox <willy@...radead.org>
>> CC: Mike Rapoport <rppt@...nel.org>
>> CC: Vlad Rezki <urezki@...il.com>
>> Signed-off-by: Topi Miettinen <toiwoton@...il.com>
> 
> Thanks for working on this! I'd like to see this in the kernel, even if
> it's only for the more paranoid crowd. Have you done any workload
> measurements to see how much of a hit this has in the real world?

I compiled the kernel a few times on an otherwise idle machine; with 
randomize_vmalloc=1 the real times were about 0.3% slower. I'm not sure 
this is a good measure of kernel performance. The randomization should 
increase page table usage, and the extra page table entries also compete 
for CPU cache space.
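
To spell out the worst case mentioned in the commit message (as I 
understand it): with 5-level paging, a randomly placed small mapping may 
not share any of the lower-level page table pages with its neighbours, 
so it can need a fresh P4D, PUD, PMD and PTE page of its own:

  4 page table pages * 4 KiB/page = 16 KiB of extra page table memory
  for a single small (e.g. one-page) mapping, compared to close to
  nothing extra when allocations are packed and share those pages.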

>> ---
>> v2: retry allocation from other end of vmalloc space in case of
>> failure (Matthew Wilcox), improve commit message and documentation
>> v3: randomize also percpu allocations (pcpu_get_vm_areas())
>> ---
>>   .../admin-guide/kernel-parameters.txt         | 23 ++++++++++++
>>   mm/vmalloc.c                                  | 36 +++++++++++++++++--
>>   2 files changed, 56 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index c722ec19cd00..38d6b5728ccc 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -4024,6 +4024,29 @@
>>   
>>   	ramdisk_start=	[RAM] RAM disk image start address
>>   
>> +	randomize_vmalloc= [KNL] Randomize vmalloc() allocations. With 1,
>> +			the entire vmalloc() area is used randomly to
>> +			make the allocations less predictable and
>> +			harder to guess for attackers. Also module and
>> +			BPF code locations get randomized (within
>> +			their dedicated and rather small area though)
>> +			and if CONFIG_VMAP_STACK is enabled, also
>> +			kernel thread stack locations.
>> +
>> +			On 32 bit systems this may cause problems due
>> +			to increased VM fragmentation if the address
>> +			space gets crowded.
>> +
>> +			On all systems, it will reduce performance and
>> +			increase memory and cache usage due to less
>> +			efficient use of page tables and inability to
>> +			merge adjacent VMAs with compatible
>> +			attributes. On x86_64 with 5 level page
>> +			tables, in the worst case, additional page
>> +			table entries of up to 4 pages are created for
>> +			each mapping, so with small mappings there's
>> +			considerable penalty.
>> +
>>   	random.trust_cpu={on,off}
>>   			[KNL] Enable or disable trusting the use of the
>>   			CPU's random number generator (if available) to
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 4d88fe5a277a..1e8e0ee1925f 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -34,6 +34,7 @@
>>   #include <linux/bitops.h>
>>   #include <linux/rbtree_augmented.h>
>>   #include <linux/overflow.h>
>> +#include <linux/random.h>
>>   
>>   #include <linux/uaccess.h>
>>   #include <asm/tlbflush.h>
>> @@ -1089,6 +1090,17 @@ adjust_va_to_fit_type(struct vmap_area *va,
>>   	return 0;
>>   }
>>   
>> +static int randomize_vmalloc = 0;
> 
> This should be __ro_after_init, and even better, it should be a static
> branch so there's no performance hit at all for the "disabled" case.
> I recommend:
> 
> static DEFINE_STATIC_KEY_FALSE_RO(randomize_vmalloc);
> 
> static int __init set_randomize_vmalloc(char *str)
> {
>         int ret;
>         bool bool_result;
> 
>         ret = kstrtobool(str, &bool_result);
>         if (ret)
>                 return ret;
> 
>         if (bool_result)
>                 static_branch_enable(&randomize_vmalloc);
>         else
>                 static_branch_disable(&randomize_vmalloc);
>         return 1;
> }
> __setup("randomize_vmalloc=", set_randomize_vmalloc);
> 
> 
>> +
>>   /*
>>    * Returns a start address of the newly allocated area, if success.
>>    * Otherwise a vend is returned that indicates failure.
>> @@ -1162,7 +1174,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>>   				int node, gfp_t gfp_mask)
>>   {
>>   	struct vmap_area *va, *pva;
>> -	unsigned long addr;
>> +	unsigned long addr, voffset;
>>   	int purged = 0;
>>   	int ret;
>>   
>> @@ -1217,11 +1229,24 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>>   	if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
>>   		kmem_cache_free(vmap_area_cachep, pva);
>>   
>> +	/* Randomize allocation */
>> +	if (randomize_vmalloc) {
> 
> if (static_branch_unlikely(&randomize_vmalloc)) {
> ...


Thanks, I'll post an updated version with static_branch.
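
Roughly, a minimal sketch of how the check site in alloc_vmap_area() 
could then look (assuming the key keeps the name randomize_vmalloc and 
the v3 offset and retry logic stays as above):

        unsigned long voffset = 0;

        if (static_branch_unlikely(&randomize_vmalloc)) {
                /* Pick a page-aligned offset somewhere in [vstart, vend). */
                voffset = get_random_long() &
                          (roundup_pow_of_two(vend - vstart) - 1);
                voffset = PAGE_ALIGN(voffset);
                if (voffset + size > vend - vstart)
                        voffset = vend - vstart - size;
        }

        /* Try from the random offset first... */
        addr = __alloc_vmap_area(size, align, vstart + voffset, vend);

        /* ...and on failure retry from the low end of the range. */
        if (unlikely(addr == vend) && voffset)
                addr = __alloc_vmap_area(size, align, vstart,
                                         vstart + voffset + size);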

> 
>> +		voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
>> +		voffset = PAGE_ALIGN(voffset);
>> +		if (voffset + size > vend - vstart)
>> +			voffset = vend - vstart - size;
>> +	} else
>> +		voffset = 0;
>> +
>>   	/*
>>   	 * If an allocation fails, the "vend" address is
>>   	 * returned. Therefore trigger the overflow path.
>>   	 */
>> -	addr = __alloc_vmap_area(size, align, vstart, vend);
>> +	addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
>> +
>> +	if (unlikely(addr == vend) && voffset)
>> +		/* Retry randomization from other end */
>> +		addr = __alloc_vmap_area(size, align, vstart, vstart + voffset + size);
>>   	spin_unlock(&free_vmap_area_lock);
>>   
>>   	if (unlikely(addr == vend))
>> @@ -3256,7 +3281,12 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>>   	start = offsets[area];
>>   	end = start + sizes[area];
>>   
>> -	va = pvm_find_va_enclose_addr(vmalloc_end);
>> +	if (randomize_vmalloc)
>> +		va = pvm_find_va_enclose_addr(vmalloc_start +
>> +					      (get_random_long() &
>> +					       (roundup_pow_of_two(vmalloc_end - vmalloc_start) - 1)));
>> +	else
>> +		va = pvm_find_va_enclose_addr(vmalloc_end);
>>   	base = pvm_determine_end_from_reverse(&va, align) - end;
>>   
>>   	while (true) {
>> -- 
>> 2.30.0
>>
> 
> But otherwise, yes please. It's a simple change that makes per-boot
> order of vmalloc allocations unpredictable. I'm for it! :)

Great and thanks for the review!

-Topi

