Message-ID: <1cb30d0f-eee7-b778-9f75-216d25fb299f@gmail.com>
Date: Tue, 9 Mar 2021 15:49:13 +0200
From: Topi Miettinen <toiwoton@...il.com>
To: Kees Cook <keescook@...omium.org>
Cc: linux-hardening@...r.kernel.org, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Andy Lutomirski <luto@...nel.org>,
Jann Horn <jannh@...gle.com>,
Linux API <linux-api@...r.kernel.org>,
Matthew Wilcox <willy@...radead.org>,
Mike Rapoport <rppt@...nel.org>, Vlad Rezki <urezki@...il.com>
Subject: Re: [PATCH v3] mm/vmalloc: randomize vmalloc() allocations
On 8.3.2021 23.38, Kees Cook wrote:
> On Mon, Feb 15, 2021 at 10:26:34PM +0200, Topi Miettinen wrote:
>> Memory mappings allocated inside the kernel with vmalloc() are in a
>> predictable order and packed tightly toward the low addresses, except
>> for per-cpu areas, which start from the top of the vmalloc area. With
>> the new kernel boot parameter 'randomize_vmalloc=1', the entire area is
>> used randomly to make the allocations less predictable and harder for
>> attackers to guess. Module and BPF code locations also get randomized
>> (though within their dedicated and rather small areas), and if
>> CONFIG_VMAP_STACK is enabled, so do kernel thread stack locations.
>>
>> On 32-bit systems this may cause problems due to increased VM
>> fragmentation if the address space gets crowded.
>>
>> On all systems, it will reduce performance and increase memory and
>> cache usage due to less efficient use of page tables and the inability
>> to merge adjacent VMAs with compatible attributes. On x86_64 with
>> 5-level page tables, in the worst case, additional page table entries
>> of up to 4 pages are created for each mapping, so with small mappings
>> there is a considerable penalty.
>>
>> Without randomize_vmalloc=1:
>> $ grep -v kernel_clone /proc/vmallocinfo
>> 0xffffc90000000000-0xffffc90000009000 36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
>> 0xffffc90000009000-0xffffc9000000b000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
>> 0xffffc9000000c000-0xffffc9000000f000 12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
>> 0xffffc9000000f000-0xffffc90000011000 8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
>> 0xffffc90000011000-0xffffc90000013000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000013000-0xffffc90000015000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000015000-0xffffc90000017000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000021000-0xffffc90000023000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffc90000023000-0xffffc90000025000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
>> 0xffffc90000025000-0xffffc90000027000 8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
>> 0xffffc90000031000-0xffffc90000036000 20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
>> 0xffffc90000043000-0xffffc90000047000 16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
>> 0xffffc90000211000-0xffffc90000232000 135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
>> 0xffffc90000232000-0xffffc90000253000 135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
>> 0xffffc900005a9000-0xffffc900005ba000 69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
>> 0xffffc900005ba000-0xffffc900005cc000 73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
>> 0xffffe8ffffc00000-0xffffe8ffffe00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
>>
>> With randomize_vmalloc=1, the allocations are randomized:
>> $ grep -v kernel_clone /proc/vmallocinfo
>> 0xffffc9759d443000-0xffffc9759d445000 8192 hpet_enable+0x31/0x4a4 phys=0x00000000fed00000 ioremap
>> 0xffffccf1e9f66000-0xffffccf1e9f68000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffcd2fc02a4000-0xffffcd2fc02a6000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffcdaefb898000-0xffffcdaefb89b000 12288 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe0000 ioremap
>> 0xffffcef8074c3000-0xffffcef8074cc000 36864 irq_init_percpu_irqstack+0x176/0x1c0 vmap
>> 0xffffcf725ca2e000-0xffffcf725ca4f000 135168 crypto_scomp_init_tfm+0xc6/0xf0 pages=32 vmalloc
>> 0xffffd0efb25e1000-0xffffd0efb25f2000 69632 pcpu_create_chunk+0x7b/0x260 pages=16 vmalloc
>> 0xffffd27054678000-0xffffd2705467c000 16384 n_tty_open+0x11/0xe0 pages=3 vmalloc
>> 0xffffd2adf716e000-0xffffd2adf7180000 73728 pcpu_create_chunk+0xb2/0x260 pages=17 vmalloc
>> 0xffffd4ba5fb6b000-0xffffd4ba5fb6d000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x000000001ffe1000 ioremap
>> 0xffffded126192000-0xffffded126194000 8192 memremap+0x19c/0x280 phys=0x00000000000f5000 ioremap
>> 0xffffe01a4dbcd000-0xffffe01a4dbcf000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffe4b649952000-0xffffe4b649954000 8192 acpi_os_map_iomem+0x2ac/0x2d0 phys=0x00000000fed00000 ioremap
>> 0xffffe71ed592a000-0xffffe71ed592c000 8192 gen_pool_add_owner+0x49/0x130 pages=1 vmalloc
>> 0xffffe7dc5824f000-0xffffe7dc58270000 135168 crypto_scomp_init_tfm+0x67/0xf0 pages=32 vmalloc
>> 0xffffe8f4f9800000-0xffffe8f4f9a00000 2097152 pcpu_get_vm_areas+0x0/0x2290 vmalloc
>> 0xffffe8f4f9a19000-0xffffe8f4f9a1e000 20480 pcpu_create_chunk+0xe8/0x260 pages=4 vmalloc
>>
>> With CONFIG_VMAP_STACK, kernel thread stacks are also placed in the
>> vmalloc area and therefore they get randomized as well (only one
>> example line from /proc/vmallocinfo is shown for brevity):
>>
>> unrandomized:
>> 0xffffc90000018000-0xffffc90000021000 36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
>>
>> randomized:
>> 0xffffcb57611a8000-0xffffcb57611b1000 36864 kernel_clone+0xf9/0x560 pages=8 vmalloc
>>
>> CC: Andrew Morton <akpm@...ux-foundation.org>
>> CC: Andy Lutomirski <luto@...nel.org>
>> CC: Jann Horn <jannh@...gle.com>
>> CC: Kees Cook <keescook@...omium.org>
>> CC: Linux API <linux-api@...r.kernel.org>
>> CC: Matthew Wilcox <willy@...radead.org>
>> CC: Mike Rapoport <rppt@...nel.org>
>> CC: Vlad Rezki <urezki@...il.com>
>> Signed-off-by: Topi Miettinen <toiwoton@...il.com>
>
> Thanks for working on this! I'd like to see this in the kernel, even if
> it's only for the more paranoid crowd. Have you done any workload
> measurements to see how much of a hit this has in the real world?
I compiled the kernel a few times on an otherwise idle machine, and with
randomize_vmalloc=1 the real times were about 0.3% slower. I'm not sure
this is a good measure of kernel performance, though. The randomization
should increase the number of page table entries in use, and those
should also compete for CPU caches.
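For reference, the test was along these lines (the job count and build
target here are illustrative, not my exact invocation):

$ time make -j$(nproc) bzImage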
>> ---
>> v2: retry the allocation from the other end of the vmalloc space in
>> case of failure (Matthew Wilcox), improve commit message and documentation
>> v3: also randomize percpu allocations (pcpu_get_vm_areas())
>> ---
>> .../admin-guide/kernel-parameters.txt | 23 ++++++++++++
>> mm/vmalloc.c | 36 +++++++++++++++++--
>> 2 files changed, 56 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index c722ec19cd00..38d6b5728ccc 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -4024,6 +4024,29 @@
>>
>> ramdisk_start= [RAM] RAM disk image start address
>>
>> + randomize_vmalloc= [KNL] Randomize vmalloc() allocations. With 1,
>> + the entire vmalloc() area is used randomly to
>> + make the allocations less predictable and
>> + harder for attackers to guess. Module and
>> + BPF code locations also get randomized
>> + (though within their dedicated and rather
>> + small areas), and if CONFIG_VMAP_STACK is
>> + enabled, so do kernel thread stack locations.
>> +
>> + On 32-bit systems this may cause problems due
>> + to increased VM fragmentation if the address
>> + space gets crowded.
>> +
>> + On all systems, it will reduce performance and
>> + increase memory and cache usage due to less
>> + efficient use of page tables and the
>> + inability to merge adjacent VMAs with
>> + compatible attributes. On x86_64 with 5-level
>> + page tables, in the worst case, additional
>> + page table entries of up to 4 pages are
>> + created for each mapping, so with small
>> + mappings there is a considerable penalty.
>> +
>> random.trust_cpu={on,off}
>> [KNL] Enable or disable trusting the use of the
>> CPU's random number generator (if available) to
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 4d88fe5a277a..1e8e0ee1925f 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -34,6 +34,7 @@
>> #include <linux/bitops.h>
>> #include <linux/rbtree_augmented.h>
>> #include <linux/overflow.h>
>> +#include <linux/random.h>
>>
>> #include <linux/uaccess.h>
>> #include <asm/tlbflush.h>
>> @@ -1089,6 +1090,17 @@ adjust_va_to_fit_type(struct vmap_area *va,
>> return 0;
>> }
>>
>> +static int randomize_vmalloc = 0;
>
> This should be __ro_after_init, and even better, it should be a static
> branch so there's no performance hit at all for the "disabled" case.
> I recommend:
>
> static DEFINE_STATIC_KEY_FALSE_RO(randomize_vmalloc);
>
> static int __init set_randomize_vmalloc(char *str)
> {
> int ret;
> bool bool_result;
>
> ret = kstrtobool(str, &bool_result);
> if (ret)
> return ret;
>
> if (bool_result)
> static_branch_enable(&randomize_vmalloc);
> else
> static_branch_disable(&randomize_vmalloc);
> return 1;
> }
> __setup("randomize_vmalloc=", set_randomize_vmalloc);
>
>
>> +
>> /*
>> * Returns a start address of the newly allocated area, if success.
>> * Otherwise a vend is returned that indicates failure.
>> @@ -1162,7 +1174,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>> int node, gfp_t gfp_mask)
>> {
>> struct vmap_area *va, *pva;
>> - unsigned long addr;
>> + unsigned long addr, voffset;
>> int purged = 0;
>> int ret;
>>
>> @@ -1217,11 +1229,24 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>> if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva))
>> kmem_cache_free(vmap_area_cachep, pva);
>>
>> + /* Randomize allocation */
>> + if (randomize_vmalloc) {
>
> if (static_branch_unlikely(&randomize_vmalloc)) {
> ...
Thanks, I'll post an updated version with static_branch.
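Roughly like this, I suppose (untested sketch only; the helper name is
mine, and the bounds handling just mirrors the current patch):

/* Sketch: pick a page-aligned random offset into [vstart, vend). */
static unsigned long random_vmalloc_offset(unsigned long size,
					   unsigned long vstart,
					   unsigned long vend)
{
	unsigned long voffset;

	if (!static_branch_unlikely(&randomize_vmalloc))
		return 0;

	/*
	 * Mask with the next power of two covering the range so the
	 * random bits can reach the whole area, then page-align.
	 */
	voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
	voffset = PAGE_ALIGN(voffset);

	/* Clamp so that at least 'size' bytes remain above the offset. */
	if (voffset + size > vend - vstart)
		voffset = vend - vstart - size;

	return voffset;
}

Then alloc_vmap_area() would just use this for voffset and keep the
retry from the other end as in the current patch.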
>
>> + voffset = get_random_long() & (roundup_pow_of_two(vend - vstart) - 1);
>> + voffset = PAGE_ALIGN(voffset);
>> + if (voffset + size > vend - vstart)
>> + voffset = vend - vstart - size;
>> + } else
>> + voffset = 0;
>> +
>> /*
>> * If an allocation fails, the "vend" address is
>> * returned. Therefore trigger the overflow path.
>> */
>> - addr = __alloc_vmap_area(size, align, vstart, vend);
>> + addr = __alloc_vmap_area(size, align, vstart + voffset, vend);
>> +
>> + if (unlikely(addr == vend) && voffset)
>> + /* Retry randomization from other end */
>> + addr = __alloc_vmap_area(size, align, vstart, vstart + voffset + size);
>> spin_unlock(&free_vmap_area_lock);
>>
>> if (unlikely(addr == vend))
>> @@ -3256,7 +3281,12 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>> start = offsets[area];
>> end = start + sizes[area];
>>
>> - va = pvm_find_va_enclose_addr(vmalloc_end);
>> + if (randomize_vmalloc)
>> + va = pvm_find_va_enclose_addr(vmalloc_start +
>> + (get_random_long() &
>> + (roundup_pow_of_two(vmalloc_end - vmalloc_start) - 1)));
>> + else
>> + va = pvm_find_va_enclose_addr(vmalloc_end);
>> base = pvm_determine_end_from_reverse(&va, align) - end;
>>
>> while (true) {
>> --
>> 2.30.0
>>
>
> But otherwise, yes please. It's a simple change that makes per-boot
> order of vmalloc allocations unpredictable. I'm for it! :)
Great, and thanks for the review!
-Topi