[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1530923744-25687-1-git-send-email-rick.p.edgecombe@intel.com>
Date: Fri, 6 Jul 2018 17:35:41 -0700
From: Rick Edgecombe <rick.p.edgecombe@...el.com>
To: tglx@...utronix.de, mingo@...hat.com, hpa@...or.com,
x86@...nel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kernel-hardening@...ts.openwall.com
Cc: kristen@...ux.intel.com, dave.hansen@...el.com,
arjan@...ux.intel.com, Rick Edgecombe <rick.p.edgecombe@...el.com>
Subject: [PATCH RFC V2 0/3] KASLR feature to randomize each loadable module
Hi,
This is v2 of the "KASLR feature to randomize each loadable module" patchset.
The purpose is to increase the randomization and makes the modules randomized
in relation to each other instead of just the base, so that if one module leaks,
the location of the others can't be inferred.
This code needs some refactoring and simplification, but I was hoping to get
some feedback on the benchmarks and provide an update.
V2 brings the TLB flushes down to close to the existing algorithm and increases
the modules that get high randomness based on the concerns raised by Jann Horn
about the BPF JIT use case. It also tries to address Kees Cook's comments about
possible minimal boot time regression by measuring the average allocation time
to be below the existing allocator. It also addresses Mathew Wilcox's comment
on the GFP_NOWARN not being needed. There is also some data on PTE memory use
which is higher than the original algorithm, as suggested by Jann.
This is off of 4.18-RC3.
Changes since v1:
- New implementation of __vmalloc_node_try_addr based on the
__vmalloc_node_range implementation, that only flushes TLB when needed.
- Modified module loading algorithm to try to reduce the TLB flushes further.
- Increase "random area" tries in order to increase the number of modules that
can get high randomness.
- Increase "random area" size to 2/3 of module area in order to increase the
number of modules that can get high randomness.
- Fix for 0day failures on other architectures.
- Fix for wrong debugfs permissions. (thanks to Jann)
- Spelling fix .(thanks to Jann)
- Data on module_alloc performance and TLB flushes. (brought up by Kees and
Jann)
- Data on memory usage. (suggested by Jann)
Todo:
- Refactor __vmalloc_node_try_addr to be smaller and share more code with the
normal vmalloc path, and misc cleanup
- More real world testing other than the synthetic micro benchmarks described
below. BPF kselftest brought up by Daniel Borkman.
New Algorithm
=============
In addition to __vmalloc_node_try_addr only purging the lazy free areas when it
needs to, it also now supports a mode where it will fail to allocate instead of
doing any purge. In this case it reports when the allocation would have
succeeded if it was allowed to purge. It returns this information via an
ERR_PTR.
The logic for the selection of a location in the random ara is changed as well.
The number of tries is increased from 10 to 10000, which actually still gives
good performance. At a high level, the vmalloc algorithm quickly traverses an
RB-Tree to find a start position and then more slowly traverses a link list to
look for an open spot. Since this new algorithm randomly picks a spot, it
mostly just needs to traverse the RB-Tree, and as a result the "tries" are fast
enough that the number can be high and still be faster than traversing the
linked list. In the data below you can see that the random algorithm is on
average actually faster than the existing one.
The increase in number of tries is also to support the BPF JIT use case, by
increasing the number of modules that can get high randomness.
Since the __vmalloc_node_try_addr now can optionally fail instead of purging,
for the first half of the tries, the algorithm tries to find a spot where it
doesn't need to do a purge. For the second half it allows purges. The 50:50
split is to try to be a happy medium between reducing TLB flushes and reducing
average allocation time.
Randomness
==========
In the last patchset the size of the random area used in the calculations was
incorrect. The entropy should have been closer to 17 bits, not 18, which is why
its lower here even though the number of random area tries is cranked up. 17.3
bits is likely maintained to much higher number of allocations than shown here
in reality, since it seems that the BPF JIT allocations are usually smaller than
modules. If you assume the majority of allocations are 1 page, 17 bits is
maintained to 8000 modules.
Modules Min Info
1000 17.3
2000 17.3
3000 17.3
4000 17.3
5000 17.08
6000 16.30
7000 15.67
8000 14.92
Allocation Time
===============
The average module_alloc time over N modules was actually always faster with the
random algorithm:
Modules Existing(ns) New(ns)
1000 4,761 1,134
2000 9,730 1,149
3000 15,572 1,396
4000 20,723 2,161
5000 26,206 4,349
6000 31,374 8,615
7000 36,123 14,009
8000 40,174 23,396
Average Nth Allocation time was usually better than the existing algorithm,
until the modules get very high.
Module Original(ns) New(ns)
1000 8,800 1,288
2000 20,949 1,477
3000 31,980 2,583
4000 44,539 9,250
5000 55,212 25,986
6000 65,968 39,540
7000 74,883 57,798
8000 85,392 97,319
TLB Flushes Per Allocation
==========================
The new algorithm flushes the TLB a little bit more than the existing algorithm.
For the sake of comparison to the old simpler __vmalloc_node_try_addr
implementation, this is about a 238x improvement in some cases.
Modules Existing New
1000 0.0014 0.001407
2000 0.001746 0.0018155
3000 0.0018186667 0.0021186667
4000 0.00187525 0.00249675
5000 0.001897 0.0025334
6000 0.0019066667 0.0025228333
7000 0.001925 0.0025315714
8000 0.0019325 0.002553
Memory Usage
============
A downside is that since the random area is fragmented, it uses extra PTE pages.
It first approaches 1.3 MB of PTEs as the random area fills. After that it
increases more slowly. I am not sure this can be improved without reducing
randomness.
Modules Existing(pages) New(pages)
100 6 159
200 11 240
300 15 285
400 20 307
500 23 315
1000 41 330
2000 80 335
3000 118 338
Module Capacity
===============
The estimate of module capacity also now goes back up to ~17000, so the real
value should be close to the existing algorithm.
Rick Edgecombe (3):
vmalloc: Add __vmalloc_node_try_addr function
x86/modules: Increase randomization for modules
vmalloc: Add debugfs modfraginfo
arch/x86/include/asm/pgtable_64_types.h | 1 +
arch/x86/kernel/module.c | 103 +++++++++++-
include/linux/vmalloc.h | 3 +
mm/vmalloc.c | 275 +++++++++++++++++++++++++++++++-
4 files changed, 375 insertions(+), 7 deletions(-)
--
2.7.4
Powered by blists - more mailing lists