Message-ID: <b7672a88-459e-53f0-5f99-f4b5ccb0a2dd@redhat.com>
Date: Tue, 15 Jan 2019 17:28:43 -0800
From: Laura Abbott <labbott@...hat.com>
To: Khalid Aziz <khalid.aziz@...cle.com>, juergh@...il.com,
tycho@...ho.ws, jsteckli@...zon.de, ak@...ux.intel.com,
torvalds@...ux-foundation.org, liran.alon@...cle.com,
keescook@...gle.com, konrad.wilk@...cle.com
Cc: deepa.srinivasan@...cle.com, chris.hyser@...cle.com,
tyhicks@...onical.com, dwmw@...zon.co.uk,
andrew.cooper3@...rix.com, jcm@...hat.com,
boris.ostrovsky@...cle.com, kanth.ghatraju@...cle.com,
joao.m.martins@...cle.com, jmattson@...gle.com,
pradeep.vincent@...cle.com, john.haxby@...cle.com,
tglx@...utronix.de, kirill.shutemov@...ux.intel.com, hch@....de,
steven.sistare@...cle.com, kernel-hardening@...ts.openwall.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame
Ownership
On 1/10/19 1:09 PM, Khalid Aziz wrote:
> I am continuing to build on the work Juerg, Tycho and Julian have done
> on XPFO. After the last round of updates, we were seeing very
> significant performance penalties when stale TLB entries were flushed
> actively after an XPFO TLB update. Benchmark for measuring performance
> is kernel build using parallel make. To get full protection from
> ret2dir attacks, we must flush stale TLB entries. Performance
> penalty from flushing stale TLB entries goes up as the number of
> cores goes up. On a desktop class machine with only 4 cores,
> enabling TLB flush for stale entries causes system time for "make
> -j4" to go up by a factor of 2.614x but on a larger machine with 96
> cores, system time with "make -j60" goes up by a factor of 26.366x!
> I have been working on reducing this performance penalty.
>
> I implemented a solution to reduce the performance penalty, and
> it has had a large impact. When XPFO code flushes stale TLB entries,
> it does so for all CPUs on the system which may include CPUs that
> may not have any matching TLB entries or may never be scheduled to
> run the userspace task causing TLB flush. Problem is made worse by
> the fact that if number of entries being flushed exceeds
> tlb_single_page_flush_ceiling, it results in a full TLB flush on
> every CPU. A rogue process can launch a ret2dir attack only from a
> CPU that has dual mapping for its pages in physmap in its TLB. We
> can hence defer TLB flush on a CPU until a process that would have
> caused a TLB flush is scheduled on that CPU. I have added a cpumask
> to task_struct which is then used to post pending TLB flush on CPUs
> other than the one a process is running on. This cpumask is checked
> when a process migrates to a new CPU and TLB is flushed at that
> time. I measured system time for parallel make with unmodified 4.20
> kernel, 4.20 with XPFO patches before this optimization and then
> again after applying this optimization. Here are the results:
>
> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
> make -j60 all
>
> 4.20                          915.183s
> 4.20+XPFO                   24129.354s   26.366x
> 4.20+XPFO+Deferred flush     1216.987s    1.330x
>
>
> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
> make -j4 all
>
> 4.20                          607.671s
> 4.20+XPFO                    1588.646s    2.614x
> 4.20+XPFO+Deferred flush      794.473s    1.307x
>
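
The slowdown factors in the tables above are each kernel's system time
divided by the unmodified 4.20 system time on the same machine. A short
Python check of that arithmetic, using only the times quoted above (the
dictionary layout and the `slowdown` helper are illustrative and not part
of the patch set):

```python
# System times in seconds, copied from the two tables in the cover letter.
times = {
    "96-core, make -j60": {
        "4.20": 915.183,
        "4.20+XPFO": 24129.354,
        "4.20+XPFO+Deferred flush": 1216.987,
    },
    "4-core, make -j4": {
        "4.20": 607.671,
        "4.20+XPFO": 1588.646,
        "4.20+XPFO+Deferred flush": 794.473,
    },
}


def slowdown(baseline, measured):
    """Return how many times slower `measured` is than `baseline`."""
    return measured / baseline


for machine, runs in times.items():
    base = runs["4.20"]
    for kernel, secs in runs.items():
        factor = slowdown(base, secs)
        print(f"{machine:20s} {kernel:26s} {secs:10.3f}s {factor:7.3f}x")
```

The factors come out within rounding of the ones quoted in the tables.
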
> 30+% overhead is still very high and there is room for improvement.
> Dave Hansen had suggested batch updating TLB entries and Tycho had
> created an initial implementation but I have not been able to get
> that to work correctly. I am still working on it and I suspect we
> will see a noticeable improvement in performance with that. In the
> code I added, I post a pending full TLB flush to all other CPUs even
> when number of TLB entries being flushed on current CPU does not
> exceed tlb_single_page_flush_ceiling. There has to be a better way
> to do this. I just haven't found an efficient way to implement
> delayed limited TLB flush on other CPUs.
>
> I am not entirely sure if switch_mm_irqs_off() is indeed the right
> place to perform the pending TLB flush for a CPU. Any feedback on
> that will be very helpful. Delaying full TLB flushes on other CPUs
> seems to help tremendously, so if there is a better way to implement
> the same thing than what I have done in patch 16, I am open to
> ideas.
>
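
To make the deferred-flush idea described above concrete, here is a toy
user-space model in Python. It is only a sketch under assumptions: `Task`,
`pending_flush`, `xpfo_update`, and `switch_to` are hypothetical names, a
plain integer stands in for the cpumask in task_struct, and each "flush"
is just a counter increment. It is not the code from patch 16.

```python
class Task:
    """Toy stand-in for a task with a per-task pending-flush cpumask."""

    def __init__(self, ncpus, cpu=0):
        self.ncpus = ncpus
        self.cpu = cpu           # CPU the task is currently running on
        self.pending_flush = 0   # bit n set => CPU n still owes a TLB flush


def xpfo_update(task, flushes):
    """Model an XPFO mapping change made while `task` is running."""
    # Flush the local TLB right away.
    flushes[task.cpu] += 1
    # Post a pending (full) flush to every other CPU instead of flushing them.
    all_cpus = (1 << task.ncpus) - 1
    task.pending_flush |= all_cpus
    task.pending_flush &= ~(1 << task.cpu)


def switch_to(task, new_cpu, flushes):
    """Model the task being scheduled onto `new_cpu`."""
    task.cpu = new_cpu
    bit = 1 << new_cpu
    if task.pending_flush & bit:
        # The deferred flush is paid only when the task actually lands here.
        flushes[new_cpu] += 1
        task.pending_flush &= ~bit


# Example run: 4 CPUs, three updates on CPU 0, a migration, one more update.
task = Task(ncpus=4, cpu=0)
flushes = [0, 0, 0, 0]
for _ in range(3):
    xpfo_update(task, flushes)
switch_to(task, 2, flushes)
xpfo_update(task, flushes)

print("flushes per CPU:", flushes)
print("total flushes:", sum(flushes))
print("still pending on CPUs:", [c for c in range(4) if task.pending_flush >> c & 1])
```

In this toy run the four updates cost 5 flushes in total, while flushing
every CPU on every update would cost 16. CPUs the task never runs on keep
a pending bit and never pay for a flush.
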
> Performance with this patch set is good enough to use these as
> starting point for further refinement before we merge it into main
> kernel, hence RFC.
>
> Since not flushing stale TLB entries creates a false sense of
> security, I would recommend making TLB flush mandatory and eliminate
> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
> flushing by default for XPFO page table modifications").
>
> What remains to be done beyond this patch series:
>
> 1. Performance improvements
> 2. Remove xpfotlbflush parameter
> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
> from Juerg. I dropped it for now since swiotlb code for ARM has
> changed a lot in 4.20.
> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
> CPUs" to other architectures besides x86.
>
>
> ---------------------------------------------------------
>
> Juerg Haefliger (5):
> mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
> swiotlb: Map the buffer if it was unmapped by XPFO
> arm64/mm: Add support for XPFO
> arm64/mm, xpfo: temporarily map dcache regions
> lkdtm: Add test for XPFO
>
> Julian Stecklina (4):
> mm, x86: omit TLB flushing by default for XPFO page table
> modifications
> xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
> xpfo, mm: optimize spinlock usage in xpfo_kunmap
> EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap
>
> Khalid Aziz (2):
> xpfo, mm: Fix hang when booting with "xpfotlbflush"
> xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
>
> Tycho Andersen (5):
> mm: add MAP_HUGETLB support to vm_mmap
> x86: always set IF before oopsing from page fault
> xpfo: add primitives for mapping underlying memory
> arm64/mm: disable section/contiguous mappings if XPFO is enabled
> mm: add a user_virt_to_phys symbol
>
> .../admin-guide/kernel-parameters.txt | 2 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/Makefile | 2 +
> arch/arm64/mm/flush.c | 7 +
> arch/arm64/mm/mmu.c | 2 +-
> arch/arm64/mm/xpfo.c | 58 ++++
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/pgtable.h | 26 ++
> arch/x86/include/asm/tlbflush.h | 1 +
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/fault.c | 10 +
> arch/x86/mm/pageattr.c | 23 +-
> arch/x86/mm/tlb.c | 27 ++
> arch/x86/mm/xpfo.c | 171 ++++++++++++
> drivers/misc/lkdtm/Makefile | 1 +
> drivers/misc/lkdtm/core.c | 3 +
> drivers/misc/lkdtm/lkdtm.h | 5 +
> drivers/misc/lkdtm/xpfo.c | 194 ++++++++++++++
> include/linux/highmem.h | 15 +-
> include/linux/mm.h | 2 +
> include/linux/mm_types.h | 8 +
> include/linux/page-flags.h | 13 +
> include/linux/sched.h | 9 +
> include/linux/xpfo.h | 90 +++++++
> include/trace/events/mmflags.h | 10 +-
> kernel/dma/swiotlb.c | 3 +-
> mm/Makefile | 1 +
> mm/mmap.c | 19 +-
> mm/page_alloc.c | 3 +
> mm/util.c | 32 +++
> mm/xpfo.c | 247 ++++++++++++++++++
> security/Kconfig | 29 ++
> 32 files changed, 974 insertions(+), 43 deletions(-)
> create mode 100644 arch/arm64/mm/xpfo.c
> create mode 100644 arch/x86/mm/xpfo.c
> create mode 100644 drivers/misc/lkdtm/xpfo.c
> create mode 100644 include/linux/xpfo.h
> create mode 100644 mm/xpfo.c
>

So this seems to blow up immediately on my arm64 box with a config
based on Fedora:

[ 11.008243] Unable to handle kernel paging request at virtual address ffff8003f8602f9b
[ 11.016133] Mem abort info:
[ 11.018926] ESR = 0x96000007
[ 11.021967] Exception class = DABT (current EL), IL = 32 bits
[ 11.027858] SET = 0, FnV = 0
[ 11.030904] EA = 0, S1PTW = 0
[ 11.034030] Data abort info:
[ 11.036896] ISV = 0, ISS = 0x00000007
[ 11.040718] CM = 0, WnR = 0
[ 11.043672] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
[ 11.050523] [ffff8003f8602f9b] pgd=00000043ffff7803, pud=00000043fe113803, pmd=00000043fc376803, pte=00e80043f8602f13
[ 11.061094] Internal error: Oops: 96000007 [#3] SMP
[ 11.065948] Modules linked in: xfs libcrc32c sdhci_of_arasan sdhci_pltfm sdhci i2c_xgene_slimpro cqhci gpio_dwapb xhci_plat_hcd gpio_xgene_sb gpio_keys
[ 11.079454] CPU: 3 PID: 577 Comm: systemd-getty-g Tainted: G D 4.20.0-xpfo+ #9
[ 11.087936] Hardware name: www.apm.com American Megatrends/American Megatrends, BIOS 3.07.06 20/03/2015
[ 11.097285] pstate: 00400005 (nzcv daif +PAN -UAO)
[ 11.102057] pc : __memcpy+0x20/0x180
[ 11.105616] lr : __access_remote_vm+0x7c/0x1f0
[ 11.110036] sp : ffff000011cb3c20
[ 11.113333] x29: ffff000011cb3c20 x28: ffff8003f8602000
[ 11.118619] x27: 0000000000000f9b x26: 0000000000001000
[ 11.123904] x25: 000083ffffffffff x24: cccccccccccccccd
[ 11.129189] x23: ffff8003d7c53000 x22: 0000000000000044
[ 11.134474] x21: 0000fffff0591f9b x20: 0000000000000044
[ 11.139759] x19: 0000000000000044 x18: 0000000000000000
[ 11.145044] x17: 0000000000000002 x16: 0000000000000000
[ 11.150329] x15: 0000000000000000 x14: 0000000000000000
[ 11.155614] x13: 0000000000000000 x12: 0000000000000000
[ 11.160899] x11: 0000000000000000 x10: 0000000000000000
[ 11.166184] x9 : 0000000000000000 x8 : 0000000000000000
[ 11.171469] x7 : 0000000000000000 x6 : ffff8003d7c53000
[ 11.176754] x5 : 00e00043f8602fd3 x4 : 0000000000000005
[ 11.182038] x3 : 00000003f8602000 x2 : 000000000000003f
[ 11.187323] x1 : ffff8003f8602f9b x0 : ffff8003d7c53000
[ 11.192609] Process systemd-getty-g (pid: 577, stack limit = 0x(____ptrval____))
[ 11.199967] Call trace:
[ 11.202400] __memcpy+0x20/0x180
[ 11.205611] access_remote_vm+0x4c/0x60
[ 11.209428] environ_read+0x12c/0x260
[ 11.213071] __vfs_read+0x48/0x158
[ 11.216454] vfs_read+0x94/0x150
[ 11.219665] ksys_read+0x54/0xb0
[ 11.222875] __arm64_sys_read+0x24/0x30
[ 11.226691] el0_svc_handler+0x94/0x110
[ 11.230508] el0_svc+0x8/0xc
[ 11.233375] Code: f2400c84 540001c0 cb040042 36000064 (38401423)
[ 11.239439] ---[ end trace 4132d3416fb70591 ]---
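
For anyone reading along, the ESR value in the trace can be decoded by
hand. The Python sketch below splits 0x96000007 into the fields the kernel
printed, using the Armv8-A ESR_ELx layout for data aborts. The helper names
`decode_esr` and `describe_dfsc` are made up for this example, and only a
few fault status codes are named.

```python
# Exception classes relevant to data aborts (ESR_ELx.EC, bits [31:26]).
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# DFSC[5:2] selects the fault kind; DFSC[1:0] is the page table level.
FAULT_KINDS = {
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def describe_dfsc(dfsc):
    """Name the data fault status code for the common page table faults."""
    kind = FAULT_KINDS.get(dfsc >> 2)
    if kind is None:
        return f"other fault status code 0x{dfsc:02x}"
    return f"{kind}, level {dfsc & 0x3}"


def decode_esr(esr):
    """Split an ESR_ELx value for a data abort into its fields."""
    iss = esr & 0x1FFFFFF            # ISS, bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,    # exception class
        "IL": (esr >> 25) & 0x1,     # 1 => 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,
        "SET": (iss >> 11) & 0x3,
        "FnV": (iss >> 10) & 0x1,
        "EA": (iss >> 9) & 0x1,
        "CM": (iss >> 8) & 0x1,
        "S1PTW": (iss >> 7) & 0x1,
        "WnR": (iss >> 6) & 0x1,     # 0 => the faulting access was a read
        "DFSC": iss & 0x3F,
    }


fields = decode_esr(0x96000007)
for name, value in fields.items():
    print(f"{name:6s} = 0x{value:x}")
print(EC_NAMES.get(fields["EC"], "unknown exception class"))
print(describe_dfsc(fields["DFSC"]))
```

The decoded fields agree with the "Mem abort info" and "Data abort info"
lines in the log: a data abort at the current EL, on a read (WnR = 0),
reported as a level 3 translation fault.
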

I'll see if I get some time tomorrow to dig into this unless
someone spots a problem sooner.

Thanks,
Laura