Message-ID: <20210914151724.GA34977@e124191.cambridge.arm.com>
Date: Tue, 14 Sep 2021 16:17:24 +0100
From: Joey Gouly <joey.gouly@....com>
To: Mark Rutland <mark.rutland@....com>
Cc: Zhen Lei <thunder.leizhen@...wei.com>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
linux-kernel@...r.kernel.org, nd@....com
Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls
Hi,
On Tue, Sep 14, 2021 at 10:55:16AM +0100, Mark Rutland wrote:
> Hi,
>
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
>
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
>
> 0000000000001010 <el0t_64_sync_handler>:
> 1010: d503245f bti c
> 1014: d503233f paciasp
> 1018: a9bf7bfd stp x29, x30, [sp, #-16]!
> 101c: 910003fd mov x29, sp
> 1020: d5385201 mrs x1, esr_el1
> 1024: 90000002 adrp x2, 0 <el0t_64_sync_handlers>
> 1028: 531a7c23 lsr w3, w1, #26
> 102c: 91000042 add x2, x2, #:lo12:<el0t_64_sync_handlers>
> 1030: f8637842 ldr x2, [x2, x3, lsl #3]
> 1034: d63f0040 blr x2
> 1038: a8c17bfd ldp x29, x30, [sp], #16
> 103c: d50323bf autiasp
> 1040: d65f03c0 ret
>
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...
I tested Mark's branch, which implements this (found at
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/switch-table).
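For anyone who has not looked at the branch, the idea of a handler table
indexed by the EC can be sketched roughly as below. This is a user-space
illustration only, not the code from Mark's branch: the handler names,
their return values and dispatch_sync() are made up for the example. The
EC field position (ESR bits [31:26]) and the three EC values are the
architectural ones. Unlike the disassembly quoted above, the sketch falls
back to a default handler for empty slots, which costs an extra check that
a fully populated table would avoid.

```c
#include <stddef.h>
#include <stdint.h>

/* ESR_ELx.EC lives in bits [31:26], so there are 64 possible classes. */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK  0x3fu
#define EC_SVC64     0x15u  /* SVC from AArch64 */
#define EC_IABT_LOW  0x20u  /* instruction abort from a lower EL */
#define EC_DABT_LOW  0x24u  /* data abort from a lower EL */

typedef int (*sync_handler_t)(uint64_t esr);

/* Hypothetical handlers; each returns a distinct code for illustration. */
int handle_unknown(uint64_t esr) { (void)esr; return 0; }
int handle_svc(uint64_t esr)     { (void)esr; return 1; }
int handle_iabt(uint64_t esr)    { (void)esr; return 2; }
int handle_dabt(uint64_t esr)    { (void)esr; return 3; }

/* One slot per EC; slots not listed here are NULL. */
const sync_handler_t sync_handlers[64] = {
	[EC_SVC64]    = handle_svc,
	[EC_IABT_LOW] = handle_iabt,
	[EC_DABT_LOW] = handle_dabt,
};

/* Extract the EC and make a single indirect call through the table. */
int dispatch_sync(uint64_t esr)
{
	unsigned int ec = (unsigned int)(esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
	sync_handler_t handler = sync_handlers[ec];

	return handler ? handler(esr) : handle_unknown(esr);
}
```

Calling dispatch_sync((uint64_t)EC_SVC64 << ESR_EC_SHIFT) goes through
handle_svc(); any EC without an entry ends up in handle_unknown().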
I also took lmbench from https://github.com/intel/lmbench.git and built
`lat_syscall` with:
gcc lat_syscall.c lib_*.c -l m -o lat_syscall -static
These are the results I got from benchmarking on my MacBook Air M1, with
the following command:
./lat_syscall null &> /dev/null ; uname -a ; for i in 0 1 2 3 4 ; do ./lat_syscall null ; done
The kernel was based on arm64_defconfig that was then stripped of as much as possible.
GCC 11.1.0 from kernel.org crosstool page.
Clang built from git b041b613e6fff713fc9ad6dbc73024286fb2fc93.
gcc:
  master:       0.14300
  switch-table: 0.14350
  likely:       0.13962

clang:
  master:       0.14354
  switch-table: 0.14642
  likely:       0.14256
The generated code looks similar to what Leizhen has posted, so I didn't
post it again.
So it seems the table approach actually performs worse in my testing,
and Leizhen's approach is slightly better than master (d0ee23f9d78be5531c4b055ea424ed0b489dfe9b).
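To put the numbers in relative terms, the change against master can be
computed with something like the sketch below. The figures are the ones
listed above; pct_change() and print_changes() are helper names invented
for this example, and a negative percentage means lower latency than
master.

```c
#include <stdio.h>

/* Percentage change of `value` relative to `baseline`. */
double pct_change(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}

struct result {
	const char *compiler;
	double master;
	double switch_table;
	double likely;
};

/* lat_syscall null results, copied from the lists above. */
const struct result results[] = {
	{ "gcc",   0.14300, 0.14350, 0.13962 },
	{ "clang", 0.14354, 0.14642, 0.14256 },
};

/* Print each variant's change against master for both compilers. */
void print_changes(void)
{
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++) {
		const struct result *r = &results[i];

		printf("%s: switch-table %+.2f%%, likely %+.2f%%\n",
		       r->compiler,
		       pct_change(r->master, r->switch_table),
		       pct_change(r->master, r->likely));
	}
}
```

With these inputs the table approach comes out roughly 0.3% (gcc) to 2%
(clang) slower than master, and the likely() variant roughly 0.7% (clang)
to 2.4% (gcc) faster.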
Thanks,
Joey