lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <ed8cea4c-a261-60ca-f4e1-333ec73cca8f@os.amperecomputing.com>
Date:   Mon, 28 Aug 2023 21:44:50 -0700
From:   Steve Clevenger <scclevenger@...amperecomputing.com>
To:     coresight@...ts.linaro.org, linux-arm-kernel@...ts.infradead.org,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Cc:     yang@...amperecomputing.com
Subject: About ARM64 Kernel Instruction Trace


Ampere has been using the following command sequence to generate a 'perf
record' kernel instruction trace on an AmpereOne AARCH64 based systems
using a 6.3.0 Fedora distribution with CoreSight related patches.


[root@...01sys-b212 kernel]# uname -a
Linux sut01sys-b212.scc-lab.amperecomputing.com 6.3.0-coresight-enabled+
#9 SMP PREEMPT_DYNAMIC Fri Jun 30 12:54:14 PDT 2023 aarch64 aarch64
aarch64 GNU/Linux
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# timeout 45s perf record --kcore -o kcore -e
cs_etm/@..._etr63/k --per-thread taskset --cpu-list 15 dd if=/dev/zero
of=/dev/null
[ perf record: Woken up 41 times to write data ]
[ perf record: Captured and wrote 144.056 MB kcore ]
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# ls -l ~/linux/vmlinux
-rwxr-xr-x. 1 root root 390426912 Jun 30 12:54 /home/stevec/linux/vmlinux*
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# timeout 45s perf script --input
./kcore/data -s ../../scripts/arm-cs-trace-disasm.py -F
cpu,event,ip,addr,sym – -d objdump -k ~/linux/vmlinux > ./perf.itrace
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# ls -l perf.itrace
-rw-r--r--. 1 root root  485834744 Jul 17 14:18 perf.itrace


The executive summary is the code sections in vmlinux do not match the
actual code executing on the target due to self-modifying code
(alternate code sequences) in the kernel image. These sequences are
typically applied over nop place-holder instructions. This was verified
by comparing a location in the ./drivers/char/mem.c read_zero objdump
instruction stream to the memory resident instructions captured with the
TRACE32 ETM instruction trace feature. This particular instruction trace
example shows a disconnect. It displays a ‘b    ffff800008ab8a70
<read_zero+0x168>’ instruction at address 0xffff800008ab89e8 which did
not branch. The actual instruction present at 0xffff800008ab89e8 is a ‘nop’.
        .
        .
        ffff800008ab89a8:       1400000d        b       ffff800008ab89dc
<read_zero+0xd4>
              dd  8774/8774  [0015]         0.000000000  read_zero+0xa0
        ffff800008ab89dc <read_zero+0xd4>:
        ffff800008ab89dc:       9248f840        and     x0, x2,
#0xff7fffffffffffff
        ffff800008ab89e0:       aa1403e1        mov     x1, x20
        ffff800008ab89e4:       9418b6fb        bl      ffff8000090e65d0
<__arch_clear_user>
              dd  8774/8774  [0015]         0.000000000  read_zero+0xdc
ffff8000090e65d0 <__arch_clear_user>:
        ffff8000090e65d0:       d503245f        bti     c
        ffff8000090e65d4:       8b010002        add     x2, x0, x1
        ffff8000090e65d8:       f1002021        subs    x1, x1, #0x8
        ffff8000090e65dc:       54000104        b.mi    ffff8000090e65fc
<__arch_clear_user+0x2c>  // b.first
        ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
        ffff8000090e65e4:       91002000        add     x0, x0, #0x8
        ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
        ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
<__arch_clear_user+0x10>  // b.pmore
              dd  8774/8774  [0015]         0.000000000
__arch_clear_user+0x1c
...vec/linux/arch/arm64/lib/clear_user.S   31         b.hi    1b
        ffff8000090e65e0 <__arch_clear_user+0x10>:
        ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
        ffff8000090e65e4:       91002000        add     x0, x0, #0x8
        ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
        ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
<__arch_clear_user+0x10>  // b.pmore
        .
        .
        ffff8000090e65f0:       f81f885f        sttr    xzr, [x2, #-8]
        ffff8000090e65f4:       d2800000        mov     x0, #0x0
               // #0
        ffff8000090e65f8:       d65f03c0        ret
              dd  8774/8774  [0015]         0.000000000
__arch_clear_user+0x28
...vec/linux/arch/arm64/lib/clear_user.S   34         ret
        ffff800008ab89e8 <read_zero+0xe0>:
        ffff800008ab89e8:       14000022        b       ffff800008ab8a70
<read_zero+0x168>
        ffff800008ab89ec:       d503201f        nop
             dd  8774/8774  [0015]         0.000000000  read_zero+0xe4
                        /home/stevec/linux/drivers/char/mem.c     521
             left = clear_user(buf + cleared, chunk);
        ffff800008ab8a18 <read_zero+0x110>:
        ffff800008ab8a18:       8b14035a        add     x26, x26, x20
        ffff800008ab8a1c:       b5000360        cbnz    x0,
ffff800008ab8a88 <read_zero+0x180>
        ffff800008ab8a20:       f9400320        ldr     x0, [x25]
        ffff800008ab8a24:       cb140273        sub     x19, x19, x20
        .
        .

Ostensibly, the ‘perf record –kcore’ option produces a representative
image of the kernel. But this option does not produce suitable output to
generate ARM64 kernel instruction trace. perf doesn’t disassemble ARM64,
so the arm-cs-trace-disasm.py python script is used with the native
objdump utility to provide ARM64 disassembly from CoreSight trace
capture. objdump itself requires an ELF/DWARF file image + symbols to
generate the symbolic (+ line information for mixed source) disassembly.
The linux vmlinux image + symbols file is typically used for this
purpose. The kallsyms file is not formatted for objdump use. As an
experiment, I patched the executable code sections in a local copy of
vmlinux with the corresponding executable code segments extracted from
the kcore image using an Ampere internal ELF patch utility.

This patch utility leverages the (MIT Licensed) ELFIO open source
library API. These were the commands.


[root@...01sys-b212 kernel]# timeout 30s perf record --kcore -o kcore -e
cs_etm/@..._etr63/k --per-thread taskset --cpu-list 15 dd if=/dev/zero
of=/dev/null
[ perf record: Woken up 25 times to write data ]
[ perf record: Captured and wrote 88.053 MB kcore ]
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# ls -l ~/linux/vmlinux
-rwxr-xr-x. 1 root root 390426912 Jun 30 12:54 /home/stevec/linux/vmlinux*
[root@...01sys-b212 kernel]# cp ~/linux/vmlinux .
[root@...01sys-b212 kernel]# patch-elf --help

patch-elf overlays the kernel image from a local copy of
'/proc/kcore' to the  corresponding (by address) ELF sections
in a local copy of the vmlinux ELF file.
A local '/proc/kcore' is created by:
    'perf report --kcore -e cs_etm/@..._etr1/k ...'
The local (patched) vmlinux copy is used by:
    'perf script -s arm-cs-trace-disasm.py ...'
See the CoreSight Hardware-Assisted Trace Application Note for
use of the 'perf report' and 'perf script' commands.

Usage: patch-elf <--verbose> kcore_file vmlinux_file
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# patch-elf kcore/kcore_dir/kcore ./vmlinux
ELF File kcore Properties
ELF file class:     ELF64
ELF file encoding:  Little endian
Machine:            ARM AArch64
Type:               Core file
Number of segments: 3
Number of sections: 0

ELF File vmlinux Properties
ELF file class:     ELF64
ELF file encoding:  Little endian
Machine:            ARM AArch64
Type:               Shared object file
Number of segments: 3
Number of sections: 43

Patching section[ 2] ffff800008010000  17997936 bytes

Patching section[15] ffff800009a31000  20480 bytes

Patching section[16] ffff800009a40000  612372 bytes

Patching section[17] ffff800009ad5818  24752 bytes

[root@...01sys-b212 kernel]# ls -l ./vmlinux
-rwxr-xr-x. 1 root root 390426908 Jul 19 11:14 ./vmlinux*
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# timeout 45s perf script --input
./kcore/data -s ../../scripts/arm-cs-trace-disasm.py -F
cpu,event,ip,addr,sym – -d objdump -k ./vmlinux > ./perf.itrace
[root@...01sys-b212 kernel]#
[root@...01sys-b212 kernel]# ls -l perf.itrace
-rw-r--r--. 1 root root  32142060 Jul 19 11:18 perf.itrace

Here is the representative kernel instruction trace using a patched vmlinux.

        .
        .
        ffff800008ab89a8:       1400000d        b       ffff800008ab89dc
<read_zero+0xd4>
              dd  8774/8774  [0015]         0.000000000  read_zero+0xa0
        ffff800008ab89dc <read_zero+0xd4>:
        ffff800008ab89dc:       9248f840        and     x0, x2,
#0xff7fffffffffffff
        ffff800008ab89e0:       aa1403e1        mov     x1, x20
        ffff800008ab89e4:       9418b6fb        bl      ffff8000090e65d0
<__arch_clear_user>
              dd  8774/8774  [0015]         0.000000000  read_zero+0xdc
ffff8000090e65d0 <__arch_clear_user>:
        ffff8000090e65d0:       d503245f        bti     c
        ffff8000090e65d4:       8b010002        add     x2, x0, x1
        ffff8000090e65d8:       f1002021        subs    x1, x1, #0x8
        ffff8000090e65dc:       54000104        b.mi    ffff8000090e65fc
<__arch_clear_user+0x2c>  // b.first
        ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
        ffff8000090e65e4:       91002000        add     x0, x0, #0x8
        ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
        ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
<__arch_clear_user+0x10>  // b.pmore
              dd  8774/8774  [0015]         0.000000000
__arch_clear_user+0x1c
        ffff8000090e65e0 <__arch_clear_user+0x10>:
        ffff8000090e65e0:       f800081f        sttr    xzr, [x0]
        ffff8000090e65e4:       91002000        add     x0, x0, #0x8
        ffff8000090e65e8:       f1002021        subs    x1, x1, #0x8
        ffff8000090e65ec:       54ffffa8        b.hi    ffff8000090e65e0
<__arch_clear_user+0x10>  // b.pmore
        .
        .
        ffff8000090e65f0:       f81f885f        sttr    xzr, [x2, #-8]
        ffff8000090e65f4:       d2800000        mov     x0, #0x0
               // #0
        ffff8000090e65f8:       d65f03c0        ret
              dd  8774/8774  [0015]         0.000000000
__arch_clear_user+0x28
...vec/linux/arch/arm64/lib/clear_user.S   34         ret
        ffff800008ab89e8 <read_zero+0xe0>:
        ffff800008ab89e8:       d503201f        nop
        ffff800008ab89ec:       1400000b        b       ffff800008ab8a18
<read_zero+0x110>
              dd  8774/8774  [0015]         0.000000000  read_zero+0xe4
                         /home/stevec/linux/drivers/char/mem.c     521
              left = clear_user(buf + cleared, chunk);
        ffff800008ab8a18 <read_zero+0x110>:
        ffff800008ab8a18:       8b14035a        add     x26, x26, x20
        ffff800008ab8a1c:       b5000360        cbnz    x0,
ffff800008ab8a88 <read_zero+0x180>
        ffff800008ab8a20:       f9400320        ldr     x0, [x25]
        ffff800008ab8a24:       cb140273        sub     x19, x19, x20
        .
        .
        .

This begs the question what perf enhancements could be added to make
ARM64 kernel instruction trace easier to use? The process I’ve followed
is cumbersome, but could be done behind the scenes by perf. The caveat
is it requires a vmlinux which might not be available to an end user.
Here are 2 options.

1.	'perf report -kcore'could use the process I’ve used here
transparently in the background. The plus side is the objdump feature of
mixed disassembly is available based on the current vmlinux.

2.	'perf report -kcore' generates an ELF + symbols file based on
kallsyms (and/or System.map). No vmlinux patching, so intermixed source
and disassembly wouldn’t be available. It’s a reasonable alternative
without relying on vmlinux.

It makes sense performance-wise to use an ARM64 disassembler directly
through perf. perf-script use of the arm-cs-trace-disasm.py python
script can be slow. I’m unfamiliar with the Intel implementation, but
perf-annotate uses objdump. Unfortunately, I can't seem to get annotate
to work for me. A patch operation is still be required if vmlinux is used.

I suspect the CoreSight/perf communities are aware of these issues.
Is there any ongoing work not known to the outside world?

Thanks and regards,
Steve C.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ