[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e9d328f0-8201-a2cc-41db-9b472b541824@os.amperecomputing.com>
Date: Tue, 29 Aug 2023 09:20:03 -0700
From: Steve Clevenger <scclevenger@...amperecomputing.com>
To: James Clark <james.clark@....com>, coresight@...ts.linaro.org,
linux-arm-kernel@...ts.infradead.org,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Cc: yang@...amperecomputing.com
Subject: Re: About ARM64 Kernel Instruction Trace
Hi James,
On 8/29/2023 5:55 AM, James Clark wrote:
>
>
> On 29/08/2023 05:44, Steve Clevenger wrote:
>>
>> Ampere has been using the following command sequence to generate a 'perf
>> record' kernel instruction trace on an AmpereOne AARCH64 based systems
>> using a 6.3.0 Fedora distribution with CoreSight related patches.
>>
>>
>> [root@...01sys-b212 kernel]# uname -a
>> Linux sut01sys-b212.scc-lab.amperecomputing.com 6.3.0-coresight-enabled+
>> #9 SMP PREEMPT_DYNAMIC Fri Jun 30 12:54:14 PDT 2023 aarch64 aarch64
>> aarch64 GNU/Linux
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# timeout 45s perf record --kcore -o kcore -e
>> cs_etm/@..._etr63/k --per-thread taskset --cpu-list 15 dd if=/dev/zero
>> of=/dev/null
>> [ perf record: Woken up 41 times to write data ]
>> [ perf record: Captured and wrote 144.056 MB kcore ]
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# ls -l ~/linux/vmlinux
>> -rwxr-xr-x. 1 root root 390426912 Jun 30 12:54 /home/stevec/linux/vmlinux*
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# timeout 45s perf script --input
>> ./kcore/data -s ../../scripts/arm-cs-trace-disasm.py -F
>> cpu,event,ip,addr,sym – -d objdump -k ~/linux/vmlinux > ./perf.itrace
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# ls -l perf.itrace
>> -rw-r--r--. 1 root root 485834744 Jul 17 14:18 perf.itrace
>>
>>
>> The executive summary is the code sections in vmlinux do not match the
>> actual code executing on the target due to self-modifying code
>> (alternate code sequences) in the kernel image. These sequences are
>> typically applied over nop place-holder instructions. This was verified
>> by comparing a location in the ./drivers/char/mem.c read_zero objdump
>> instruction stream to the memory resident instructions captured with the
>> TRACE32 ETM instruction trace feature. This particular instruction trace
>> example shows a disconnect. It displays a ‘b ffff800008ab8a70
>> <read_zero+0x168>’ instruction at address 0xffff800008ab89e8 which did
>> not branch. The actual instruction present at 0xffff800008ab89e8 is a ‘nop’.
>> .
>> .
>> ffff800008ab89a8: 1400000d b ffff800008ab89dc
>> <read_zero+0xd4>
>> dd 8774/8774 [0015] 0.000000000 read_zero+0xa0
>> ffff800008ab89dc <read_zero+0xd4>:
>> ffff800008ab89dc: 9248f840 and x0, x2,
>> #0xff7fffffffffffff
>> ffff800008ab89e0: aa1403e1 mov x1, x20
>> ffff800008ab89e4: 9418b6fb bl ffff8000090e65d0
>> <__arch_clear_user>
>> dd 8774/8774 [0015] 0.000000000 read_zero+0xdc
>> ffff8000090e65d0 <__arch_clear_user>:
>> ffff8000090e65d0: d503245f bti c
>> ffff8000090e65d4: 8b010002 add x2, x0, x1
>> ffff8000090e65d8: f1002021 subs x1, x1, #0x8
>> ffff8000090e65dc: 54000104 b.mi ffff8000090e65fc
>> <__arch_clear_user+0x2c> // b.first
>> ffff8000090e65e0: f800081f sttr xzr, [x0]
>> ffff8000090e65e4: 91002000 add x0, x0, #0x8
>> ffff8000090e65e8: f1002021 subs x1, x1, #0x8
>> ffff8000090e65ec: 54ffffa8 b.hi ffff8000090e65e0
>> <__arch_clear_user+0x10> // b.pmore
>> dd 8774/8774 [0015] 0.000000000
>> __arch_clear_user+0x1c
>> ...vec/linux/arch/arm64/lib/clear_user.S 31 b.hi 1b
>> ffff8000090e65e0 <__arch_clear_user+0x10>:
>> ffff8000090e65e0: f800081f sttr xzr, [x0]
>> ffff8000090e65e4: 91002000 add x0, x0, #0x8
>> ffff8000090e65e8: f1002021 subs x1, x1, #0x8
>> ffff8000090e65ec: 54ffffa8 b.hi ffff8000090e65e0
>> <__arch_clear_user+0x10> // b.pmore
>> .
>> .
>> ffff8000090e65f0: f81f885f sttr xzr, [x2, #-8]
>> ffff8000090e65f4: d2800000 mov x0, #0x0
>> // #0
>> ffff8000090e65f8: d65f03c0 ret
>> dd 8774/8774 [0015] 0.000000000
>> __arch_clear_user+0x28
>> ...vec/linux/arch/arm64/lib/clear_user.S 34 ret
>> ffff800008ab89e8 <read_zero+0xe0>:
>> ffff800008ab89e8: 14000022 b ffff800008ab8a70
>> <read_zero+0x168>
>> ffff800008ab89ec: d503201f nop
>> dd 8774/8774 [0015] 0.000000000 read_zero+0xe4
>> /home/stevec/linux/drivers/char/mem.c 521
>> left = clear_user(buf + cleared, chunk);
>> ffff800008ab8a18 <read_zero+0x110>:
>> ffff800008ab8a18: 8b14035a add x26, x26, x20
>> ffff800008ab8a1c: b5000360 cbnz x0,
>> ffff800008ab8a88 <read_zero+0x180>
>> ffff800008ab8a20: f9400320 ldr x0, [x25]
>> ffff800008ab8a24: cb140273 sub x19, x19, x20
>> .
>> .
>>
>> Ostensibly, the ‘perf record –kcore’ option produces a representative
>> image of the kernel. But this option does not produce suitable output to
>> generate ARM64 kernel instruction trace. perf doesn’t disassemble ARM64,
>> so the arm-cs-trace-disasm.py python script is used with the native
>> objdump utility to provide ARM64 disassembly from CoreSight trace
>> capture. objdump itself requires an ELF/DWARF file image + symbols to
>> generate the symbolic (+ line information for mixed source) disassembly.
>> The linux vmlinux image + symbols file is typically used for this
>> purpose. The kallsyms file is not formatted for objdump use. As an
>> experiment, I patched the executable code sections in a local copy of
>> vmlinux with the corresponding executable code segments extracted from
>> the kcore image using an Ampere internal ELF patch utility.
>>
>> This patch utility leverages the (MIT Licensed) ELFIO open source
>> library API. These were the commands.
>>
>>
>> [root@...01sys-b212 kernel]# timeout 30s perf record --kcore -o kcore -e
>> cs_etm/@..._etr63/k --per-thread taskset --cpu-list 15 dd if=/dev/zero
>> of=/dev/null
>> [ perf record: Woken up 25 times to write data ]
>> [ perf record: Captured and wrote 88.053 MB kcore ]
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# ls -l ~/linux/vmlinux
>> -rwxr-xr-x. 1 root root 390426912 Jun 30 12:54 /home/stevec/linux/vmlinux*
>> [root@...01sys-b212 kernel]# cp ~/linux/vmlinux .
>> [root@...01sys-b212 kernel]# patch-elf --help
>>
>> patch-elf overlays the kernel image from a local copy of
>> '/proc/kcore' to the corresponding (by address) ELF sections
>> in a local copy of the vmlinux ELF file.
>> A local '/proc/kcore' is created by:
>> 'perf report --kcore -e cs_etm/@..._etr1/k ...'
>> The local (patched) vmlinux copy is used by:
>> 'perf script -s arm-cs-trace-disasm.py ...'
>> See the CoreSight Hardware-Assisted Trace Application Note for
>> use of the 'perf report' and 'perf script' commands.
>>
>> Usage: patch-elf <--verbose> kcore_file vmlinux_file
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# patch-elf kcore/kcore_dir/kcore ./vmlinux
>> ELF File kcore Properties
>> ELF file class: ELF64
>> ELF file encoding: Little endian
>> Machine: ARM AArch64
>> Type: Core file
>> Number of segments: 3
>> Number of sections: 0
>>
>> ELF File vmlinux Properties
>> ELF file class: ELF64
>> ELF file encoding: Little endian
>> Machine: ARM AArch64
>> Type: Shared object file
>> Number of segments: 3
>> Number of sections: 43
>>
>> Patching section[ 2] ffff800008010000 17997936 bytes
>>
>> Patching section[15] ffff800009a31000 20480 bytes
>>
>> Patching section[16] ffff800009a40000 612372 bytes
>>
>> Patching section[17] ffff800009ad5818 24752 bytes
>>
>> [root@...01sys-b212 kernel]# ls -l ./vmlinux
>> -rwxr-xr-x. 1 root root 390426908 Jul 19 11:14 ./vmlinux*
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# timeout 45s perf script --input
>> ./kcore/data -s ../../scripts/arm-cs-trace-disasm.py -F
>> cpu,event,ip,addr,sym – -d objdump -k ./vmlinux > ./perf.itrace
>> [root@...01sys-b212 kernel]#
>> [root@...01sys-b212 kernel]# ls -l perf.itrace
>> -rw-r--r--. 1 root root 32142060 Jul 19 11:18 perf.itrace
>>
>> Here is the representative kernel instruction trace using a patched vmlinux.
>>
>> .
>> .
>> ffff800008ab89a8: 1400000d b ffff800008ab89dc
>> <read_zero+0xd4>
>> dd 8774/8774 [0015] 0.000000000 read_zero+0xa0
>> ffff800008ab89dc <read_zero+0xd4>:
>> ffff800008ab89dc: 9248f840 and x0, x2,
>> #0xff7fffffffffffff
>> ffff800008ab89e0: aa1403e1 mov x1, x20
>> ffff800008ab89e4: 9418b6fb bl ffff8000090e65d0
>> <__arch_clear_user>
>> dd 8774/8774 [0015] 0.000000000 read_zero+0xdc
>> ffff8000090e65d0 <__arch_clear_user>:
>> ffff8000090e65d0: d503245f bti c
>> ffff8000090e65d4: 8b010002 add x2, x0, x1
>> ffff8000090e65d8: f1002021 subs x1, x1, #0x8
>> ffff8000090e65dc: 54000104 b.mi ffff8000090e65fc
>> <__arch_clear_user+0x2c> // b.first
>> ffff8000090e65e0: f800081f sttr xzr, [x0]
>> ffff8000090e65e4: 91002000 add x0, x0, #0x8
>> ffff8000090e65e8: f1002021 subs x1, x1, #0x8
>> ffff8000090e65ec: 54ffffa8 b.hi ffff8000090e65e0
>> <__arch_clear_user+0x10> // b.pmore
>> dd 8774/8774 [0015] 0.000000000
>> __arch_clear_user+0x1c
>> ffff8000090e65e0 <__arch_clear_user+0x10>:
>> ffff8000090e65e0: f800081f sttr xzr, [x0]
>> ffff8000090e65e4: 91002000 add x0, x0, #0x8
>> ffff8000090e65e8: f1002021 subs x1, x1, #0x8
>> ffff8000090e65ec: 54ffffa8 b.hi ffff8000090e65e0
>> <__arch_clear_user+0x10> // b.pmore
>> .
>> .
>> ffff8000090e65f0: f81f885f sttr xzr, [x2, #-8]
>> ffff8000090e65f4: d2800000 mov x0, #0x0
>> // #0
>> ffff8000090e65f8: d65f03c0 ret
>> dd 8774/8774 [0015] 0.000000000
>> __arch_clear_user+0x28
>> ...vec/linux/arch/arm64/lib/clear_user.S 34 ret
>> ffff800008ab89e8 <read_zero+0xe0>:
>> ffff800008ab89e8: d503201f nop
>> ffff800008ab89ec: 1400000b b ffff800008ab8a18
>> <read_zero+0x110>
>> dd 8774/8774 [0015] 0.000000000 read_zero+0xe4
>> /home/stevec/linux/drivers/char/mem.c 521
>> left = clear_user(buf + cleared, chunk);
>> ffff800008ab8a18 <read_zero+0x110>:
>> ffff800008ab8a18: 8b14035a add x26, x26, x20
>> ffff800008ab8a1c: b5000360 cbnz x0,
>> ffff800008ab8a88 <read_zero+0x180>
>> ffff800008ab8a20: f9400320 ldr x0, [x25]
>> ffff800008ab8a24: cb140273 sub x19, x19, x20
>> .
>> .
>> .
>>
>> This begs the question what perf enhancements could be added to make
>> ARM64 kernel instruction trace easier to use? The process I’ve followed
>> is cumbersome, but could be done behind the scenes by perf. The caveat
>> is it requires a vmlinux which might not be available to an end user.
>> Here are 2 options.
>>
>> 1. 'perf report -kcore'could use the process I’ve used here
>> transparently in the background. The plus side is the objdump feature of
>> mixed disassembly is available based on the current vmlinux.
>>
>
> Hi Steve,
>
> What you're saying makes sense to me. I think #1 sounds best, I'm not
> sure of the use-case where you wanted to make actionable decisions from
> the trace but wouldn't have vmlinux available? Maybe an end-user could
> be missing it, but I can only imagine use cases where you are actively
> building and developing the kernel.
I agree the primary use case is for a kernel developer who will have a
vmlinux. There's a non-zero chance of complaints from a fringe element,
hence the second option. In my opinion the vmlinux solution makes better
sense and offers the best function.
>
>> 2. 'perf report -kcore' generates an ELF + symbols file based on
>> kallsyms (and/or System.map). No vmlinux patching, so intermixed source
>> and disassembly wouldn’t be available. It’s a reasonable alternative
>> without relying on vmlinux.
>>
>> It makes sense performance-wise to use an ARM64 disassembler directly
>> through perf. perf-script use of the arm-cs-trace-disasm.py python
>> script can be slow. I’m unfamiliar with the Intel implementation, but
>> perf-annotate uses objdump. Unfortunately, I can't seem to get annotate
>> to work for me. A patch operation is still required if vmlinux is used.
>>
>
> We have had other reports about arm-cs-trace-disasm.py being slow. I had
> a plan to make it work with disassembly directly from Perf as that is
> available in other modes. For example the annotate mode in perf already
> works on Arm. What error are you getting with it exactly? It might
> require installation of the aarch64 objdump tool if you are running on
> x86, or provide the --objdump option?
My self-hosted trace tests are done natively on Ampere systems. I don't
know if it's the case here, but python performance is compromised for
cases where the runtime is repeatedly loaded during use.
Nothing seems out-of-order with the 'perf annotate' command line, but I
admit I've not used it much. The complaint is a missing DSO (text below).
$ perf annotate --input=kcore/data -k ./vmlinux --itrace=iybxwpe
CS ETM Trace: Missing DSO. Use 'perf archive' or debuginfod to export
data from the traced system. Enable CONFIG_PROC_KCORE or use option -k
/path/to/vmlinux for kernel symbols.
>
>> I suspect the CoreSight/perf communities are aware of these issues.
>> Is there any ongoing work not known to the outside world?
>>
>> Thanks and regards,
>> Steve C.
Powered by blists - more mailing lists