Date:   Fri, 21 Oct 2022 16:56:20 +0100
From:   Mark Rutland <mark.rutland@....com>
To:     llvm@...ts.linux.dev
Cc:     linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        Fangrui Song <maskray@...gle.com>,
        Joao Moreira <joao@...rdrivepizza.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Kees Cook <keescook@...omium.org>,
        Nathan Chancellor <nathan@...nel.org>,
        Nick Desaulniers <ndesaulniers@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Sami Tolvanen <samitolvanen@...gle.com>
Subject: kCFI && patchable-function-entry=M,N

Hi,

For arm64, I'd like to use -fpatchable-function-entry=M,N (where N > 0) for our
ftrace implementation, which instruments *some* but not all functions.
Unfortunately, this doesn't play nicely with -fsanitize=kcfi, as instrumented
and non-instrumented functions don't agree on where the type hash should live
relative to the function entry point, making them incompatible with one another.
AFAICT, there's no mechanism today to get them to agree.

Today we use -fpatchable-function-entry=2, which happens to avoid this.
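
(With N=0 the patched NOPs all land after the entry point, so the type hash
keeps its usual slot immediately before the function for instrumented and
non-instrumented code alike; as a rough sketch:

| 	< HASH >
| func:	NOP
| 	NOP

... and every caller can check the hash at the same small negative offset from
the entry point.)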

For example, in the below, functions with N=0 expect the type hash to be placed
4 bytes before the entry point, and functions with N=2 expect the type hash to
be placed 12 bytes before the entry point.

| % cat test.c
| #define __notrace       __attribute__((patchable_function_entry(0, 0)))
| 
| void callee_patchable(void)
| {
| }
| 
| void __notrace callee_non_patchable(void)
| {
| }
| 
| typedef void (*callee_fn_t)(void);
| 
| void caller_patchable(callee_fn_t callee)
| {
|         callee();
| }
| 
| void __notrace caller_non_patchable(callee_fn_t callee)
| {
|         callee();
| }
| % clang --target=aarch64-linux -c test.c -fpatchable-function-entry=2,2 -fsanitize=kcfi -O2
| % aarch64-linux-objdump -d test.o
| 
| test.o:     file format elf64-littleaarch64
| 
| 
| Disassembly of section .text:
| 
| 0000000000000000 <callee_patchable-0xc>:
|    0:   a540670c        .word   0xa540670c
|    4:   d503201f        nop
|    8:   d503201f        nop
| 
| 000000000000000c <callee_patchable>:
|    c:   d65f03c0        ret
|   10:   a540670c        .word   0xa540670c
| 
| 0000000000000014 <callee_non_patchable>:
|   14:   d65f03c0        ret
|   18:   07d85f31        .word   0x07d85f31
|   1c:   d503201f        nop
|   20:   d503201f        nop
| 
| 0000000000000024 <caller_patchable>:
|   24:   b85f4010        ldur    w16, [x0, #-12]
|   28:   728ce191        movk    w17, #0x670c
|   2c:   72b4a811        movk    w17, #0xa540, lsl #16
|   30:   6b11021f        cmp     w16, w17
|   34:   54000040        b.eq    3c <caller_patchable+0x18>  // b.none
|   38:   d4304400        brk     #0x8220
|   3c:   d61f0000        br      x0
|   40:   07d85f31        .word   0x07d85f31
| 
| 0000000000000044 <caller_non_patchable>:
|   44:   b85fc010        ldur    w16, [x0, #-4]
|   48:   728ce191        movk    w17, #0x670c
|   4c:   72b4a811        movk    w17, #0xa540, lsl #16
|   50:   6b11021f        cmp     w16, w17
|   54:   54000040        b.eq    5c <caller_non_patchable+0x18>  // b.none
|   58:   d4304400        brk     #0x8220
|   5c:   d61f0000        br      x0

On arm64, I'd like to use -fpatchable-function-entry=4,2, along with
-falign-functions=8, so that we can place a naturally-aligned 8-byte literal
before the function (e.g. a pointer value). That allows us to implement an
efficient per-callsite hook without hitting branch range limitations and/or
requiring trampolines. I have a PoC that works without kCFI at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
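
Ignoring kCFI for the moment, that depends on a pre-function layout roughly
like the below, with the two pre-function NOPs patched to form the
naturally-aligned literal:

| 	< 8-byte literal >	// e.g. a pointer value
| func:	BTI	// optional
| 	NOP
| 	NOP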

I mentioned this in the #clangbuiltlinux IRC channel, and Sami wrote up a
github issue at: https://github.com/ClangBuiltLinux/linux/issues/1744

Currently clang generates:

| 	< HASH >
| 	NOP
| 	NOP
| func:	BTI	// optional
| 	NOP
| 	NOP

... and to make non-instrumented functions consistent with this, they would
need pre-function padding between their type hash and their entry point.
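
As a sketch, a non-instrumented function would then look like:

| 	< HASH >
| 	NOP	// padding only; never patched
| 	NOP
| func:	BTI	// optional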

For my use-case, it doesn't matter where the pre-function NOPs are placed
relative to the type hash, so long as the location is consistent, and it might
be nicer to have the option to place the pre-function NOPs before the hash,
which would avoid needing to change non-instrumented functions (and save some
space) e.g.

| 	NOP
| 	NOP
| 	< HASH >
| func:	BTI	// optional
| 	NOP
| 	NOP

... but I understand that for x86, folk want the pre-function NOPs to fall
through into the body of the function.

Is there any mechanism today that we could use to solve this, or could we
extend clang to have some options to control this behaviour?

It would also be helpful to have a symbol before both the hash and the
pre-function NOPs so that we can filter those out of probe patching (I see
that x86 does this with the __cfi_<function> symbols).
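
For arm64 that might look something like the below (using the alternative
layout from above, and a hypothetical __cfi_func symbol mirroring the x86
naming), so that probe blacklisting only has to skip the region covered by
the symbol:

| __cfi_func:
| 	NOP
| 	NOP
| 	< HASH >
| func:	BTI	// optional
| 	NOP
| 	NOP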

Thanks,
Mark.
