Message-ID: <20181105123324.GA11147@kernel.org>
Date: Mon, 5 Nov 2018 09:33:24 -0300
From: Arnaldo Carvalho de Melo <acme@...nel.org>
To: Yonghong Song <yhs@...com>
Cc: Edward Cree <ecree@...arflare.com>,
Daniel Borkmann <daniel@...earbox.net>,
Jiri Olsa <jolsa@...hat.com>, Martin Lau <kafai@...com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Linux Networking Development Mailing List
<netdev@...r.kernel.org>
Subject: Re: Help with the BPF verifier
On Fri, Nov 02, 2018 at 09:27:52PM +0000, Yonghong Song wrote:
>
>
> On 11/2/18 8:42 AM, Edward Cree wrote:
> > On 02/11/18 15:02, Arnaldo Carvalho de Melo wrote:
> >> Yeah, that didn't work either:
> >
> >> And the -vv in 'perf trace' didn't seem to map to further details in the
> >> output of the verifier debug:
> > Yeah for log_level 2 you probably need to make source-level changes to either
> > perf or libbpf (I think the latter). It's annoying that essentially no tools
> > plumb through an option for that, someone should fix them ;-)
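For reference, log_level 2 is just a field in the bpf(2) BPF_PROG_LOAD
attributes, so it can be requested straight from the kernel when the tooling
doesn't expose it. A minimal, untested sketch using the raw syscall (the
prog_type and the insns/insn_cnt parameters are placeholders):

#include <linux/bpf.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

static char bpf_log_buf[1 << 20];

static int prog_load(const struct bpf_insn *insns, unsigned int insn_cnt)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT;
	attr.insns     = (__u64)(unsigned long)insns;
	attr.insn_cnt  = insn_cnt;
	attr.license   = (__u64)(unsigned long)"GPL";
	attr.log_buf   = (__u64)(unsigned long)bpf_log_buf;
	attr.log_size  = sizeof(bpf_log_buf);
	attr.log_level = 2; /* 1 = log on failure, 2 = verbose per-insn state */

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
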
> >
> >> libbpf: -- BEGIN DUMP LOG ---
> >> libbpf:
> >> 0: (bf) r6 = r1
> >> 1: (bf) r1 = r10
> >> 2: (07) r1 += -328
> >> 3: (b7) r7 = 64
> >> 4: (b7) r2 = 64
> >> 5: (bf) r3 = r6
> >> 6: (85) call bpf_probe_read#4
> >> 7: (79) r1 = *(u64 *)(r10 -320)
> >> 8: (15) if r1 == 0x101 goto pc+4
> >> R0=inv(id=0) R1=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> >> 9: (55) if r1 != 0x2 goto pc+22
> >> R0=inv(id=0) R1=inv2 R6=ctx(id=0,off=0,imm=0) R7=inv64 R10=fp0,call_-1
> >> 10: (bf) r1 = r6
> >> 11: (07) r1 += 16
> >> 12: (05) goto pc+2
> >> 15: (79) r3 = *(u64 *)(r1 +0)
> >> dereference of modified ctx ptr R1 off=16 disallowed
> > Aha, we at least got a different error message this time.
> > And indeed llvm has done that optimisation, rather than the more obvious
> > 11: r3 = *(u64 *)(r1 +16)
> > because it wants to have lots of reads share a single insn. You may be able
> > to defeat that optimisation by adding compiler barriers, idk. Maybe someone
> > with llvm knowledge can figure out how to stop it (ideally, llvm would know
> > when it's generating for bpf backend and not do that). -O0? ¯\_(ツ)_/¯
>
> The optimization mostly looks like below:
>
>    br1:
>      ...
>      r1 += 16
>      goto merge
>    br2:
>      ...
>      r1 += 20
>      goto merge
>    merge:
>      *(u64 *)(r1 + 0)
>
> The compiler tries to merge common loads. There is no easy way to
> stop this compiler optimization without turning off a lot of other
> optimizations. The easiest way is to add barriers
> __asm__ __volatile__("": : :"memory")
> after the ctx memory access to prevent their downstream merging.
Great, this made it work:
cat /home/acme/git/perf/tools/perf/examples/bpf/augmented_raw_syscalls.c
#include <stdio.h>
#include <linux/socket.h>

/* bpf-output associated map */
struct bpf_map SEC("maps") __augmented_syscalls__ = {
	.type	     = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(u32),
	.max_entries = __NR_CPUS__,
};

struct syscall_enter_args {
	unsigned long long common_tp_fields;
	long		   syscall_nr;
	unsigned long	   args[6];
};

struct syscall_exit_args {
	unsigned long long common_tp_fields;
	long		   syscall_nr;
	long		   ret;
};

struct augmented_filename {
	unsigned int	size;
	int		reserved;
	char		value[256];
};

#define SYS_OPEN 2
#define SYS_OPENAT 257

SEC("raw_syscalls:sys_enter")
int sys_enter(struct syscall_enter_args *args)
{
	struct {
		struct syscall_enter_args args;
		struct augmented_filename filename;
	} augmented_args;
	unsigned int len = sizeof(augmented_args);
	const void *filename_arg = NULL;

	probe_read(&augmented_args.args, sizeof(augmented_args.args), args);
	switch (augmented_args.args.syscall_nr) {
	case SYS_OPEN:	 filename_arg = (const void *)args->args[0];
			 /* Barrier: stop llvm from merging this ctx load with
			  * the SYS_OPENAT one, which the verifier rejects as a
			  * "dereference of modified ctx ptr". */
			 __asm__ __volatile__("": : :"memory");
			 break;
	case SYS_OPENAT: filename_arg = (const void *)args->args[1];
			 break;
	default:
			 return 0;
	}

	if (filename_arg != NULL) {
		augmented_args.filename.reserved = 0;
		augmented_args.filename.size	 = probe_read_str(&augmented_args.filename.value,
								  sizeof(augmented_args.filename.value),
								  filename_arg);
		if (augmented_args.filename.size < sizeof(augmented_args.filename.value)) {
			len -= sizeof(augmented_args.filename.value) - augmented_args.filename.size;
			len &= sizeof(augmented_args.filename.value) - 1;
		}
	} else {
		len = sizeof(augmented_args.args);
	}

	perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, &augmented_args, len);
	return 0;
}

SEC("raw_syscalls:sys_exit")
int sys_exit(struct syscall_exit_args *args)
{
	struct syscall_exit_args augmented_args;

	probe_read(&augmented_args, sizeof(augmented_args), args);
	return augmented_args.syscall_nr == SYS_OPEN || augmented_args.syscall_nr == SYS_OPENAT;
}

license(GPL);
Now I have both open and openat getting the right filename:
[root@...et perf]# trace -e tools/perf/examples/bpf/augmented_raw_syscalls.c sleep 1
LLVM: dumping tools/perf/examples/bpf/augmented_raw_syscalls.o
0.000 ( 0.010 ms): sleep/15794 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC) = 3
0.028 ( 0.005 ms): sleep/15794 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC) = 3
0.241 ( 0.005 ms): sleep/15794 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
[root@...et perf]#
[root@...et perf]# trace -e tools/perf/examples/bpf/augmented_raw_syscalls.o cat /etc/passwd > /dev/null
0.000 ( 0.010 ms): cat/15918 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC) = 3
0.028 ( 0.006 ms): cat/15918 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC) = 3
0.241 ( 0.007 ms): cat/15918 open(filename: /usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
0.289 ( 0.004 ms): cat/15918 openat(dfd: CWD, filename: /etc/passwd) = 3
[root@...et perf]#
[root@...et perf]# readelf -SW tools/perf/examples/bpf/augmented_raw_syscalls.o
There are 11 section headers, starting at offset 0x438:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .strtab STRTAB 0000000000000000 000376 0000bd 00 0 0 1
[ 2] .text PROGBITS 0000000000000000 000040 000000 00 AX 0 0 4
[ 3] raw_syscalls:sys_enter PROGBITS 0000000000000000 000040 000138 00 AX 0 0 8
[ 4] .relraw_syscalls:sys_enter REL 0000000000000000 000360 000010 10 10 3 8
[ 5] raw_syscalls:sys_exit PROGBITS 0000000000000000 000178 000070 00 AX 0 0 8
[ 6] maps PROGBITS 0000000000000000 0001e8 000038 00 WA 0 0 4
[ 7] license PROGBITS 0000000000000000 000220 000004 00 WA 0 0 1
[ 8] version PROGBITS 0000000000000000 000224 000004 00 WA 0 0 4
[ 9] .llvm_addrsig LOOS+0xfff4c03 0000000000000000 000370 000006 00 E 10 0 1
[10] .symtab SYMTAB 0000000000000000 000228 000138 18 1 7 8
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
[root@...et perf]#
> > Alternatively, your prog looks short enough that maybe you could kick the C
> > habit and write it directly in eBPF asm, that way no-one is optimising things
> > behind your back. (I realise this option won't appeal to everyone ;-)
>
> LLVM supports BPF inline assembly as well. Some examples are here:
> https://github.com/llvm-mirror/llvm/blob/master/test/CodeGen/BPF/inline_asm.ll
> You may try it for selective ctx accesses to work around some
> compiler optimizations. I personally have not used it yet and am
> actually not sure whether it works or not :-)
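For illustration, forcing a single full-offset ctx load with inline asm could
look roughly like the below. This is an untested sketch; the 16 is the offset
of args[0] in the sys_enter ctx, and the constraint syntax follows the llvm
test referenced above:

	unsigned long long filename_ptr;

	/* Untested: do the ctx load in one asm insn so the compiler can
	 * neither split it into "r1 += 16; rX = *(u64 *)(r1 + 0)" nor
	 * merge it with loads on other paths. */
	asm volatile("%0 = *(u64 *)(%1 + 16)"
		     : "=r"(filename_ptr)
		     : "r"(args));
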
I will take a look at it, but the simple barrier right after the first ctx
access did the trick for me; I'll probably cook up some ctx accessor macros
to make this transparent.
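Such an accessor could be as simple as a macro pairing the load with the
barrier; a hypothetical sketch (the SC_ENTER_ARG name is made up):

/* Hypothetical accessor: load one syscall arg from the ctx, then emit a
 * compiler barrier so llvm can't merge this load with the ones done on
 * the other switch arms. */
#define SC_ENTER_ARG(ctx, n) ({					\
	const void *__arg = (const void *)(ctx)->args[n];	\
	__asm__ __volatile__("" : : : "memory");		\
	__arg;							\
})

which would turn the switch above into:

	case SYS_OPEN:	 filename_arg = SC_ENTER_ARG(args, 0); break;
	case SYS_OPENAT: filename_arg = SC_ENTER_ARG(args, 1); break;
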
Now to make -e open,openat set up filters via a map, get that
augmented_raw_syscalls.c added automatically to the evlist, etc.
- Arnaldo
> > The reason the verifier disallows this, iirc, is because it needs to be able
> > to rewrite the offsets on ctx accesses (see convert_ctx_accesses()) in case
> > the underlying kernel struct doesn't match the layout of the ctx ABI. To do
> > this it needs the ctx offset to live entirely in the insn doing the access,
> > otherwise different paths could lead to the same insn accessing different ctx
> > offsets with different fixups needed — can't be done.
> >
> > -Ed
> >