netdev - Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <8fbdeff9-23d6-7453-782c-f2ca6880af8a@gmail.com>
Date:   Thu, 18 Jan 2018 09:27:09 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     Martin KaFai Lau <kafai@...com>
Cc:     borkmann@...earbox.net, ast@...nel.org, netdev@...r.kernel.org
Subject: Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to
 monitor socket TX/RX data

On 01/17/2018 02:04 PM, Martin KaFai Lau wrote:
> On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
>> This implements a BPF ULP layer to allow policy enforcement and
>> monitoring at the socket layer. In order to support this a new
>> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
>> the sendmsg/sendpage hook. To attach the policy to sockets a
>> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.


[...]

> Some minor comments/nits below:

Thanks for reviewing!

>> +
>> +static unsigned int smap_do_tx_msg(struct sock *sk,
>> +				   struct smap_psock *psock,
>> +				   struct sk_msg_buff *md)
>> +{
>> +	struct bpf_prog *prog;
>> +	unsigned int rc, _rc;
>> +
>> +	preempt_disable();
> Why preempt_disable() is needed?

If we run a BPF program with preempt enabled my read of BPF
code path is we may race with multiple programs running. For
example program A gets preempted during a map update to an
array map, program B updates same map entry, program A then
updates some piece of the entry. The result is garbage in
the array element. Just one example.

With PREEMPT-RCU the above seems possible.

> 
>> +	rcu_read_lock();
>> +
>> +	/* If the policy was removed mid-send then default to 'accept' */
>> +	prog = READ_ONCE(psock->bpf_tx_msg);
>> +	if (unlikely(!prog)) {
>> +		_rc = SK_PASS;
>> +		goto verdict;
>> +	}
>> +
>> +	bpf_compute_data_pointers_sg(md);
>> +	_rc = (*prog->bpf_func)(md, prog->insnsi);
>> +
>> +verdict:
>> +	rcu_read_unlock();
>> +	preempt_enable();
>> +
>> +	/* Moving return codes from UAPI namespace into internal namespace */
>> +	rc = ((_rc == SK_PASS) ?
>> +	      (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +	      __SK_DROP);
>> +
>> +	return rc;
>> +}
>> +

[...]

>> +out:
>> +	for (i = sg_curr; i < sg_num; ++i) {
>> +		free += sg[i].length;
> free is not init.
> 

Nice catch thanks.

>> +		put_page(sg_page(&sg[i]));
>> +	}
>> +	return free;
>> +}
>> +

>> +
>> +		/* sg_size indicates bytes already allocated and sg_num
>> +		 * is last sg element used. This is used when alloc_sg
> s/alloc_sg/sk_alloc_sg/
> 
>> +		 * partially allocates a scatterlist and then is sent
>> +		 * to wait for memory. In normal case (no memory pressure)
>> +		 * both sg_nun and sg_size are zero.
> s/sg_nun/sg_num/
> 

Will fix both spelling issues.

>> +		 */
>> +		copy = copy - sg_size;
>> +		err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
>> +		if (err) {
>> +			if (err != -ENOSPC)
>> +				goto wait_for_memory;
>> +			copy = sg_size;
>> +		}
>> +

[...]

>> +do_redir:
>> +	/* To avoid deadlock with multiple socks all doing redirects to
>> +	 * each other we must first drop the current sock lock and release
>> +	 * the psock. Then get the redirect socket (assuming it still
>> +	 * exists), take it's lock, and finally do the send here. If the
>> +	 * redirect fails there is nothing to do, we don't want to blame
>> +	 * the sender for remote socket failures. Instead we simply
>> +	 * continue making forward progress.
>> +	 */
>> +	return_mem_sg(sk, sg, sg_num);
>> +	release_sock(sk);
>> +	smap_release_sock(psock, sk);
>> +	copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
>> +	return copied;
> For __SK_REDIRECT case, before returning, should 'msg_data_left(msg)' be checked
> first?  Or msg_data_left(msg) will always be 0 here?
> 

Interesting, yes this would probably be best. Otherwise what happens is we
return with only some of the bytes copied. The application should (must?)
handle this case correctly, but agree might be nice to send as much as
possible here. I'll see if I can work this into a v2 or perhaps I'll do it
as a follow on performance improvement.

Nice observation though for sure.

>> +}
>> +

[...]

>> +	switch (rc) {
>> +	case __SK_PASS:
>> +		lock_sock(sk);
>> +		rc = tcp_sendpage_locked(sk, page, offset, size, flags);
>> +		release_sock(sk);
>> +		break;
>> +	case __SK_REDIRECT:
>> +		smap_release_sock(psock, sk);
> smap_release_sock() is only needed in __SK_REDIRECT case?

Now that I have a test case for this I also caught this with
testing. ;)

> 
>> +		rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
>> +						  &md);
>> +		break;
>> +	case __SK_DROP:
>> +	default:
>> +		rc = -EACCES;
>> +	}
>> +
>> +	return rc;
>> +}

Thanks a lot! Should have a v2 tomorrow after some additional
testing.