Date:   Tue, 16 Jan 2018 18:25:47 -0800
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     John Fastabend <john.fastabend@...il.com>
Cc:     borkmann@...earbox.net, ast@...nel.org, netdev@...r.kernel.org,
        kafai@...com
Subject: Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to
 monitor socket TX/RX data

On Fri, Jan 12, 2018 at 10:11:11AM -0800, John Fastabend wrote:
> This implements a BPF ULP layer to allow policy enforcement and
> monitoring at the socket layer. In order to support this a new
> program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
> the sendmsg/sendpage hook. To attach the policy to sockets a
> sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
> 
> Similar to previous sockmap usages when a sock is added to a
> sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
> program type attached then the BPF ULP layer is created on the
> socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
> every msg in sendmsg case and page/offset in sendpage case.
> 
> BPF_PROG_TYPE_SK_MSG Semantics/API:
> 
> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
> SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
> case and leaves the data untouched in the sendpage case. Both cases
> return -EACCES to the user. Returning SK_PASS will allow the msg to
> be sent.
> 
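Not from the patch, but a minimal sketch of a verdict program under these
semantics, written against the uapi added below (the 4-byte threshold,
program name and section name are made up for illustration):

  #include <linux/bpf.h>
  #define SEC(name) __attribute__((section(name), used))

  SEC("sk_msg")
  int msg_verdict(struct sk_msg_md *msg)
  {
          void *data_end = (void *)(long)msg->data_end;
          void *data = (void *)(long)msg->data;

          /* drop anything too short to carry our made-up 4-byte header,
           * everything else is passed through to the TCP stack
           */
          if (data + 4 > data_end)
                  return SK_DROP;
          return SK_PASS;
  }

  char _license[] SEC("license") = "GPL";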
> In the sendmsg case data is copied into kernel space buffers before
> running the BPF program. In the sendpage case data is never copied.
> The implication is that users may change the data after the BPF program
> runs in the sendpage case. (A flag to force a copy will be added shortly
> for cases where the copy must always be performed.)
> 
> The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
> in the sendmsg() case and the entire page/offset in the sendpage case.
> This avoids ambiguity on how to handle mixed return codes in the
> sendmsg case. The readable/writeable data provided to the program
> in the sendmsg case may not be the entire message; in fact, for
> large sends this is likely the case. The data range that can be
> read is part of the sk_msg_md structure. This is because, similar
> to the tc cls_bpf case, the data is stored in a scatter-gather list.
> Future work will address this shortcoming to allow users to pull
> in more data if needed (similar to TC BPF).
> 
> The helper msg_redirect_map() can be used to select the socket to
> send the data on. This is used similar to existing redirect use
> cases. This allows policy to redirect msgs.
> 
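As another illustrative sketch (assuming the SEC(), struct bpf_map_def and
bpf_msg_redirect_map() declarations that the series' selftests carry in
bpf_helpers.h; the map name and slot are made up), a program that forwards
every msg to the socket stored at index 0 of a second sockmap could be:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") redir_map = {
          .type        = BPF_MAP_TYPE_SOCKMAP,
          .key_size    = sizeof(int),
          .value_size  = sizeof(int),
          .max_entries = 2,
  };

  SEC("sk_msg")
  int msg_redirect(struct sk_msg_md *msg)
  {
          /* the helper returns SK_PASS and records the map/key in the
           * msg metadata; the kernel then treats that as a redirect
           */
          return bpf_msg_redirect_map(msg, &redir_map, 0, 0);
  }

  char _license[] SEC("license") = "GPL";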
> Pseudo code simple example:
> 
> The basic logic to attach a program to a socket is as follows,
> 
>   // load the programs
>   bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
> 		&obj, &msg_prog);
> 
>   // lookup the sockmap
>   bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
> 
>   // get fd for sockmap
>   map_fd_msg = bpf_map__fd(bpf_map_msg);
> 
>   // attach program to sockmap
>   bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
> 
> Adding sockets to the map is done in the normal way,
> 
>   // Add a socket 'fd' to sockmap at location 'i'
>   bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);
> 
> After the above any socket attached to "my_sock_map", in this case
> 'fd', will run the BPF msg verdict program (msg_prog) on every
> sendmsg and sendpage system call.
> 
> For a complete example see BPF selftests bpf/sockmap_tcp_msg_*.c and
> test_maps.c
> 
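Not part of the series, but tying the pseudo code above together: a hedged
end-to-end userspace sketch using the same libbpf calls (object file name,
map name and slot index are illustrative, header paths may differ in-tree):

  #include <errno.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  static int attach_msg_prog(int sock_fd)
  {
          struct bpf_object *obj;
          struct bpf_map *map;
          int prog_fd, map_fd, key = 0, err;

          /* load the BPF_PROG_TYPE_SK_MSG program from an ELF object */
          err = bpf_prog_load("sockmap_tcp_msg_kern.o",
                              BPF_PROG_TYPE_SK_MSG, &obj, &prog_fd);
          if (err)
                  return err;

          /* look up the sockmap defined in the object and get its fd */
          map = bpf_object__find_map_by_name(obj, "my_sock_map");
          if (!map)
                  return -ENOENT;
          map_fd = bpf_map__fd(map);

          /* attach the verdict program to the sockmap */
          err = bpf_prog_attach(prog_fd, map_fd, BPF_SK_MSG_VERDICT, 0);
          if (err)
                  return err;

          /* add the socket; its sendmsg/sendpage now runs the program */
          return bpf_map_update_elem(map_fd, &key, &sock_fd, BPF_ANY);
  }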
> Implementation notes:
> 
> It seemed simplest, to me at least, to use a refcnt to ensure the
> psock is not lost across the sendmsg copy into the sg, the BPF program
> running on the data in sg_data, and the final pass to the TCP stack.
> Some performance testing may show a better method to do this and avoid
> the refcnt cost, but for now use the simpler method.
> 
> Another item that will come after basic support is in place is
> supporting the MSG_MORE flag. At the moment we call sendpages even if
> the MSG_MORE flag is set. An enhancement would be to collect the
> pages into a larger scatterlist and pass it down the stack. Notice that
> bpf_tcp_sendmsg() could support this with some additional state saved
> across sendmsg calls. I built the code to support this without having
> to do refactoring work. Other flags TBD include the ZEROCOPY flag.
> 
> Yet another detail that needs some thought is the size of the scatterlist.
> Currently, we use MAX_SKB_FRAGS simply because this was already being
> used in the TLS case. Future work to improve the kernel sk APIs to
> tune this depending on workload may be useful. This is a trade-off
> between memory usage and throughput (B/s).
> 
> Signed-off-by: John Fastabend <john.fastabend@...il.com>

overall design looks clean. imo huge improvement from first version.

Few nits:

> ---
>  include/linux/bpf.h       |    1 
>  include/linux/bpf_types.h |    1 
>  include/linux/filter.h    |   10 +
>  include/net/tcp.h         |    2 
>  include/uapi/linux/bpf.h  |   28 +++
>  kernel/bpf/sockmap.c      |  485 ++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/bpf/syscall.c      |   14 +
>  kernel/bpf/verifier.c     |    5 
>  net/core/filter.c         |  106 ++++++++++
>  9 files changed, 638 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9e03046..14cdb4d 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -21,6 +21,7 @@
>  struct perf_event;
>  struct bpf_prog;
>  struct bpf_map;
> +struct sock;
>  
>  /* map is generic key/value storage optionally accesible by eBPF programs */
>  struct bpf_map_ops {
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 19b8349..5e2e8a4 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -13,6 +13,7 @@
>  BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
>  #endif
>  #ifdef CONFIG_BPF_EVENTS
>  BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 425056c..f1e9833 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -507,6 +507,15 @@ struct xdp_buff {
>  	struct xdp_rxq_info *rxq;
>  };
>  
> +struct sk_msg_buff {
> +	void *data;
> +	void *data_end;
> +	struct scatterlist sg_data[MAX_SKB_FRAGS];
> +	__u32 key;
> +	__u32 flags;
> +	struct bpf_map *map;
> +};
> +
>  /* Compute the linear packet data range [data, data_end) which
>   * will be accessed by various program types (cls_bpf, act_bpf,
>   * lwt, ...). Subsystems allowing direct data access must (!)
> @@ -769,6 +778,7 @@ int xdp_do_redirect(struct net_device *dev,
>  void bpf_warn_invalid_xdp_action(u32 act);
>  
>  struct sock *do_sk_redirect_map(struct sk_buff *skb);
> +struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
>  
>  #ifdef CONFIG_BPF_JIT
>  extern int bpf_jit_enable;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index a99ceb8..7f56c3c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1984,6 +1984,7 @@ static inline void tcp_listendrop(const struct sock *sk)
>  
>  enum {
>  	TCP_ULP_TLS,
> +	TCP_ULP_BPF,
>  };
>  
>  struct tcp_ulp_ops {
> @@ -2001,6 +2002,7 @@ struct tcp_ulp_ops {
>  int tcp_register_ulp(struct tcp_ulp_ops *type);
>  void tcp_unregister_ulp(struct tcp_ulp_ops *type);
>  int tcp_set_ulp(struct sock *sk, const char *name);
> +int tcp_set_ulp_id(struct sock *sk, const int ulp);
>  void tcp_get_available_ulp(char *buf, size_t len);
>  void tcp_cleanup_ulp(struct sock *sk);
>  
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 405317f..bf649ae 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -133,6 +133,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_SOCK_OPS,
>  	BPF_PROG_TYPE_SK_SKB,
>  	BPF_PROG_TYPE_CGROUP_DEVICE,
> +	BPF_PROG_TYPE_SK_MSG,
>  };
>  
>  enum bpf_attach_type {
> @@ -143,6 +144,7 @@ enum bpf_attach_type {
>  	BPF_SK_SKB_STREAM_PARSER,
>  	BPF_SK_SKB_STREAM_VERDICT,
>  	BPF_CGROUP_DEVICE,
> +	BPF_SK_MSG_VERDICT,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -687,6 +689,15 @@ enum bpf_attach_type {
>   * int bpf_override_return(pt_regs, rc)
>   *	@pt_regs: pointer to struct pt_regs
>   *	@rc: the return value to set
> + *
> + * int bpf_msg_redirect_map(map, key, flags)
> + *     Redirect msg to a sock in map using key as a lookup key for the
> + *     sock in map.
> + *     @map: pointer to sockmap
> + *     @key: key to lookup sock in map
> + *     @flags: reserved for future use
> + *     Return: SK_PASS
> + *
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -747,7 +758,8 @@ enum bpf_attach_type {
>  	FN(perf_event_read_value),	\
>  	FN(perf_prog_read_value),	\
>  	FN(getsockopt),			\
> -	FN(override_return),
> +	FN(override_return),		\
> +	FN(msg_redirect_map),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -909,6 +921,20 @@ enum sk_action {
>  	SK_PASS,
>  };
>  
> +/* User return codes for SK_MSG prog type. */
> +enum sk_msg_action {
> +	SK_MSG_DROP = 0,
> +	SK_MSG_PASS,
> +};
> +
> +/* user accessible metadata for SK_MSG packet hook, new fields must
> + * be added to the end of this structure
> + */
> +struct sk_msg_md {
> +	__u32 data;
> +	__u32 data_end;
> +};
> +
>  #define BPF_TAG_SIZE	8
>  
>  struct bpf_prog_info {
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 972608f..5793f3a 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -38,6 +38,7 @@
>  #include <linux/skbuff.h>
>  #include <linux/workqueue.h>
>  #include <linux/list.h>
> +#include <linux/mm.h>
>  #include <net/strparser.h>
>  #include <net/tcp.h>
>  
> @@ -47,6 +48,7 @@
>  struct bpf_stab {
>  	struct bpf_map map;
>  	struct sock **sock_map;
> +	struct bpf_prog *bpf_tx_msg;
>  	struct bpf_prog *bpf_parse;
>  	struct bpf_prog *bpf_verdict;
>  };
> @@ -74,6 +76,7 @@ struct smap_psock {
>  	struct sk_buff *save_skb;
>  
>  	struct strparser strp;
> +	struct bpf_prog *bpf_tx_msg;
>  	struct bpf_prog *bpf_parse;
>  	struct bpf_prog *bpf_verdict;
>  	struct list_head maps;
> @@ -90,6 +93,8 @@ struct smap_psock {
>  	void (*save_state_change)(struct sock *sk);
>  };
>  
> +static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
> +
>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>  {
>  	return rcu_dereference_sk_user_data(sk);
> @@ -99,8 +104,439 @@ enum __sk_action {
>  	__SK_DROP = 0,
>  	__SK_PASS,
>  	__SK_REDIRECT,
> +	__SK_NONE,
>  };
>  
> +static int memcopy_from_iter(struct sock *sk, struct scatterlist *sg,
> +			     int sg_num, struct iov_iter *from, int bytes)
> +{
> +	int i, rc = 0;
> +
> +	for (i = 0; i < sg_num; ++i) {
> +		int copy = sg[i].length;
> +		char *to = sg_virt(&sg[i]);
> +
> +		if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
> +			rc = copy_from_iter_nocache(to, copy, from);
> +		else
> +			rc = copy_from_iter(to, copy, from);
> +
> +		if (rc != copy) {
> +			rc = -EFAULT;
> +			goto out;
> +		}
> +
> +		bytes -= copy;
> +		if (!bytes)
> +			break;
> +	}
> +out:
> +	return rc;
> +}
> +
> +static int bpf_tcp_push(struct sock *sk, struct scatterlist *sg,
> +			int *sg_end, int flags, bool charge)
> +{
> +	int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
> +	int offset, ret = 0;
> +	struct page *p;
> +	size_t size;
> +
> +	size = sg->length;
> +	offset = sg->offset;
> +
> +	while (1) {
> +		if (sg_is_last(sg))
> +			sendpage_flags = flags;
> +
> +		tcp_rate_check_app_limited(sk);
> +		p = sg_page(sg);
> +retry:
> +		ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
> +		if (ret != size) {
> +			if (ret > 0) {
> +				offset += ret;
> +				size -= ret;
> +				goto retry;
> +			}
> +
> +			if (charge)
> +				sk_mem_uncharge(sk,
> +						sg->length - size - sg->offset);

should the bool argument be called 'uncharge' instead ?

> +
> +			sg->offset = offset;
> +			sg->length = size;
> +			return ret;
> +		}
> +
> +		put_page(p);
> +		if (charge)
> +			sk_mem_uncharge(sk, sg->length);
> +		*sg_end += 1;
> +		sg = sg_next(sg);
> +		if (!sg)
> +			break;
> +
> +		offset = sg->offset;
> +		size = sg->length;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline void bpf_compute_data_pointers_sg(struct sk_msg_buff *md)
> +{
> +	md->data = sg_virt(md->sg_data);
> +	md->data_end = md->data + md->sg_data->length;
> +}
> +
> +static void return_mem_sg(struct sock *sk, struct scatterlist *sg, int end)
> +{
> +	int i;
> +
> +	for (i = 0; i < end; ++i)
> +		sk_mem_uncharge(sk, sg[i].length);
> +}
> +
> +static int free_sg(struct sock *sk, struct scatterlist *sg, int start, int len)
> +{
> +	int i, free = 0;
> +
> +	for (i = start; i < len; ++i) {
> +		free += sg[i].length;
> +		sk_mem_uncharge(sk, sg[i].length);
> +		put_page(sg_page(&sg[i]));
> +	}
> +
> +	return free;
> +}
> +
> +static unsigned int smap_do_tx_msg(struct sock *sk,
> +				   struct smap_psock *psock,
> +				   struct sk_msg_buff *md)
> +{
> +	struct bpf_prog *prog;
> +	unsigned int rc, _rc;
> +
> +	preempt_disable();
> +	rcu_read_lock();
> +
> +	/* If the policy was removed mid-send then default to 'accept' */
> +	prog = READ_ONCE(psock->bpf_tx_msg);
> +	if (unlikely(!prog)) {
> +		_rc = SK_PASS;
> +		goto verdict;
> +	}
> +
> +	bpf_compute_data_pointers_sg(md);
> +	_rc = (*prog->bpf_func)(md, prog->insnsi);
> +
> +verdict:
> +	rcu_read_unlock();
> +	preempt_enable();
> +
> +	/* Moving return codes from UAPI namespace into internal namespace */
> +	rc = ((_rc == SK_PASS) ?
> +	      (md->map ? __SK_REDIRECT : __SK_PASS) :
> +	      __SK_DROP);
> +
> +	return rc;
> +}
> +
> +static int bpf_tcp_sendmsg_do_redirect(struct scatterlist *sg, int sg_num,
> +				       struct sk_msg_buff *md, int flags)
> +{
> +	int i, sg_curr = 0, err, free;
> +	struct smap_psock *psock;
> +	struct sock *sk;
> +
> +	rcu_read_lock();
> +	sk = do_msg_redirect_map(md);
> +	if (unlikely(!sk))
> +		goto out_rcu;
> +
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock))
> +		goto out_rcu;
> +
> +	if (!refcount_inc_not_zero(&psock->refcnt))
> +		goto out_rcu;
> +
> +	rcu_read_unlock();
> +	lock_sock(sk);
> +	err = bpf_tcp_push(sk, sg, &sg_curr, flags, false);
> +	if (unlikely(err))
> +		goto out;
> +	release_sock(sk);
> +	smap_release_sock(psock, sk);
> +	return 0;
> +out_rcu:
> +	rcu_read_unlock();
> +out:
> +	for (i = sg_curr; i < sg_num; ++i) {
> +		free += sg[i].length;
> +		put_page(sg_page(&sg[i]));
> +	}
> +	return free;

error path keeps rcu_lock and sk locked?
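For illustration only (untested fragment, not a suggested final form):
keeping the two error paths separate, so each undoes exactly what it
acquired and 'free' is initialized, would look roughly like:

  out_rcu:
          rcu_read_unlock();
          goto drop;
  out:
          release_sock(sk);
          smap_release_sock(psock, sk);
  drop:
          for (free = 0, i = sg_curr; i < sg_num; ++i) {
                  free += sg[i].length;
                  put_page(sg_page(&sg[i]));
          }
          return free;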

> +}
> +
> +static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> +	int err = 0, eval = __SK_NONE, sg_size = 0, sg_num = 0;
> +	int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
> +	struct sk_msg_buff md = {0};
> +	struct smap_psock *psock;
> +	size_t copy, copied = 0;
> +	struct scatterlist *sg;
> +	long timeo;
> +
> +	sg = md.sg_data;
> +	sg_init_table(sg, MAX_SKB_FRAGS);
> +
> +	/* It's possible a sock event or user removed the psock _but_ the ops
> +	 * have not been reprogrammed yet so we get here. In this case fall back
> +	 * to tcp_sendmsg. Note this only works because we _only_ ever allow
> +	 * a single ULP; there is no hierarchy here.
> +	 */
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return tcp_sendmsg(sk, msg, size);
> +	}
> +
> +	/* Increment the psock refcnt to ensure it's not released while sending a
> +	 * message. Required because sk lookup and bpf programs are used in
> +	 * separate rcu critical sections. It's OK if we lose the map entry
> +	 * but we can't lose the sock reference, possible when the refcnt hits
> +	 * zero and garbage collection calls sock_put().
> +	 */
> +	if (!refcount_inc_not_zero(&psock->refcnt)) {
> +		rcu_read_unlock();
> +		return tcp_sendmsg(sk, msg, size);
> +	}
> +
> +	rcu_read_unlock();
> +
> +	lock_sock(sk);
> +	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +
> +	while (msg_data_left(msg)) {
> +		int sg_curr;
> +
> +		if (sk->sk_err) {
> +			err = sk->sk_err;
> +			goto out_err;
> +		}
> +
> +		copy = msg_data_left(msg);
> +		if (!sk_stream_memory_free(sk))
> +			goto wait_for_sndbuf;
> +
> +		/* sg_size indicates bytes already allocated and sg_num
> +		 * is last sg element used. This is used when alloc_sg
> +		 * partially allocates a scatterlist and then is sent
> +		 * to wait for memory. In the normal case (no memory pressure)
> +		 * both sg_num and sg_size are zero.
> +		 */
> +		copy = copy - sg_size;
> +		err = sk_alloc_sg(sk, copy, sg, &sg_num, &sg_size, 0);
> +		if (err) {
> +			if (err != -ENOSPC)
> +				goto wait_for_memory;
> +			copy = sg_size;
> +		}
> +
> +		err = memcopy_from_iter(sk, sg, sg_num, &msg->msg_iter, copy);
> +		if (err < 0) {
> +			free_sg(sk, sg, 0, sg_num);
> +			goto out_err;
> +		}
> +
> +		copied += copy;
> +
> +		/* If msg is larger than MAX_SKB_FRAGS we can send multiple
> +		 * scatterlists per msg. However BPF decisions apply to the
> +		 * entire msg.
> +		 */
> +		if (eval == __SK_NONE)
> +			eval = smap_do_tx_msg(sk, psock, &md);

it seems sk_alloc_sg() will put 64k bytes into sg_data,
but this program will see only first SG ?
and it's typically going to be one page only ?
then what's the value of waiting for MAX_SKB_FRAGS ?

> +
> +		switch (eval) {
> +		case __SK_PASS:
> +			sg_mark_end(sg + sg_num - 1);
> +			err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
> +			if (unlikely(err)) {
> +				copied -= free_sg(sk, sg, sg_curr, sg_num);
> +				goto out_err;
> +			}
> +			break;
> +		case __SK_REDIRECT:
> +			sg_mark_end(sg + sg_num - 1);
> +			goto do_redir;
> +		case __SK_DROP:
> +		default:
> +			copied -= free_sg(sk, sg, 0, sg_num);
> +			goto out_err;
> +		}
> +
> +		sg_num = 0;
> +		sg_size = 0;
> +		continue;
> +wait_for_sndbuf:
> +		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> +wait_for_memory:
> +		err = sk_stream_wait_memory(sk, &timeo);
> +		if (err)
> +			goto out_err;
> +	}
> +out_err:
> +	if (err < 0)
> +		err = sk_stream_error(sk, msg->msg_flags, err);
> +	release_sock(sk);
> +	smap_release_sock(psock, sk);
> +	return copied ? copied : err;
> +
> +do_redir:
> +	/* To avoid deadlock with multiple socks all doing redirects to
> +	 * each other we must first drop the current sock lock and release
> +	 * the psock. Then get the redirect socket (assuming it still
> +	 * exists), take its lock, and finally do the send here. If the
> +	 * redirect fails there is nothing to do, we don't want to blame
> +	 * the sender for remote socket failures. Instead we simply
> +	 * continue making forward progress.
> +	 */
> +	return_mem_sg(sk, sg, sg_num);
> +	release_sock(sk);
> +	smap_release_sock(psock, sk);
> +	copied -= bpf_tcp_sendmsg_do_redirect(sg, sg_num, &md, flags);
> +	return copied;
> +}
> +
> +static int bpf_tcp_sendpage_do_redirect(struct page *page, int offset,
> +					size_t size, int flags,
> +					struct sk_msg_buff *md)
> +{
> +	struct smap_psock *psock;
> +	struct sock *sk;
> +	int rc;
> +
> +	rcu_read_lock();
> +	sk = do_msg_redirect_map(md);
> +	if (unlikely(!sk))
> +		goto out_rcu;
> +
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock))
> +		goto out_rcu;
> +
> +	if (!refcount_inc_not_zero(&psock->refcnt))
> +		goto out_rcu;
> +
> +	rcu_read_unlock();
> +
> +	lock_sock(sk);
> +	rc = tcp_sendpage_locked(sk, page, offset, size, flags);
> +	release_sock(sk);
> +
> +	smap_release_sock(psock, sk);
> +	return rc;
> +out_rcu:
> +	rcu_read_unlock();
> +	return -EINVAL;
> +}
> +
> +static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
> +			    int offset, size_t size, int flags)
> +{
> +	struct smap_psock *psock;
> +	int rc, _rc = __SK_PASS;
> +	struct bpf_prog *prog;
> +	struct sk_msg_buff md;
> +
> +	preempt_disable();
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock))
> +		goto verdict;
> +
> +	/* If the policy was removed mid-send then default to 'accept' */
> +	prog = READ_ONCE(psock->bpf_tx_msg);
> +	if (unlikely(!prog))
> +		goto verdict;
> +
> +	/* Calculate pkt data pointers and run BPF program */
> +	md.data = page_address(page) + offset;
> +	md.data_end = md.data + size;
> +	_rc = (*prog->bpf_func)(&md, prog->insnsi);
> +
> +verdict:
> +	rcu_read_unlock();
> +	preempt_enable();
> +
> +	/* Moving return codes from UAPI namespace into internal namespace */
> +	rc = ((_rc == SK_PASS) ? __SK_PASS : __SK_DROP);
> +
> +	switch (rc) {
> +	case __SK_PASS:
> +		lock_sock(sk);
> +		rc = tcp_sendpage_locked(sk, page, offset, size, flags);
> +		release_sock(sk);
> +		break;
> +	case __SK_REDIRECT:
> +		smap_release_sock(psock, sk);
> +		rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
> +						  &md);

looks like this path wasn't tested,
since above rc = ...; line cannot return REDIRECT...
probably should be common helper for both tcp_bpf_*() funcs
to call into bpf and convert rc.
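
To sketch the idea (not from the thread): such a shared helper would be
essentially smap_do_tx_msg() above minus the sg-specific pointer setup,
so the sendpage path could also produce __SK_REDIRECT. Roughly, assuming
the caller zero-inits md and sets md->data/md->data_end for its case:

  static int bpf_tcp_msg_verdict(struct smap_psock *psock,
                                 struct sk_msg_buff *md)
  {
          struct bpf_prog *prog;
          unsigned int verdict;

          preempt_disable();
          rcu_read_lock();

          /* policy removed mid-send: default to accept */
          prog = READ_ONCE(psock->bpf_tx_msg);
          if (unlikely(!prog)) {
                  verdict = SK_PASS;
                  goto out;
          }

          verdict = (*prog->bpf_func)(md, prog->insnsi);
  out:
          rcu_read_unlock();
          preempt_enable();

          /* move UAPI return codes into the internal namespace */
          return verdict == SK_PASS ?
                 (md->map ? __SK_REDIRECT : __SK_PASS) : __SK_DROP;
  }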

> +		break;
> +	case __SK_DROP:
> +	default:
> +		rc = -EACCES;
> +	}
> +
> +	return rc;
> +}
> +
> +static int bpf_tcp_msg_add(struct smap_psock *psock,
> +			   struct sock *sk,
> +			   struct bpf_prog *tx_msg)
> +{
> +	struct bpf_prog *orig_tx_msg;
> +
> +	orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
> +	if (orig_tx_msg)
> +		bpf_prog_put(orig_tx_msg);

the function is replacing the program. why is it called bpf_tcp_msg_add ?

> +
> +	return tcp_set_ulp_id(sk, TCP_ULP_BPF);
> +}
> +
> +struct proto tcp_bpf_proto;
> +static int bpf_tcp_init(struct sock *sk)
> +{
> +	sk->sk_prot = &tcp_bpf_proto;
> +	return 0;
> +}
> +
> +static void bpf_tcp_release(struct sock *sk)
> +{
> +	sk->sk_prot = &tcp_prot;
> +}
> +
> +static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
> +	.name			= "bpf_tcp",
> +	.uid			= TCP_ULP_BPF,
> +	.owner			= NULL,
> +	.init			= bpf_tcp_init,
> +	.release		= bpf_tcp_release,
> +};
> +
> +static int bpf_tcp_ulp_register(void)
> +{
> +	tcp_bpf_proto = tcp_prot;
> +	tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> +	tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
> +	return tcp_register_ulp(&bpf_tcp_ulp_ops);

I don't see a corresponding tcp_unregister_ulp().

> +}
> +
>  static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
>  {
>  	struct bpf_prog *prog = READ_ONCE(psock->bpf_verdict);
> @@ -165,8 +601,6 @@ static void smap_report_sk_error(struct smap_psock *psock, int err)
>  	sk->sk_error_report(sk);
>  }
>  
> -static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
> -
>  /* Called with lock_sock(sk) held */
>  static void smap_state_change(struct sock *sk)
>  {
> @@ -317,6 +751,7 @@ static void smap_write_space(struct sock *sk)
>  
>  static void smap_stop_sock(struct smap_psock *psock, struct sock *sk)
>  {
> +	tcp_cleanup_ulp(sk);
>  	if (!psock->strp_enabled)
>  		return;
>  	sk->sk_data_ready = psock->save_data_ready;
> @@ -384,7 +819,6 @@ static int smap_parse_func_strparser(struct strparser *strp,
>  	return rc;
>  }
>  
> -
>  static int smap_read_sock_done(struct strparser *strp, int err)
>  {
>  	return err;
> @@ -456,6 +890,8 @@ static void smap_gc_work(struct work_struct *w)
>  		bpf_prog_put(psock->bpf_parse);
>  	if (psock->bpf_verdict)
>  		bpf_prog_put(psock->bpf_verdict);
> +	if (psock->bpf_tx_msg)
> +		bpf_prog_put(psock->bpf_tx_msg);
>  
>  	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
>  		list_del(&e->list);
> @@ -491,8 +927,7 @@ static struct smap_psock *smap_init_psock(struct sock *sock,
>  
>  static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
>  {
> -	struct bpf_stab *stab;
> -	int err = -EINVAL;
> +	struct bpf_stab *stab; int err = -EINVAL;
>  	u64 cost;
>  
>  	if (!capable(CAP_NET_ADMIN))
> @@ -506,6 +941,10 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
>  	if (attr->value_size > KMALLOC_MAX_SIZE)
>  		return ERR_PTR(-E2BIG);
>  
> +	err = bpf_tcp_ulp_register();
> +	if (err && err != -EEXIST)
> +		return ERR_PTR(err);
> +
>  	stab = kzalloc(sizeof(*stab), GFP_USER);
>  	if (!stab)
>  		return ERR_PTR(-ENOMEM);
> @@ -590,6 +1029,8 @@ static void sock_map_free(struct bpf_map *map)
>  		bpf_prog_put(stab->bpf_verdict);
>  	if (stab->bpf_parse)
>  		bpf_prog_put(stab->bpf_parse);
> +	if (stab->bpf_tx_msg)
> +		bpf_prog_put(stab->bpf_tx_msg);
>  
>  	sock_map_remove_complete(stab);
>  }
> @@ -684,7 +1125,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  {
>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
>  	struct smap_psock_map_entry *e = NULL;
> -	struct bpf_prog *verdict, *parse;
> +	struct bpf_prog *verdict, *parse, *tx_msg;
>  	struct sock *osock, *sock;
>  	struct smap_psock *psock;
>  	u32 i = *(u32 *)key;
> @@ -710,6 +1151,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  	 */
>  	verdict = READ_ONCE(stab->bpf_verdict);
>  	parse = READ_ONCE(stab->bpf_parse);
> +	tx_msg = READ_ONCE(stab->bpf_tx_msg);
>  
>  	if (parse && verdict) {
>  		/* bpf prog refcnt may be zero if a concurrent attach operation
> @@ -728,6 +1170,17 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
>  		}
>  	}
>  
> +	if (tx_msg) {
> +		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);

prog_inc_not_zero() looks scary here.
Why is 'not_zero' necessary ?

> +		if (IS_ERR(tx_msg)) {
> +			if (verdict)
> +				bpf_prog_put(verdict);
> +			if (parse)
> +				bpf_prog_put(parse);
> +			return PTR_ERR(tx_msg);
> +		}
> +	}
> +
