lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 12 Dec 2019 19:23:58 +0000
From:   Martin Lau <kafai@...com>
To:     unlisted-recipients:; (no To-header on input)
CC:     John Fastabend <john.fastabend@...il.com>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "kernel-team@...udflare.com" <kernel-team@...udflare.com>
Subject: Re: [PATCH bpf-next 4/8] bpf, sockmap: Don't let child socket inherit
 psock or its ops on copy

On Thu, Dec 12, 2019 at 12:27:19PM +0100, Jakub Sitnicki wrote:
> On Wed, Dec 11, 2019 at 06:20 PM CET, Martin Lau wrote:
> > On Tue, Dec 10, 2019 at 03:45:37PM +0100, Jakub Sitnicki wrote:
> >> John, Martin,
> >>
> >> On Tue, Nov 26, 2019 at 07:36 PM CET, Jakub Sitnicki wrote:
> >> > On Tue, Nov 26, 2019 at 06:16 PM CET, Martin Lau wrote:
> >> >> On Tue, Nov 26, 2019 at 04:54:33PM +0100, Jakub Sitnicki wrote:
> >> >>> On Mon, Nov 25, 2019 at 11:38 PM CET, Martin Lau wrote:
> >> >>> > On Sat, Nov 23, 2019 at 12:07:47PM +0100, Jakub Sitnicki wrote:
> >> >>> > [ ... ]
> >> >>> >
> >> >>> >> @@ -370,6 +378,11 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >> >>> >>  			sk->sk_prot = psock->sk_proto;
> >> >>> >>  		psock->sk_proto = NULL;
> >> >>> >>  	}
> >> >>> >> +
> >> >>> >> +	if (psock->icsk_af_ops) {
> >> >>> >> +		icsk->icsk_af_ops = psock->icsk_af_ops;
> >> >>> >> +		psock->icsk_af_ops = NULL;
> >> >>> >> +	}
> >> >>> >>  }
> >> >>> >
> >> >>> > [ ... ]
> >> >>> >
> >> >>> >> +static struct sock *tcp_bpf_syn_recv_sock(const struct sock *sk,
> >> >>> >> +					  struct sk_buff *skb,
> >> >>> >> +					  struct request_sock *req,
> >> >>> >> +					  struct dst_entry *dst,
> >> >>> >> +					  struct request_sock *req_unhash,
> >> >>> >> +					  bool *own_req)
> >> >>> >> +{
> >> >>> >> +	const struct inet_connection_sock_af_ops *ops;
> >> >>> >> +	void (*write_space)(struct sock *sk);
> >> >>> >> +	struct sk_psock *psock;
> >> >>> >> +	struct proto *proto;
> >> >>> >> +	struct sock *child;
> >> >>> >> +
> >> >>> >> +	rcu_read_lock();
> >> >>> >> +	psock = sk_psock(sk);
> >> >>> >> +	if (likely(psock)) {
> >> >>> >> +		proto = psock->sk_proto;
> >> >>> >> +		write_space = psock->saved_write_space;
> >> >>> >> +		ops = psock->icsk_af_ops;
> >> >>> > It is not immediately clear to me what ensure
> >> >>> > ops is not NULL here.
> >> >>> >
> >> >>> > It is likely I missed something.  A short comment would
> >> >>> > be very useful here.
> >> >>>
> >> >>> I can see the readability problem. Looking at it now, perhaps it should
> >> >>> be rewritten, to the same effect, as:
> >> >>>
> >> >>> static struct sock *tcp_bpf_syn_recv_sock(...)
> >> >>> {
> >> >>> 	const struct inet_connection_sock_af_ops *ops = NULL;
> >> >>>         ...
> >> >>>
> >> >>>         rcu_read_lock();
> >> >>> 	psock = sk_psock(sk);
> >> >>> 	if (likely(psock)) {
> >> >>> 		proto = psock->sk_proto;
> >> >>> 		write_space = psock->saved_write_space;
> >> >>> 		ops = psock->icsk_af_ops;
> >> >>> 	}
> >> >>> 	rcu_read_unlock();
> >> >>>
> >> >>>         if (!ops)
> >> >>> 		ops = inet_csk(sk)->icsk_af_ops;
> >> >>>         child = ops->syn_recv_sock(sk, skb, req, dst, req_unhash, own_req);
> >> >>>
> >> >>> If psock->icsk_af_ops were NULL, it would mean we haven't initialized it
> >> >>> properly. To double check what happens here:
> >> >> I did not mean the init path.  The init path is fine since it init
> >> >> eveything on psock before publishing the sk to the sock_map.
> >> >>
> >> >> I was thinking the delete path (e.g. sock_map_delete_elem).  It is not clear
> >> >> to me what prevent the earlier pasted sk_psock_restore_proto() which sets
> >> >> psock->icsk_af_ops to NULL from running in parallel with
> >> >> tcp_bpf_syn_recv_sock()?  An explanation would be useful.
> >> >
> >> > Ah, I misunderstood. Nothing prevents the race, AFAIK.
> >> >
> >> > Setting psock->icsk_af_ops to null on restore and not checking for it
> >> > here was a bad move on my side.  Also I need to revisit what to do about
> >> > psock->sk_proto so the child socket doesn't end up with null sk_proto.
> >> >
> >> > This race should be easy enough to trigger. Will give it a shot.
> >>
> >> I've convinced myself that this approach is racy beyond repair.
> >>
> >> Once syn_recv_sock() has returned it is too late to reset the child
> >> sk_user_data and restore its callbacks. It has been already inserted
> >> into ehash and ingress path can invoke its callbacks.
> >>
> >> The race can be triggered with with a reproducer where:
> >>
> >> thread-1:
> >>
> >>         p = accept(s, ...);
> >>         close(p);
> >>
> >> thread-2:
> >>
> >> 	bpf_map_update_elem(mapfd, &key, &s, BPF_NOEXIST);
> >> 	bpf_map_delete_elem(mapfd, &key);
> >>
> >> This a dead-end because we can't have the parent and the child share the
> >> psock state. Even though psock itself is refcounted, and potentially we
> >> could grab a reference before cloning the parent, link into the map that
> >> psock holds is not.
> >>
> >> Two ways out come to mind. Both involve touching TCP code, which I was
> >> hoping to avoid:
> >>
> >> 1) reset sk_user_data when initializing the child
> >>
> >>    This is problematic because tcp_bpf callbacks are not designed to
> >>    handle sockets with no psock _and_ with overridden sk_prot
> >>    callbacks. (Although, I think they could if the fallback was directly
> >>    on {tcp,tcpv6}_prot based on socket domain.)
> >>
> >>    Also, there are other sk_user_data users like DRBD which rely on
> >>    sharing the sk_user_data pointer between parent and child, if I read
> >>    the code correctly [0]. If anything, clearing the sk_user_data on
> >>    clone would have to be guarded by a flag.
> > Can the copy/not-to-copy sk_user_data decision be made in
> > sk_clone_lock()?
> 
> Yes, this could be pushed down to sk_clone_lock(), where we do similar
> work (reset sk_reuseport_cb and clone bpf_sk_storage):
aha.  I missed your eariler "clearing the sk_user_data on clone would have
to be guarded by a flag..." part.  It turns out we were talking the same
thing on (1).  sock_flag works better if there is still bit left (and it
seems there is one),  although I was thinking more like adding
something (e.g. a func ptr) to 'struct proto' to mangle sk_user_data
before returning newsk....but not sure this kind of logic
belongs to 'struct proto'

> 
> 	/* User data can hold reference. Child must not
> 	 * inherit the pointer without acquiring a reference.
> 	 */
> 	if (sock_flag(sk, SOCK_OWNS_USER_DATA)) {
> 		sock_reset_flag(newsk, SOCK_OWNS_USER_DATA);
> 		RCU_INIT_POINTER(newsk->sk_user_data, NULL);
> 	}
> 
> I belive this would still need to be guarded by a flag.  Do you see
> value in clearing child sk_user_data on clone as opposed to dealying
> that work until accept() time?
It seems to me clearing things up front at the very beginning is more
straight forward, such that it does not have to worry about the
sk_user_data may be used in a wrong way before it gets a chance
to be cleared in accept().

Just something to consider, if it is obvious that there is no hole in
clearing it in accept(), it is fine too.

> >>
> >> 2) Restore sk_prot callbacks on clone to {tcp,tcpv6}_prot
> >>
> >>    The simpler way out. tcp_bpf callbacks never get invoked on the child
> >>    socket so the copied psock reference is no longer a problem. We can
> >>    clear the pointer on accept().
> >>
> >>    So far I wasn't able poke any holes in it and it comes down to
> >>    patching tcp_create_openreq_child() with:
> >>
> >> 	/* sk_msg and ULP frameworks can override the callbacks into
> >> 	 * protocol. We don't assume they are intended to be inherited
> >> 	 * by the child. Frameworks can re-install the callbacks on
> >> 	 * accept() if needed.
> >> 	 */
> >> 	WRITE_ONCE(newsk->sk_prot, sk->sk_prot_creator);
> >>
> >>    That's what I'm going with for v2.
> >>
> >> Open to suggestions.
> >>
> >> Thanks,
> >> Jakub
> >>
> >> BTW. Reading into kTLS code, I noticed it has been limited down to just
> >> established sockets due to the same problem I'm struggling with here:
> >>
> >> static int tls_init(struct sock *sk)
> >> {
> >> ...
> >> 	/* The TLS ulp is currently supported only for TCP sockets
> >> 	 * in ESTABLISHED state.
> >> 	 * Supporting sockets in LISTEN state will require us
> >> 	 * to modify the accept implementation to clone rather then
> >> 	 * share the ulp context.
> >> 	 */
> >> 	if (sk->sk_state != TCP_ESTABLISHED)
> >> 		return -ENOTCONN;
> >>
> >> [0] https://urldefense.proofpoint.com/v2/url?u=https-3A__elixir.bootlin.com_linux_v5.5-2Drc1_source_drivers_block_drbd_drbd-5Freceiver.c-23L682&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=z2Cz1gEcqiw-8YqVOluxlUHh_CBs6PJWQN2vgirOyFk&s=WAiM0asZN0OkqrW02xm2mCMIzWhKQCc3KiY7pzMKNg4&e=

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ