netdev - Re: [PATCH net-next v3 13/24] ovpn: implement TCP transport

Message-ID: <2ddf759d-378f-475c-8fc1-30c6e83c2d14@openvpn.net>
Date: Wed, 15 May 2024 00:11:28 +0200
From: Antonio Quartulli <antonio@...nvpn.net>
To: Sabrina Dubroca <sd@...asysnail.net>
Cc: netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
 Sergey Ryazanov <ryazanov.s.a@...il.com>, Paolo Abeni <pabeni@...hat.com>,
 Eric Dumazet <edumazet@...gle.com>, Andrew Lunn <andrew@...n.ch>,
 Esben Haabendal <esben@...nix.com>
Subject: Re: [PATCH net-next v3 13/24] ovpn: implement TCP transport

On 14/05/2024 10:58, Sabrina Dubroca wrote:
>>>> diff --git a/drivers/net/ovpn/peer.h b/drivers/net/ovpn/peer.h
>>>> index b5ff59a4b40f..ac4907705d98 100644
>>>> --- a/drivers/net/ovpn/peer.h
>>>> +++ b/drivers/net/ovpn/peer.h
>>>> + * @tcp.raw_len: next packet length as read from the stream (TCP only)
>>>> + * @tcp.skb: next packet being filled with data from the stream (TCP only)
>>>> + * @tcp.offset: position of the next byte to write in the skb (TCP only)
>>>> + * @tcp.data_len: next packet length converted to host order (TCP only)
>>>
>>> It would be nice to add information about whether they're used for TX or RX.
>>
>> they are all about "from the stream" and "to the skb", meaning that we are
>> doing RX.
>> Will make it more explicit.
> 
> Maybe group them in a struct rx?

Yep, makes sense.

> 
>>>> + * @tcp.sk_cb.sk_data_ready: pointer to original cb
>>>> + * @tcp.sk_cb.sk_write_space: pointer to original cb
>>>> + * @tcp.sk_cb.prot: pointer to original prot object
>>>>     * @crypto: the crypto configuration (ciphers, keys, etc..)
>>>>     * @dst_cache: cache for dst_entry used to send to peer
>>>>     * @bind: remote peer binding
>>>> @@ -59,6 +69,25 @@ struct ovpn_peer {
>>>>    	struct ptr_ring netif_rx_ring;
>>>>    	struct napi_struct napi;
>>>>    	struct ovpn_socket *sock;
>>>> +	/* state of the TCP reading. Needed to keep track of how much of a
>>>> +	 * single packet has already been read from the stream and how much is
>>>> +	 * missing
>>>> +	 */
>>>> +	struct {
>>>> +		struct ptr_ring tx_ring;
>>>> +		struct work_struct tx_work;
>>>> +		struct work_struct rx_work;
>>>> +
>>>> +		u8 raw_len[sizeof(u16)];
>>>
>>> Why not u16 or __be16 for this one?
>>
>> because in this array we are putting the bytes as we get them from the
>> stream.
>> We may be at the point where one out of two bytes is available on the
>> stream. For this reason I use an array to store this u16 byte by byte.
>>
>> Once the two bytes are ready, we convert the content into an actual integer
>> and store it in "data_len" (a few lines below).
> 
> Ok, I see. Hopefully you can switch to strparser and make this one go
> away.
> 
> 
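
To make the byte-by-byte accumulation concrete, here is a minimal userspace sketch of the idea. The names (struct len_rx_state, len_rx_feed) are hypothetical and are not taken from the patch; it only shows why an array is used until both prefix bytes have arrived.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the RX length state kept per peer. */
struct len_rx_state {
	uint8_t raw_len[sizeof(uint16_t)]; /* prefix bytes as they arrive */
	size_t offset;                     /* number of prefix bytes held so far */
	uint16_t data_len;                 /* valid only once offset == 2 */
};

/* Consume at most the missing prefix bytes from buf.
 * Returns the number of bytes consumed. *done is set to 1 once both
 * bytes are present and data_len has been computed, 0 otherwise.
 */
static size_t len_rx_feed(struct len_rx_state *st, const uint8_t *buf,
			  size_t len, int *done)
{
	size_t used = 0;

	assert(st && buf && done);

	while (st->offset < sizeof(st->raw_len) && used < len)
		st->raw_len[st->offset++] = buf[used++];

	*done = st->offset == sizeof(st->raw_len);
	if (*done)
		/* network (big-endian) order to host order */
		st->data_len = (uint16_t)((st->raw_len[0] << 8) | st->raw_len[1]);

	return used;
}
```

Feeding 0x01 in one call and 0x2c in the next leaves data_len at 300 only after the second call, which is the case the array exists for.
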
>>>> diff --git a/drivers/net/ovpn/socket.c b/drivers/net/ovpn/socket.c
>>>> index e099a61b03fa..004db5b13663 100644
>>>> --- a/drivers/net/ovpn/socket.c
>>>> +++ b/drivers/net/ovpn/socket.c
>>>> @@ -16,6 +16,7 @@
>>>>    #include "packet.h"
>>>>    #include "peer.h"
>>>>    #include "socket.h"
>>>> +#include "tcp.h"
>>>>    #include "udp.h"
>>>>    /* Finalize release of socket, called after RCU grace period */
>>>> @@ -26,6 +27,8 @@ static void ovpn_socket_detach(struct socket *sock)
>>>>    	if (sock->sk->sk_protocol == IPPROTO_UDP)
>>>>    		ovpn_udp_socket_detach(sock);
>>>> +	else if (sock->sk->sk_protocol == IPPROTO_TCP)
>>>> +		ovpn_tcp_socket_detach(sock);
>>>>    	sockfd_put(sock);
>>>>    }
>>>> @@ -69,6 +72,8 @@ static int ovpn_socket_attach(struct socket *sock, struct ovpn_peer *peer)
>>>>    	if (sock->sk->sk_protocol == IPPROTO_UDP)
>>>>    		ret = ovpn_udp_socket_attach(sock, peer->ovpn);
>>>> +	else if (sock->sk->sk_protocol == IPPROTO_TCP)
>>>> +		ret = ovpn_tcp_socket_attach(sock, peer);
>>>>    	return ret;
>>>>    }
>>>> @@ -124,6 +129,21 @@ struct ovpn_socket *ovpn_socket_new(struct socket *sock, struct ovpn_peer *peer)
>>>>    	ovpn_sock->sock = sock;
>>>
>>> The line above this is:
>>>
>>>       ovpn_sock->ovpn = peer->ovpn;
>>>
>>> It's technically fine since you then overwrite this with peer in case
>>> we're on TCP, but ovpn_sock->ovpn only exists on UDP since you moved
>>> it into a union in this patch.
>>
>> Yeah, I did not want to make another branch, but having a UDP specific case
>> will make code easier to read.
> 
> Either that, or drop the union.

ACK

> 
> 
>>>> diff --git a/drivers/net/ovpn/tcp.c b/drivers/net/ovpn/tcp.c
>>>> new file mode 100644
>>>> index 000000000000..84ad7cd4fc4f
>>>> --- /dev/null
>>>> +++ b/drivers/net/ovpn/tcp.c
>>>> @@ -0,0 +1,511 @@
>>>> +static int ovpn_tcp_read_sock(read_descriptor_t *desc, struct sk_buff *in_skb,
>>>> +			      unsigned int in_offset, size_t in_len)
>>>> +{
>>>> +	struct sock *sk = desc->arg.data;
>>>> +	struct ovpn_socket *sock;
>>>> +	struct ovpn_skb_cb *cb;
>>>> +	struct ovpn_peer *peer;
>>>> +	size_t chunk, copied = 0;
>>>> +	void *data;
>>>> +	u16 len;
>>>> +	int st;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	sock = rcu_dereference_sk_user_data(sk);
>>>> +	rcu_read_unlock();
>>>
>>> You can't just release rcu_read_lock and keep using sock (here and in
>>> the rest of this file). Either you keep rcu_read_lock, or you can take
>>> a reference on the ovpn_socket.
>>
>> I was just staring at this today, after having worked on the
>> rcu_read_lock/unlock for the peer get()s..
>>
>> I think the assumption was: if we are in this read_sock callback, it's
>> impossible that the ovpn_socket was invalidated, because it gets invalidated
>> upon detach, which also prevents any further calling of this callback. But
>> this sounds racy, and I guess we should somehow hold a reference.
> 
> ovpn_tcp_read_sock starts
> 
> detach
> kfree_rcu(ovpn_socket)
> ...
> ovpn_socket actually freed
> ...
> ovpn_tcp_read_sock continues with freed ovpn_socket
> 
> 
> I don't think anything in the current code prevents this.

mh yeah, if something like this happens right after having started the 
read_sock we are doomed.
Will fix this.
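
For reference, a rough userspace sketch of the "take a reference unless it already dropped to zero" idea, using C11 atomics. In the kernel this would presumably be done with kref_get_unless_zero() while still inside the RCU read-side section; the struct and function names below are hypothetical and only illustrate the counting logic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object that detach may release concurrently. */
struct sock_ref {
	atomic_int refcount; /* 0 means the last reference is already gone */
};

static void sock_ref_init(struct sock_ref *s)
{
	assert(s);
	atomic_init(&s->refcount, 1); /* reference held by the owner */
}

/* Take a reference only if the object is still live. */
static bool sock_ref_get_unless_zero(struct sock_ref *s)
{
	int old;

	assert(s);
	old = atomic_load(&s->refcount);
	while (old != 0) {
		/* on failure, old is reloaded with the current value */
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true if the caller dropped the last one
 * and is therefore responsible for freeing the object.
 */
static bool sock_ref_put(struct sock_ref *s)
{
	assert(s);
	return atomic_fetch_sub(&s->refcount, 1) == 1;
}
```

With this, a reader that obtained a reference before detach keeps the object alive until its own put, and a reader that arrives after the last put is refused instead of touching freed memory.
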


> 
> 
>>>> +/* Set TCP encapsulation callbacks */
>>>> +int ovpn_tcp_socket_attach(struct socket *sock, struct ovpn_peer *peer)
>>>> +{
>>>> +	void *old_data;
>>>> +	int ret;
>>>> +
>>>> +	INIT_WORK(&peer->tcp.tx_work, ovpn_tcp_tx_work);
>>>> +
>>>> +	ret = ptr_ring_init(&peer->tcp.tx_ring, OVPN_QUEUE_LEN, GFP_KERNEL);
>>>> +	if (ret < 0) {
>>>> +		netdev_err(peer->ovpn->dev, "cannot allocate TCP TX ring\n");
>>>> +		return ret;
>>>> +	}
>>>> +
>>>> +	peer->tcp.skb = NULL;
>>>> +	peer->tcp.offset = 0;
>>>> +	peer->tcp.data_len = 0;
>>>> +
>>>> +	write_lock_bh(&sock->sk->sk_callback_lock);
>>>> +
>>>> +	/* make sure no pre-existing encapsulation handler exists */
>>>> +	rcu_read_lock();
>>>> +	old_data = rcu_dereference_sk_user_data(sock->sk);
>>>> +	rcu_read_unlock();
>>>> +	if (old_data) {
>>>> +		netdev_err(peer->ovpn->dev,
>>>> +			   "provided socket already taken by other user\n");
>>>> +		ret = -EBUSY;
>>>> +		goto err;
>>>
>>> The UDP code differentiates "socket already owned by this interface"
>>> from "already taken by other user". That doesn't apply to TCP?
>>
>> This makes me wonder: how safe is it to interpret the user data as an object
>> of type ovpn_socket?
>>
>> When we find the user data already assigned, we don't know what was really
>> stored in there, right?
>> Technically this socket could have gone through another module which
>> assigned its own state.
>>
>> Therefore I think that what UDP does [ dereferencing ((struct ovpn_socket
>> *)user_data)->ovpn ] is probably not safe. Would you agree?
> 
> Hmmm, yeah, I think you're right. If you checked encap_type ==
> UDP_ENCAP_OVPNINUDP before (sk_prot for TCP), then you'd know it's
> really your data. Basically call ovpn_from_udp_sock during attach if
> you want to check something beyond EBUSY.

right. Maybe we can live with simply reporting EBUSY and be done with
it, without adding extra checks and whatnot.

> 
> Once you're in your own callbacks, it should be safe. If some other
> code sends packet with a non-ovpn socket to ovpn's ->encap_rcv,
> something is really broken.

yup

> 
>>>> +int __init ovpn_tcp_init(void)
>>>> +{
>>>> +	/* We need to substitute the recvmsg and the sock_is_readable
>>>> +	 * callbacks in the sk_prot member of the sock object for TCP
>>>> +	 * sockets.
>>>> +	 *
>>>> +	 * However sock->sk_prot is a pointer to a static variable and
>>>> +	 * therefore we can't directly modify it, otherwise every socket
>>>> +	 * pointing to it will be affected.
>>>> +	 *
>>>> +	 * For this reason we create our own static copy and modify what
>>>> +	 * we need. Then we make sk_prot point to this copy
>>>> +	 * (in ovpn_tcp_socket_attach())
>>>> +	 */
>>>> +	ovpn_tcp_prot = tcp_prot;
>>>
>>> Don't you need a separate variant for IPv6, like TLS does?
>>
>> Never did so far.
>>
>> My wild wild wild guess: for the time this socket is owned by ovpn, we only
>> use callbacks that are IPvX agnostic, hence v4 vs v6 doesn't make any
>> difference.
>> When this socket is released, we reassign the original prot.
> 
> That seems a bit suspicious to me. For example, tcpv6_prot has a
> different backlog_rcv. And you don't control if the socket is detached
> before being closed, or which callbacks are needed. Your userspace
> client doesn't use them, but someone else's might.
> 
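
A reduced userspace sketch of the "one override per address family" approach mentioned above. struct proto_sketch and every function here are hypothetical stand-ins, not the kernel's struct proto or its callbacks; the point is only that each copy starts from its own family's template, so family-specific callbacks such as backlog_rcv are preserved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a protocol ops table. */
struct proto_sketch {
	int (*recvmsg)(void);
	int (*backlog_rcv)(void);
};

static int tcp_recvmsg_sketch(void) { return 0; }
static int tcp_v4_backlog_sketch(void) { return 4; }
static int tcp_v6_backlog_sketch(void) { return 6; }
static int ovpn_recvmsg_sketch(void) { return 1; }

/* Templates playing the role of the IPv4 and IPv6 TCP prot objects. */
static const struct proto_sketch tcp_prot_sketch = {
	tcp_recvmsg_sketch, tcp_v4_backlog_sketch
};
static const struct proto_sketch tcpv6_prot_sketch = {
	tcp_recvmsg_sketch, tcp_v6_backlog_sketch
};

static struct proto_sketch ovpn_prot_v4, ovpn_prot_v6;

/* Copy the family's template, then replace only what we need. */
static void build_override(struct proto_sketch *dst,
			   const struct proto_sketch *base)
{
	assert(dst && base);
	*dst = *base;
	dst->recvmsg = ovpn_recvmsg_sketch;
}

static void ovpn_prot_init_sketch(void)
{
	build_override(&ovpn_prot_v4, &tcp_prot_sketch);
	build_override(&ovpn_prot_v6, &tcpv6_prot_sketch);
}

/* Pick the override matching the socket's address family. */
static const struct proto_sketch *ovpn_prot_for(int is_ipv6)
{
	return is_ipv6 ? &ovpn_prot_v6 : &ovpn_prot_v4;
}
```
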
>>>> +	ovpn_tcp_prot.recvmsg = ovpn_tcp_recvmsg;
>>>
>>> You don't need to replace ->sendmsg as well? The userspace client is
>>> not expected to send messages?
>>
>> It is, but my assumption is that those packets will just go through the
>> socket as usual. No need to be handled by ovpn (those packets are not
>> encrypted/decrypted, like data traffic is).
>> And this is how it has worked so far.
>>
>> Makes sense?
> 
> Two things come to mind:
> 
> - userspace is expected to prefix the messages it inserts on the
>    stream with the 2-byte length field? otherwise, the peer won't be
>    able to parse them out of the stream

correct. userspace sends those packets as if ovpn is not running, 
therefore this happens naturally.

> 
> - I'm not convinced this would be safe wrt kernel writing partial
>    messages. if ovpn_tcp_send_one doesn't send the full message, you
>    could interleave two messages:
> 
>    +------+-------------------+------+--------+----------------+
>    | len1 | (bytes from msg1) | len2 | (msg2) | (rest of msg1) |
>    +------+-------------------+------+--------+----------------+
> 
>    and the RX side would parse that as:
> 
>    +------+-----------------------------------+------+---------
>    | len1 | (bytes from msg1) | len2 | (msg2) | ???? | ...
>    +------+-------------------+---------------+------+---------
> 
>    and try to interpret some random bytes out of either msg1 or msg2 as
>    a length prefix, resulting in a broken stream.

hm you are correct. If multiple sendmsg calls can overlap, then we might be
in trouble, but are we sure this can truly happen?
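
To see what the receiver would make of such a stream, here is a small userspace sketch that builds the interleaved layout from the diagram and walks it with a plain length-prefix parser. The message sizes (8 and 4 bytes) and all helper names are made up for illustration; whether the kernel can actually produce this interleaving is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append n bytes of value fill at pos; returns the new position. */
static size_t put_bytes(uint8_t *out, size_t pos, uint8_t fill, size_t n)
{
	memset(out + pos, fill, n);
	return pos + n;
}

/* Append a 2-byte big-endian length prefix; returns the new position. */
static size_t put_len(uint8_t *out, size_t pos, uint16_t len)
{
	out[pos] = (uint8_t)(len >> 8);
	out[pos + 1] = (uint8_t)(len & 0xff);
	return pos + 2;
}

/* Walk the stream as the receiver would and record up to max frame lengths.
 * Returns the number of complete frames found.
 */
static size_t parse_frames(const uint8_t *in, size_t total,
			   uint16_t *lens, size_t max)
{
	size_t pos = 0, n = 0;

	assert(in && lens);
	while (n < max && pos + 2 <= total) {
		uint16_t len = (uint16_t)((in[pos] << 8) | in[pos + 1]);

		if (pos + 2 + len > total)
			break; /* frame not complete yet */
		lens[n++] = len;
		pos += 2 + (size_t)len;
	}
	return n;
}

/* msg1 is 8 bytes of 0x11, but only 3 are written before msg2
 * (4 bytes of 0x22, with its own prefix) slips in; the rest of msg1 follows.
 * out must have room for at least 16 bytes.
 */
static size_t build_interleaved(uint8_t *out)
{
	size_t pos = 0;

	pos = put_len(out, pos, 8);
	pos = put_bytes(out, pos, 0x11, 3);
	pos = put_len(out, pos, 4);
	pos = put_bytes(out, pos, 0x22, 4);
	pos = put_bytes(out, pos, 0x11, 5);
	return pos;
}
```

The parser accepts one 8-byte "message" that actually contains msg2's prefix and part of its payload, then reads 0x22 0x11 as the next length and stalls waiting for data that will never match the framing.
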

> 
> 
> The stream format looks identical to ESP in TCP [1] (2B length prefix
> followed by the actual message), so I think the espintcp code (both tx
> and rx, except for actual protocol parsing) should look very
> similar. The problems that need to be solved for both protocols are
> pretty much the same.

ok, will have a look. maybe this will simplify the code even more and we 
will get rid of some of the issues we were discussing above.

Thanks!

> 
> [1] https://www.rfc-editor.org/rfc/rfc8229#section-3
> 

-- 
Antonio Quartulli
OpenVPN Inc.