Message-ID: <2ddf759d-378f-475c-8fc1-30c6e83c2d14@openvpn.net>
Date: Wed, 15 May 2024 00:11:28 +0200
From: Antonio Quartulli <antonio@...nvpn.net>
To: Sabrina Dubroca <sd@...asysnail.net>
Cc: netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
Sergey Ryazanov <ryazanov.s.a@...il.com>, Paolo Abeni <pabeni@...hat.com>,
Eric Dumazet <edumazet@...gle.com>, Andrew Lunn <andrew@...n.ch>,
Esben Haabendal <esben@...nix.com>
Subject: Re: [PATCH net-next v3 13/24] ovpn: implement TCP transport

On 14/05/2024 10:58, Sabrina Dubroca wrote:
>>>> diff --git a/drivers/net/ovpn/peer.h b/drivers/net/ovpn/peer.h
>>>> index b5ff59a4b40f..ac4907705d98 100644
>>>> --- a/drivers/net/ovpn/peer.h
>>>> +++ b/drivers/net/ovpn/peer.h
>>>> + * @tcp.raw_len: next packet length as read from the stream (TCP only)
>>>> + * @tcp.skb: next packet being filled with data from the stream (TCP only)
>>>> + * @tcp.offset: position of the next byte to write in the skb (TCP only)
>>>> + * @tcp.data_len: next packet length converted to host order (TCP only)
>>>
>>> It would be nice to add information about whether they're used for TX or RX.
>>
>> they are all about "from the stream" and "to the skb", meaning that we are
>> doing RX.
>> Will make it more explicit.
>
> Maybe group them in a struct rx?
yap, makes sense.
>
>>>> + * @tcp.sk_cb.sk_data_ready: pointer to original cb
>>>> + * @tcp.sk_cb.sk_write_space: pointer to original cb
>>>> + * @tcp.sk_cb.prot: pointer to original prot object
>>>> * @crypto: the crypto configuration (ciphers, keys, etc..)
>>>> * @dst_cache: cache for dst_entry used to send to peer
>>>> * @bind: remote peer binding
>>>> @@ -59,6 +69,25 @@ struct ovpn_peer {
>>>>  	struct ptr_ring netif_rx_ring;
>>>>  	struct napi_struct napi;
>>>>  	struct ovpn_socket *sock;
>>>> +	/* state of the TCP reading. Needed to keep track of how much of a
>>>> +	 * single packet has already been read from the stream and how much is
>>>> +	 * missing
>>>> +	 */
>>>> +	struct {
>>>> +		struct ptr_ring tx_ring;
>>>> +		struct work_struct tx_work;
>>>> +		struct work_struct rx_work;
>>>> +
>>>> +		u8 raw_len[sizeof(u16)];
>>>
>>> Why not u16 or __be16 for this one?
>>
>> because in this array we are putting the bytes as we get them from the
>> stream.
>> We may be at the point where one out of two bytes is available on the
>> stream. For this reason I use an array to store this u16 byte by byte.
>>
>> Once the two bytes are ready, we convert the content into an actual int and
>> store it in "data_len" (a few lines below).
>
> Ok, I see. Hopefully you can switch to strparser and make this one go
> away.
>
>
>>>> diff --git a/drivers/net/ovpn/socket.c b/drivers/net/ovpn/socket.c
>>>> index e099a61b03fa..004db5b13663 100644
>>>> --- a/drivers/net/ovpn/socket.c
>>>> +++ b/drivers/net/ovpn/socket.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include "packet.h"
>>>>  #include "peer.h"
>>>>  #include "socket.h"
>>>> +#include "tcp.h"
>>>>  #include "udp.h"
>>>>  /* Finalize release of socket, called after RCU grace period */
>>>> @@ -26,6 +27,8 @@ static void ovpn_socket_detach(struct socket *sock)
>>>>  	if (sock->sk->sk_protocol == IPPROTO_UDP)
>>>>  		ovpn_udp_socket_detach(sock);
>>>> +	else if (sock->sk->sk_protocol == IPPROTO_TCP)
>>>> +		ovpn_tcp_socket_detach(sock);
>>>>  	sockfd_put(sock);
>>>>  }
>>>> @@ -69,6 +72,8 @@ static int ovpn_socket_attach(struct socket *sock, struct ovpn_peer *peer)
>>>>  	if (sock->sk->sk_protocol == IPPROTO_UDP)
>>>>  		ret = ovpn_udp_socket_attach(sock, peer->ovpn);
>>>> +	else if (sock->sk->sk_protocol == IPPROTO_TCP)
>>>> +		ret = ovpn_tcp_socket_attach(sock, peer);
>>>>  	return ret;
>>>>  }
>>>> @@ -124,6 +129,21 @@ struct ovpn_socket *ovpn_socket_new(struct socket *sock, struct ovpn_peer *peer)
>>>>  	ovpn_sock->sock = sock;
>>>
>>> The line above this is:
>>>
>>> ovpn_sock->ovpn = peer->ovpn;
>>>
>>> It's technically fine since you then overwrite this with peer in case
>>> we're on TCP, but ovpn_sock->ovpn only exists on UDP since you moved
>>> it into a union in this patch.
>>
>> Yeah, I did not want to make another branch, but having a UDP specific case
>> will make code easier to read.
>
> Either that, or drop the union.
ACK
>
>
>>>> diff --git a/drivers/net/ovpn/tcp.c b/drivers/net/ovpn/tcp.c
>>>> new file mode 100644
>>>> index 000000000000..84ad7cd4fc4f
>>>> --- /dev/null
>>>> +++ b/drivers/net/ovpn/tcp.c
>>>> @@ -0,0 +1,511 @@
>>>> +static int ovpn_tcp_read_sock(read_descriptor_t *desc, struct sk_buff *in_skb,
>>>> +			      unsigned int in_offset, size_t in_len)
>>>> +{
>>>> +	struct sock *sk = desc->arg.data;
>>>> +	struct ovpn_socket *sock;
>>>> +	struct ovpn_skb_cb *cb;
>>>> +	struct ovpn_peer *peer;
>>>> +	size_t chunk, copied = 0;
>>>> +	void *data;
>>>> +	u16 len;
>>>> +	int st;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	sock = rcu_dereference_sk_user_data(sk);
>>>> +	rcu_read_unlock();
>>>
>>> You can't just release rcu_read_lock and keep using sock (here and in
>>> the rest of this file). Either you keep rcu_read_lock, or you can take
>>> a reference on the ovpn_socket.
>>
>> I was just staring at this today, after having worked on the
>> rcu_read_lock/unlock for the peer get()s..
>>
>> I think the assumption was: if we are in this read_sock callback, it's
>> impossible that the ovpn_socket was invalidated, because it gets invalidated
>> upon detach, which also prevents any further calling of this callback. But
>> this sounds racy, and I guess we should somehow hold a reference..
>
> ovpn_tcp_read_sock starts
>
> detach
> kfree_rcu(ovpn_socket)
> ...
> ovpn_socket actually freed
> ...
> ovpn_tcp_read_sock continues with freed ovpn_socket
>
>
> I don't think anything in the current code prevents this.
mh yeah, if something like this happens right after having started the
read_sock we are doomed.
Will fix this.
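A rough userspace analogue of "take a reference before using the object" is sketched below with C11 atomics. It is not kernel code and not what the patch does today: struct obj, obj_get_unless_zero() and obj_put() are made-up names. In the kernel the lookup would stay inside the rcu_read_lock() section and would use something like kref_get_unless_zero(), with the final put doing the kfree_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for ovpn_socket */
struct obj {
	atomic_int refcount;
	bool released;		/* stand-in for "memory handed to kfree_rcu()" */
};

/* Take a reference only if the object is still alive (refcount > 0) */
static bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old > 0) {
		/* on failure 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put releases the object */
static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		o->released = true;
}
```

With this pattern, a detach that runs while read_sock holds its own reference only drops the count; the object is released when read_sock does its put.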
>
>
>>>> +/* Set TCP encapsulation callbacks */
>>>> +int ovpn_tcp_socket_attach(struct socket *sock, struct ovpn_peer *peer)
>>>> +{
>>>> +	void *old_data;
>>>> +	int ret;
>>>> +
>>>> +	INIT_WORK(&peer->tcp.tx_work, ovpn_tcp_tx_work);
>>>> +
>>>> +	ret = ptr_ring_init(&peer->tcp.tx_ring, OVPN_QUEUE_LEN, GFP_KERNEL);
>>>> +	if (ret < 0) {
>>>> +		netdev_err(peer->ovpn->dev, "cannot allocate TCP TX ring\n");
>>>> +		return ret;
>>>> +	}
>>>> +
>>>> +	peer->tcp.skb = NULL;
>>>> +	peer->tcp.offset = 0;
>>>> +	peer->tcp.data_len = 0;
>>>> +
>>>> +	write_lock_bh(&sock->sk->sk_callback_lock);
>>>> +
>>>> +	/* make sure no pre-existing encapsulation handler exists */
>>>> +	rcu_read_lock();
>>>> +	old_data = rcu_dereference_sk_user_data(sock->sk);
>>>> +	rcu_read_unlock();
>>>> +	if (old_data) {
>>>> +		netdev_err(peer->ovpn->dev,
>>>> +			   "provided socket already taken by other user\n");
>>>> +		ret = -EBUSY;
>>>> +		goto err;
>>>
>>> The UDP code differentiates "socket already owned by this interface"
>>> from "already taken by other user". That doesn't apply to TCP?
>>
>> This makes me wonder: how safe is it to interpret the user data as an object
>> of type ovpn_socket?
>>
>> When we find the user data already assigned, we don't know what was really
>> stored in there, right?
>> Technically this socket could have gone through another module which
>> assigned its own state.
>>
>> Therefore I think that what UDP does [ dereferencing ((struct ovpn_socket
>> *)user_data)->ovpn ] is probably not safe. Would you agree?
>
> Hmmm, yeah, I think you're right. If you checked encap_type ==
> UDP_ENCAP_OVPNINUDP before (sk_prot for TCP), then you'd know it's
> really your data. Basically call ovpn_from_udp_sock during attach if
> you want to check something beyond EBUSY.
right. Maybe we can live with simply reporting EBUSY and be done with
it, without adding extra checks and what not.
>
> Once you're in your own callbacks, it should be safe. If some other
> code sends packet with a non-ovpn socket to ovpn's ->encap_rcv,
> something is really broken.
yup
>
>>>> +int __init ovpn_tcp_init(void)
>>>> +{
>>>> +	/* We need to substitute the recvmsg and the sock_is_readable
>>>> +	 * callbacks in the sk_prot member of the sock object for TCP
>>>> +	 * sockets.
>>>> +	 *
>>>> +	 * However sock->sk_prot is a pointer to a static variable and
>>>> +	 * therefore we can't directly modify it, otherwise every socket
>>>> +	 * pointing to it will be affected.
>>>> +	 *
>>>> +	 * For this reason we create our own static copy and modify what
>>>> +	 * we need. Then we make sk_prot point to this copy
>>>> +	 * (in ovpn_tcp_socket_attach())
>>>> +	 */
>>>> +	ovpn_tcp_prot = tcp_prot;
>>>
>>> Don't you need a separate variant for IPv6, like TLS does?
>>
>> Never did so far.
>>
>> My wild wild wild guess: for the time this socket is owned by ovpn, we only
>> use callbacks that are IPvX agnostic, hence v4 vs v6 doesn't make any
>> difference.
>> When this socket is released, we reassign the original prot.
>
> That seems a bit suspicious to me. For example, tcpv6_prot has a
> different backlog_rcv. And you don't control if the socket is detached
> before being closed, or which callbacks are needed. Your userspace
> client doesn't use them, but someone else's might.
>
>>>> +	ovpn_tcp_prot.recvmsg = ovpn_tcp_recvmsg;
>>>
>>> You don't need to replace ->sendmsg as well? The userspace client is
>>> not expected to send messages?
>>
>> It is, but my assumption is that those packets will just go through the
>> socket as usual. No need to be handled by ovpn (those packets are not
>> encrypted/decrypted, like data traffic is).
>> And this is how it has worked so far.
>>
>> Makes sense?
>
> Two things come to mind:
>
> - userspace is expected to prefix the messages it inserts on the
> stream with the 2-byte length field? otherwise, the peer won't be
> able to parse them out of the stream
correct. userspace sends those packets as if ovpn is not running,
therefore this happens naturally.
>
> - I'm not convinced this would be safe wrt kernel writing partial
> messages. if ovpn_tcp_send_one doesn't send the full message, you
> could interleave two messages:
>
> +------+-------------------+------+--------+----------------+
> | len1 | (bytes from msg1) | len2 | (msg2) | (rest of msg1) |
> +------+-------------------+------+--------+----------------+
>
> and the RX side would parse that as:
>
> +------+-----------------------------------+------+---------
> | len1 | (bytes from msg1) | len2 | (msg2) | ???? | ...
> +------+-------------------+---------------+------+---------
>
> and try to interpret some random bytes out of either msg1 or msg2 as
> a length prefix, resulting in a broken stream.
hm you are correct. if multiple sendmsg calls can overlap, then we might be in
trouble, but are we sure this can truly happen?
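Whether or not the overlap can occur in practice, its effect on the receiver is easy to reproduce in userspace. The sketch below is not kernel code and all names in it are hypothetical. It assumes two writers share one stream without serialization and that the first write is cut short, builds the interleaved layout from the diagram above, and walks it with a plain 2-byte-length parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 2-byte big-endian length prefix followed by the payload */
static size_t put_frame(uint8_t *out, const uint8_t *msg, uint16_t len)
{
	out[0] = (uint8_t)(len >> 8);
	out[1] = (uint8_t)(len & 0xff);
	memcpy(out + 2, msg, len);
	return (size_t)len + 2;
}

/* Build: len1 | first 'cut' bytes of msg1 | len2 | msg2 | rest of msg1 */
static size_t build_interleaved(uint8_t *out,
				const uint8_t *m1, uint16_t l1, size_t cut,
				const uint8_t *m2, uint16_t l2)
{
	size_t off = 0;

	out[off++] = (uint8_t)(l1 >> 8);
	out[off++] = (uint8_t)(l1 & 0xff);
	memcpy(out + off, m1, cut);		/* partial write of msg1 */
	off += cut;
	off += put_frame(out + off, m2, l2);	/* msg2 lands in the middle */
	memcpy(out + off, m1 + cut, l1 - cut);	/* rest of msg1 */
	off += l1 - cut;
	return off;
}

/* Walk the stream frame by frame, recording each length prefix.
 * Stops at the first frame that does not fit in the remaining bytes.
 */
static size_t parse_lens(const uint8_t *s, size_t n, uint16_t *lens, size_t max)
{
	size_t off = 0, count = 0;

	while (n - off >= 2 && count < max) {
		uint16_t len = (uint16_t)((s[off] << 8) | s[off + 1]);

		if (n - off - 2 < len)
			break;	/* incomplete (or bogus) frame */
		lens[count++] = len;
		off += 2 + (size_t)len;
	}
	return count;
}
```

With an 8-byte msg1 cut after 3 bytes and a 4-byte msg2, the parser accepts one 8-byte "message" that contains msg2's length prefix, and then reads the next length from leftover payload bytes.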
>
>
> The stream format looks identical to ESP in TCP [1] (2B length prefix
> followed by the actual message), so I think the espintcp code (both tx
> and rx, except for actual protocol parsing) should look very
> similar. The problems that need to be solved for both protocols are
> pretty much the same.
ok, will have a look. maybe this will simplify the code even more and we
will get rid of some of the issues we were discussing above.
Thanks!
>
> [1] https://www.rfc-editor.org/rfc/rfc8229#section-3
>
--
Antonio Quartulli
OpenVPN Inc.