netdev - Re: [PATCH net-next 10/10] ovpn: ensure sk is still valid during cleanup

Message-ID: <f0d7cbd7-6bc2-4034-b912-17c3a1959021@openvpn.net>
Date: Tue, 13 May 2025 11:19:29 +0200
From: Antonio Quartulli <antonio@...nvpn.net>
To: Paolo Abeni <pabeni@...hat.com>, Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
 Sabrina Dubroca <sd@...asysnail.net>, Al Viro <viro@...iv.linux.org.uk>,
 Qingfang Deng <dqfext@...il.com>, Gert Doering <gert@...enie.muc.de>
Subject: Re: [PATCH net-next 10/10] ovpn: ensure sk is still valid during
 cleanup

On 13/05/2025 10:21, Paolo Abeni wrote:
> 
> 
> On 5/13/25 3:37 AM, Jakub Kicinski wrote:
>> On Fri,  9 May 2025 16:26:20 +0200 Antonio Quartulli wrote:
>>> In case of UDP peer timeout, an openvpn client (userspace)
>>> performs the following actions:
>>> 1. receives the peer deletion notification (reason=timeout)
>>> 2. closes the socket
>>>
>>> Upon 1. we have the following:
>>> - ovpn_peer_keepalive_work()
>>>   - ovpn_socket_release()
>>>    - synchronize_rcu()
>>> At this point, 2. gets a chance to complete and ovpn_sock->sock->sk
>>> becomes NULL. ovpn_socket_release() will then attempt dereferencing it,
>>> resulting in the following crash log:
>>
>> What runs where is a bit unclear to me. Specifically I'm not sure what
>> runs the code under the "if (released)" branch of ovpn_socket_release()
>> if the user closes the socket. Because you now return without a WARN().
>>
>>> @@ -75,13 +76,14 @@ void ovpn_socket_release(struct ovpn_peer *peer)
>>>   	if (!sock)
>>>   		return;
>>>   
>>> -	/* sanity check: we should not end up here if the socket
>>> -	 * was already closed
>>> +	/* sock->sk may be released concurrently, therefore we
>>> +	 * first attempt grabbing a reference.
>>> +	 * if sock->sk is NULL it means it is already being
>>> +	 * destroyed and we don't need any further cleanup
>>>   	 */
>>> -	if (!sock->sock->sk) {
>>> -		DEBUG_NET_WARN_ON_ONCE(1);
>>> +	sk = sock->sock->sk;
>>> +	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
>>
>> How is sk protected from getting reused here?
>> refcount_inc_not_zero() still needs the underlying object to be allocated.
>> I don't see any locking here, and code says this function may sleep so
>> it can't be called under RCU, either.
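Here is a userspace sketch of increment-unless-zero semantics built on C11 atomics. It illustrates the point above. The names `struct obj`, `obj_get_not_zero()` and `obj_put()` are hypothetical, and this is not the kernel's `refcount_t`, which also saturates on overflow. The compare-and-swap loop must read the counter. That lets it refuse to revive an object whose count already reached zero, but it cannot help if the memory holding the counter has already been freed or reused. Something else (a lock, RCU, or an already-held reference) has to keep the object allocated across the call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a refcounted kernel object. */
struct obj {
	atomic_uint refcnt;
};

/* Take a reference only if the count is still non-zero. The loop reads
 * o->refcnt, so 'o' must point at memory that is still allocated: this
 * guards against racing with the final put, not against use-after-free. */
static bool obj_get_not_zero(struct obj *o)
{
	unsigned int old = atomic_load_explicit(&o->refcnt, memory_order_relaxed);

	do {
		if (old == 0)
			return false; /* already dying: caller must not use it */
	} while (!atomic_compare_exchange_weak_explicit(&o->refcnt, &old, old + 1,
							memory_order_acquire,
							memory_order_relaxed));
	return true;
}

/* Drop a reference; returns true when the caller dropped the last one
 * and would therefore be responsible for freeing the object. */
static bool obj_put(struct obj *o)
{
	unsigned int old = atomic_fetch_sub_explicit(&o->refcnt, 1,
						     memory_order_acq_rel);

	assert(old > 0); /* underflow: put without a matching get */
	return old == 1;
}
```

The usage checks below only work because `o` lives in storage that is never freed. In real code, the object could be gone by the time a late caller tries `obj_get_not_zero()`.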
> 
> I agree this still looks racy. When the socket close runs, nobody else
> should have access/reference to the 'struct socket'. I'm under the
> impression that ovpn_socket should acquire references to the underlying
> fd instead of keeping its own refcount.

This is what we were originally doing, but since the socket is not a 
"kernel socket", holding a reference prevented us from detecting when 
the socket was supposed to be destroyed (ovpn itself was still holding 
a ref, so the count never dropped to zero).
Hence we switched to this model, where we get notified about the socket 
going away via the close()/destroy() callbacks.
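The following toy model shows the trade-off just described. It assumes that a user-owned socket is torn down only when its last reference goes away. If ovpn holds its own reference, the user's close() merely drops the count and destroy never runs. If ovpn instead registers a destroy hook, it learns about the teardown without keeping the socket alive. `struct fake_sock`, `fake_sock_put()` and `ovpn_like_destroy()` are hypothetical and do not correspond to kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a user-owned socket: destroy runs only when the
 * last reference is dropped. Not kernel code. */
struct fake_sock {
	int refs;                               /* userspace fd, ovpn, ... */
	bool destroyed;                         /* destroy has run */
	void (*on_destroy)(struct fake_sock *); /* owner notification hook */
};

static void fake_sock_put(struct fake_sock *s)
{
	assert(s->refs > 0);
	if (--s->refs == 0) {
		s->destroyed = true;
		if (s->on_destroy)
			s->on_destroy(s);
	}
}

static int notified;

/* Stand-in for an ovpn-style destroy handler: it is told the socket is
 * going away instead of pinning it with an extra reference. */
static void ovpn_like_destroy(struct fake_sock *s)
{
	(void)s;
	notified++;
}
```

With an extra reference (`refs = 2`), a single put from the user's close() leaves the socket alive. With only the user's reference plus the hook, that same put destroys it and notifies the owner.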


I think ovpn_socket should coordinate access to its sock member and 
nullify it during destroy (which is invoked by sk_common_release()).
At that point no other part of the code will have a chance to access it.
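One way the proposed coordination could look is sketched below in userspace with a pthread mutex. The destroy path clears the pointer under the lock, and release checks and uses it under the same lock. This leaves no window between the NULL check and the dereference. All names are hypothetical. The sketch assumes the destroy callback clears the pointer before the underlying sock memory is freed. In-kernel code would pick whatever lock (or RCU scheme) fits the contexts these paths actually run in.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct sock_stub { int id; };   /* hypothetical placeholder for struct sock */

/* Hypothetical model of an ovpn_socket whose sock member is guarded by a
 * lock and cleared by the socket's destroy callback. */
struct ovpn_socket_model {
	pthread_mutex_t lock;
	struct sock_stub *sk;   /* cleared by destroy or by release */
	int detached_id;        /* id of the sock release cleaned up, 0 if none */
};

/* Destroy path: once this returns, no other path can observe sk. */
static void model_destroy(struct ovpn_socket_model *os)
{
	pthread_mutex_lock(&os->lock);
	os->sk = NULL;
	pthread_mutex_unlock(&os->lock);
}

/* Release path: check and use sk under the same lock. Returns 1 if this
 * call performed the cleanup, 0 if sk was already gone. */
static int model_release(struct ovpn_socket_model *os)
{
	int done = 0;

	assert(os != NULL);
	pthread_mutex_lock(&os->lock);
	if (os->sk) {
		/* stand-in for detaching callbacks from os->sk */
		os->detached_id = os->sk->id;
		os->sk = NULL;
		done = 1;
	}
	pthread_mutex_unlock(&os->lock);
	return done;
}
```

Whichever path takes the lock first wins. A release that runs after destroy sees NULL and skips cleanup instead of dereferencing a dead pointer.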

I am going to try this idea out right now.


> 
> Side note: the ovpn_socket refcount release/detach path looks wrong, at
> least in case of a UDP socket, as ovpn_udp_socket_detach() calls
> setup_udp_tunnel_sock() which in turn will try to _increment_ various
> core counters, instead of decreasing them (i.e. udp_encap_enable would
> be wrongly accounted after that call).

You're right.
I was under the impression that I needed to "undo" the setup.
I now see that the encap key is decremented in the UDP socket destroy 
path, right after my implementation of .destroy() is called.

I'll drop the call to setup_udp_tunnel_sock() with an empty config, then.
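The accounting problem Paolo points out can be shown with a deliberately simplified model. It assumes the setup helper bumps a global encap counter and the UDP destroy path drops it once, right after invoking the ovpn destroy hook. Calling the setup helper again to "undo" it leaves the counter permanently elevated. The counter and functions are hypothetical. The real setup_udp_tunnel_sock() and UDP destroy code track more state than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a global "encap needed" counter. */
static int encap_users;

struct udp_sock_model {
	bool encap_enabled;
};

/* Simplified setup (assumption): each call bumps the global counter. */
static void model_setup_tunnel(struct udp_sock_model *us)
{
	us->encap_enabled = true;
	encap_users++;
}

/* Simplified UDP destroy path: invoke the ovpn destroy hook first, then
 * drop the counter once for a socket that had encap enabled. */
static void model_udp_destroy(struct udp_sock_model *us,
			      void (*ovpn_destroy)(struct udp_sock_model *))
{
	if (ovpn_destroy)
		ovpn_destroy(us);
	if (us->encap_enabled)
		encap_users--;
	assert(encap_users >= 0);
}

/* Detach that tries to "undo" setup by calling setup again: leaks a count. */
static void detach_calling_setup(struct udp_sock_model *us)
{
	model_setup_tunnel(us);
}

/* Detach that leaves the counter to the UDP destroy path. */
static void detach_without_setup(struct udp_sock_model *us)
{
	(void)us;
}
```

In this model, only the second detach variant brings the counter back to zero once the socket is destroyed.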

Regards,

> 
> /P
> 

-- 
Antonio Quartulli
OpenVPN Inc.