[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <85ad278ad61943938a6537f1405f7814@AcuMS.aculab.com>
Date: Sun, 8 Dec 2024 10:54:23 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Kuniyuki Iwashima' <kuniyu@...zon.com>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski
<kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
CC: Kuniyuki Iwashima <kuni1840@...il.com>, "netdev@...r.kernel.org"
<netdev@...r.kernel.org>
Subject: RE: [PATCH v1 net-next 00/15] treewide: socket: Clean up
sock_create() and friends.
From: Kuniyuki Iwashima
> Sent: 06 December 2024 07:55
>
> There are a bunch of weird usages of sock_create() and friends due
> to poor documentation.
>
> 1) some subsystems use __sock_create(), but all of them can be
> replaced with sock_create_kern()
Not currently because sock_create_kern() doesn't increase the ref count
on the network namespace.
So I have an out of tree driver that end up using __sock_create() in
order that the socket holds a reference to the network namespace.
> 2) some subsystems use sock_create(), but most of the sockets are
> not exposed to userspace via file descriptors but are exposed
> to some BPF hooks (most likely unintentionally)
AFAIR the 'kern' flag removes some security/permission checks.
So a socket that is being created to handle user API requests
(user data might be encapsulated and then sent) probably ought
to have 'kern == 0' even though the socket is not directly exposed
through the fd table.
> 3) some subsystems use sock_create_kern() and convert the sockets
> to hold netns refcnt (cifs, mptcp, rds, smc, and sunrpc)
>
> 4) the sockets of 2) and 3) are counted in /proc/net/sockstat even
> though they are untouchable from userspace
>
> The primary goal is to sort out such confusion and provide enough
> documentation for future developers to choose an appropriate API.
>
> Regarding 3), we introduce a new API, sock_create_net(), that holds
> a netns refcnt for kernel socket to remove the socket conversion to
> avoid use-after-free triggered by TCP kernel socket after commit
> 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns
> of kernel sockets.").
>
> Throughout the series, we follow the definition below:
>
> userspace socket:
> * created by sock_create_user()
> * holds the reference count of the network namespace
> * directly linked to a file descriptor
> * accessed via a file descriptor (and some BPF hooks)
> * counted in the first line of /proc/net/sockstat.
>
> kernel socket
> * created by sock_create_net() or sock_create_net_noref()
> * the former holds the refcnt of netns, but the latter doesn't
> * not directly exposed to userspace via a file descriptor nor BPF
That isn't really right:
A 'userspace socket' (kern == 0) is a socket created to support a user's
API calls. Typically (but not always) directly linked to a file descriptor.
Security/permission checks include those of the current user/process.
A 'kernel socket' (kern == 1) is a socket that isn't related to a user process.
These fall into two groups:
1) Normal TCP (etc) sockets used by things like remote filesystems.
These need to hold a reference to the network namespace and will stop
the namespace being deleted.
2) Special sockets used internally (perhaps for routing).
These don't hold a reference, but require the caller have a callback
for the namespace being deleted and must close the socket before
the callback returns.
The close must be fully synchronous - FIN_WAIT states will break things.
> Note that I didn't CC maintainers for mechanical changes as the CC
> list explodes.
This whole change (which is probably worth it) need doing in a different
order.
Start with the actual change to the socket create code and then work
back through the callers.
That may require adding wrapper functions for the existing calls that
can then finally be deleted in the last patch.
Additionally boolean parameters need to be banned, multiple ones
are worse - you really can tell from the call site what they mean.
So change the boolean 'kern' parameter to a flags one.
You then want 0 to be the normal case (user socket that holds a ref).
Then add (say):
#define SOCK_CREATE_NO_NS_REF 1
#define SOCK_CREATE_KERN 2
Initially add to the top of sk_alloc():
if (flags & SOCK_CREATE_NOREF)
flags |= SOCK_CREATE_KERN;
Now all the code compiles and works as before.
Next find all the call sites (especially those that pass 1)
and change so they pass the correct flag combination.
(Some static inlines/#defines might help with long lines.)
Finally change sk_alloc() to return NULL if NO_NS_REF
is set without KERN.
None of the 'pass through' functions need changing except to
ensure the parameter is 'unsigned int flags' (not 'bool kern').
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Powered by blists - more mailing lists