[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241209052643.9896-1-kuniyu@amazon.com>
Date: Mon, 9 Dec 2024 14:26:43 +0900
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <david.laight@...lab.com>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>,
<kuni1840@...il.com>, <kuniyu@...zon.com>, <netdev@...r.kernel.org>,
<pabeni@...hat.com>
Subject: Re: [PATCH v1 net-next 00/15] treewide: socket: Clean up sock_create() and friends.
From: David Laight <David.Laight@...LAB.COM>
Date: Sun, 8 Dec 2024 10:54:23 +0000
> From: Kuniyuki Iwashima
> > Sent: 06 December 2024 07:55
> >
> > There are a bunch of weird usages of sock_create() and friends due
> > to poor documentation.
> >
> > 1) some subsystems use __sock_create(), but all of them can be
> > replaced with sock_create_kern()
>
> Not currently because sock_create_kern() doesn't increase the ref count
> on the network namespace.
> So I have an out of tree driver that end up using __sock_create() in
> order that the socket holds a reference to the network namespace.
An out of tree driver is apparently out of scope, and the owner always
needs to follow the upstream change.
That's one of reasons I changed the arguments and renamed sock_create()
helpers, letting the out-of-tree drivers notice the change through build
failure.
>
> > 2) some subsystems use sock_create(), but most of the sockets are
> > not exposed to userspace via file descriptors but are exposed
> > to some BPF hooks (most likely unintentionally)
>
> AFAIR the 'kern' flag removes some security/permission checks.
Right.
> So a socket that is being created to handle user API requests
> (user data might be encapsulated and then sent) probably ought
> to have 'kern == 0' even though the socket is not directly exposed
> through the fd table.
tun/tap is counterexample, where sk_alloc(kern=0) is directly
used instead of sock_create(), then no LSM is invoked.
>
> > 3) some subsystems use sock_create_kern() and convert the sockets
> > to hold netns refcnt (cifs, mptcp, rds, smc, and sunrpc)
> >
> > 4) the sockets of 2) and 3) are counted in /proc/net/sockstat even
> > though they are untouchable from userspace
> >
> > The primary goal is to sort out such confusion and provide enough
> > documentation for future developers to choose an appropriate API.
> >
> > Regarding 3), we introduce a new API, sock_create_net(), that holds
> > a netns refcnt for kernel socket to remove the socket conversion to
> > avoid use-after-free triggered by TCP kernel socket after commit
> > 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns
> > of kernel sockets.").
> >
> > Throughout the series, we follow the definition below:
> >
> > userspace socket:
> > * created by sock_create_user()
> > * holds the reference count of the network namespace
> > * directly linked to a file descriptor
> > * accessed via a file descriptor (and some BPF hooks)
> > * counted in the first line of /proc/net/sockstat.
> >
> > kernel socket
> > * created by sock_create_net() or sock_create_net_noref()
> > * the former holds the refcnt of netns, but the latter doesn't
> > * not directly exposed to userspace via a file descriptor nor BPF
>
> That isn't really right:
> A 'userspace socket' (kern == 0) is a socket created to support a user's
> API calls. Typically (but not always) directly linked to a file descriptor.
> Security/permission checks include those of the current user/process.
>
> A 'kernel socket' (kern == 1) is a socket that isn't related to a user process.
> These fall into two groups:
> 1) Normal TCP (etc) sockets used by things like remote filesystems.
> These need to hold a reference to the network namespace and will stop
> the namespace being deleted.
You'll see all of the sockets using sock_create() in the patch below
are of this type. (infiniband, ISDN, NVMe over TCP, iscsi, Xen PV call,
ocfs2, smbd)
https://lore.kernel.org/netdev/20241206075504.24153-14-kuniyu@amazon.com/
Only one valid user of sock_create() is SCTP, and it's "created to support
users's API call" and actually tied to a file descriptor.
https://lore.kernel.org/netdev/20241206075504.24153-15-kuniyu@amazon.com/
>From this POV, I think tun/tap also should use kern=true && hold_net=true.
No 'socket' is mentioned in tun/tap doc. Users need not be aware of the
underlying sockets.
$ grep socket Documentation/networking/tuntap.rst
$ echo $?
1
> 2) Special sockets used internally (perhaps for routing).
> These don't hold a reference, but require the caller have a callback
> for the namespace being deleted and must close the socket before
> the callback returns.
> The close must be fully synchronous - FIN_WAIT states will break things.
>
> > Note that I didn't CC maintainers for mechanical changes as the CC
> > list explodes.
>
> This whole change (which is probably worth it) need doing in a different
> order.
I don't think it's worth it, and you are inconsistent.
This series is a revival of my old patch, where I tried to
override the kern parameter.
Interestingly, you suggested adding another parameter then,
that was opposite of what you are suggesting now.
https://lore.kernel.org/netdev/20240227011041.97375-4-kuniyu@amazon.com/
> Start with the actual change to the socket create code and then work
> back through the callers.
> That may require adding wrapper functions for the existing calls that
> can then finally be deleted in the last patch.
>
> Additionally boolean parameters need to be banned, multiple ones
> are worse - you really can tell from the call site what they mean.
As mentioned above, this is intentional to break out-of-tree drivers
and give the owners a chance to look at this change.
For example, we provide Lustre service with an out-of-tree driver
where sock_crete_kern() is used and needs update along future kernel
update.
Also, changing param let compilers inspect all sock_create_XXX()
users, and there was a weird path like smc_ulp_init().
https://lore.kernel.org/netdev/20241206075504.24153-4-kuniyu@amazon.com/
>
> So change the boolean 'kern' parameter to a flags one.
> You then want 0 to be the normal case (user socket that holds a ref).
> Then add (say):
> #define SOCK_CREATE_NO_NS_REF 1
> #define SOCK_CREATE_KERN 2
> Initially add to the top of sk_alloc():
> if (flags & SOCK_CREATE_NOREF)
> flags |= SOCK_CREATE_KERN;
> Now all the code compiles and works as before.
>
> Next find all the call sites (especially those that pass 1)
> and change so they pass the correct flag combination.
Changing param allows us to effectively inspect these callers with
help of the compiler.
> (Some static inlines/#defines might help with long lines.)
>
> Finally change sk_alloc() to return NULL if NO_NS_REF
> is set without KERN.
This is no-go, no one wants to debug a weird -ENOMEM.
I added DEBUG_NET_WARN_ON_ONCE() for such a case so that
syzbot will let us know that pattern
https://lore.kernel.org/netdev/20241206075504.24153-10-kuniyu@amazon.com/
Powered by blists - more mailing lists