lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241209052643.9896-1-kuniyu@amazon.com>
Date: Mon, 9 Dec 2024 14:26:43 +0900
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <david.laight@...lab.com>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>,
	<kuni1840@...il.com>, <kuniyu@...zon.com>, <netdev@...r.kernel.org>,
	<pabeni@...hat.com>
Subject: Re: [PATCH v1 net-next 00/15] treewide: socket: Clean up sock_create() and friends.

From: David Laight <David.Laight@...LAB.COM>
Date: Sun, 8 Dec 2024 10:54:23 +0000
> From: Kuniyuki Iwashima
> > Sent: 06 December 2024 07:55
> > 
> > There are a bunch of weird usages of sock_create() and friends due
> > to poor documentation.
> > 
> >   1) some subsystems use __sock_create(), but all of them can be
> >      replaced with sock_create_kern()
> 
> Not currently because sock_create_kern() doesn't increase the ref count
> on the network namespace.
> So I have an out of tree driver that end up using __sock_create() in
> order that the socket holds a reference to the network namespace.

An out of tree driver is apparently out of scope, and the owner always
needs to follow the upstream change.

That's one of reasons I changed the arguments and renamed sock_create()
helpers, letting the out-of-tree drivers notice the change through build
failure.


> 
> >   2) some subsystems use sock_create(), but most of the sockets are
> >      not exposed to userspace via file descriptors but are exposed
> >      to some BPF hooks (most likely unintentionally)
> 
> AFAIR the 'kern' flag removes some security/permission checks.

Right.


> So a socket that is being created to handle user API requests
> (user data might be encapsulated and then sent) probably ought
> to have 'kern == 0' even though the socket is not directly exposed
> through the fd table.

tun/tap is counterexample, where sk_alloc(kern=0) is directly
used instead of sock_create(), then no LSM is invoked.


> 
> >   3) some subsystems use sock_create_kern() and convert the sockets
> >      to hold netns refcnt (cifs, mptcp, rds, smc, and sunrpc)
> > 
> >   4) the sockets of 2) and 3) are counted in /proc/net/sockstat even
> >      though they are untouchable from userspace
> > 
> > The primary goal is to sort out such confusion and provide enough
> > documentation for future developers to choose an appropriate API.
> > 
> > Regarding 3), we introduce a new API, sock_create_net(), that holds
> > a netns refcnt for kernel socket to remove the socket conversion to
> > avoid use-after-free triggered by TCP kernel socket after commit
> > 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns
> > of kernel sockets.").
> > 
> > Throughout the series, we follow the definition below:
> > 
> >   userspace socket:
> >     * created by sock_create_user()
> >     * holds the reference count of the network namespace
> >     * directly linked to a file descriptor
> >     * accessed via a file descriptor (and some BPF hooks)
> >     * counted in the first line of /proc/net/sockstat.
> > 
> >   kernel socket
> >     * created by sock_create_net() or sock_create_net_noref()
> >       * the former holds the refcnt of netns, but the latter doesn't
> >     * not directly exposed to userspace via a file descriptor nor BPF
> 
> That isn't really right:
> A 'userspace socket' (kern == 0) is a socket created to support a user's
> API calls. Typically (but not always) directly linked to a file descriptor.
> Security/permission checks include those of the current user/process.
> 
> A 'kernel socket' (kern == 1) is a socket that isn't related to a user process.
> These fall into two groups:
> 1) Normal TCP (etc) sockets used by things like remote filesystems.
>    These need to hold a reference to the network namespace and will stop
>    the namespace being deleted.

You'll see all of the sockets using sock_create() in the patch below
are of this type.  (infiniband, ISDN, NVMe over TCP, iscsi, Xen PV call,
ocfs2, smbd)
https://lore.kernel.org/netdev/20241206075504.24153-14-kuniyu@amazon.com/

Only one valid user of sock_create() is SCTP, and it's "created to support
users's API call" and actually tied to a file descriptor.
https://lore.kernel.org/netdev/20241206075504.24153-15-kuniyu@amazon.com/

>From this POV, I think tun/tap also should use kern=true && hold_net=true.
No 'socket' is mentioned in tun/tap doc.  Users need not be aware of the
underlying sockets.

  $ grep socket Documentation/networking/tuntap.rst
  $ echo $?
  1

> 2) Special sockets used internally (perhaps for routing).
>    These don't hold a reference, but require the caller have a callback
>    for the namespace being deleted and must close the socket before
>    the callback returns.
>    The close must be fully synchronous - FIN_WAIT states will break things.
> 
> > Note that I didn't CC maintainers for mechanical changes as the CC
> > list explodes.
> 
> This whole change (which is probably worth it) need doing in a different
> order.

I don't think it's worth it, and you are inconsistent.

This series is a revival of my old patch, where I tried to
override the kern parameter.

Interestingly, you suggested adding another parameter then,
that was opposite of what you are suggesting now.
https://lore.kernel.org/netdev/20240227011041.97375-4-kuniyu@amazon.com/


> Start with the actual change to the socket create code and then work
> back through the callers.
> That may require adding wrapper functions for the existing calls that
> can then finally be deleted in the last patch.
> 
> Additionally boolean parameters need to be banned, multiple ones
> are worse - you really can tell from the call site what they mean.

As mentioned above, this is intentional to break out-of-tree drivers
and give the owners a chance to look at this change.

For example, we provide Lustre service with an out-of-tree driver
where sock_crete_kern() is used and needs update along future kernel
update.

Also, changing param let compilers inspect all sock_create_XXX()
users, and there was a weird path like smc_ulp_init().
https://lore.kernel.org/netdev/20241206075504.24153-4-kuniyu@amazon.com/


> 
> So change the boolean 'kern' parameter to a flags one.
> You then want 0 to be the normal case (user socket that holds a ref).
> Then add (say):
> 	#define SOCK_CREATE_NO_NS_REF 1
> 	#define SOCK_CREATE_KERN      2
> Initially add to the top of sk_alloc():
> 	if (flags & SOCK_CREATE_NOREF)
> 		flags |= SOCK_CREATE_KERN;
> Now all the code compiles and works as before.
> 
> Next find all the call sites (especially those that pass 1)
> and change so they pass the correct flag combination.

Changing param allows us to effectively inspect these callers with
help of the compiler.


> (Some static inlines/#defines might help with long lines.)
> 
> Finally change sk_alloc() to return NULL if NO_NS_REF
> is set without KERN.

This is no-go, no one wants to debug a weird -ENOMEM.

I added DEBUG_NET_WARN_ON_ONCE() for such a case so that
syzbot will let us know that pattern
https://lore.kernel.org/netdev/20241206075504.24153-10-kuniyu@amazon.com/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ