[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKHE=ucDztegP5TPHM=6e0JaGApQ7J==4r1Q9nffm+-sQ@mail.gmail.com>
Date: Tue, 10 Dec 2024 09:46:27 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org
Subject: Re: [PATCH v2 net-next 00/15] treewide: socket: Clean up
sock_create() and friends.
On Tue, Dec 10, 2024 at 8:38 AM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
>
> There are a bunch of weird usages of sock_create() and friends due
> to poor documentation.
>
> 1) some subsystems use __sock_create(), but all of them can be
> replaced with sock_create_kern()
>
> 2) some subsystems use sock_create(), but most of the sockets are
> not tied to userspace processes nor exposed via file descriptors
> but are (most likely unintentionally) exposed to some BPF hooks
> (infiniband, ISDN, NVMe over TCP, iscsi, Xen PV call, ocfs2, smbd)
>
> 3) some subsystems use sock_create_kern() and convert the sockets
> to hold netns refcnt (cifs, mptcp, rds, smc, and sunrpc)
>
> 4) the sockets of 2) and 3) are counted in /proc/net/sockstat even
> though they are untouchable from userspace
>
> The primary goal is to sort out such confusion and provide enough
> documentation for future developers to choose an appropriate API.
>
> Regarding 3), we introduce a new API, sock_create_net(), that holds
> a netns refcnt for kernel socket to remove the socket conversion to
> avoid use-after-free triggered by TCP kernel socket after commit
> 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns
> of kernel sockets.").
>
> Finally, we rename sock_create() and sock_create_kern() to
> sock_create_user() and sock_create_net_noref(), respectively.
> This intentionally breaks out-of-tree drivers to give the owners
> a chance to choose an appropriate API.
>
> Throughout the series, we follow the definition below:
>
> userspace socket:
> * created by sock_create_user()
> * holds the reference count of the network namespace
> * directly linked to a file descriptor
> * currently all sockets created by sane sock_create() users
> are tied to userspace process and exposed via file descriptors
> * accessed via a file descriptor (and some BPF hooks except
> for BPF LSM)
> * counted in the first line of /proc/net/sockstat.
>
> kernel socket
> * created by sock_create_net() or sock_create_net_noref()
> * the former holds the refcnt of netns, but the latter doesn't
> * not directly exposed to userspace via a file descriptor nor BPF
> except for BPF LSM
>
> Note that __sock_create(kern=1) skips some LSMs (SELinux, AppArmor)
> but not all; BPF LSM can enforce security regardless of the argument.
>
> I didn't CC maintainers for mechanical changes as the CC list explodes.
>
>
> Changes:
> v2:
> * Patch 8
> * Fix build error for PF_IUCV
> * Patch 12
> * Collect Acked-by from MPTCP/RDS maintainers
>
> v1: https://lore.kernel.org/netdev/20241206075504.24153-1-kuniyu@amazon.com/
>
My concern with this huge refactoring is the future backports hassle,
before going to the review part...
Also the change about counting or not 'kernel sockets' in
/proc/net/sockstat is orthogonal.
Powered by blists - more mailing lists