[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241210094716.76970-1-kuniyu@amazon.com>
Date: Tue, 10 Dec 2024 18:47:16 +0900
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <edumazet@...gle.com>
CC: <davem@...emloft.net>, <horms@...nel.org>, <kuba@...nel.org>,
<kuni1840@...il.com>, <kuniyu@...zon.com>, <netdev@...r.kernel.org>,
<pabeni@...hat.com>
Subject: Re: [PATCH v2 net-next 00/15] treewide: socket: Clean up sock_create() and friends.
From: Eric Dumazet <edumazet@...gle.com>
Date: Tue, 10 Dec 2024 09:46:27 +0100
> On Tue, Dec 10, 2024 at 8:38 AM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> >
> > There are a bunch of weird usages of sock_create() and friends due
> > to poor documentation.
> >
> > 1) some subsystems use __sock_create(), but all of them can be
> > replaced with sock_create_kern()
> >
> > 2) some subsystems use sock_create(), but most of the sockets are
> > not tied to userspace processes nor exposed via file descriptors
> > but are (most likely unintentionally) exposed to some BPF hooks
> > (infiniband, ISDN, NVMe over TCP, iscsi, Xen PV call, ocfs2, smbd)
> >
> > 3) some subsystems use sock_create_kern() and convert the sockets
> > to hold netns refcnt (cifs, mptcp, rds, smc, and sunrpc)
> >
> > 4) the sockets of 2) and 3) are counted in /proc/net/sockstat even
> > though they are untouchable from userspace
> >
> > The primary goal is to sort out such confusion and provide enough
> > documentation for future developers to choose an appropriate API.
> >
> > Regarding 3), we introduce a new API, sock_create_net(), that holds
> > a netns refcnt for kernel socket to remove the socket conversion to
> > avoid use-after-free triggered by TCP kernel socket after commit
> > 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns
> > of kernel sockets.").
> >
> > Finally, we rename sock_create() and sock_create_kern() to
> > sock_create_user() and sock_create_net_noref(), respectively.
> > This intentionally breaks out-of-tree drivers to give the owners
> > a chance to choose an appropriate API.
> >
> > Throughout the series, we follow the definition below:
> >
> > userspace socket:
> > * created by sock_create_user()
> > * holds the reference count of the network namespace
> > * directly linked to a file descriptor
> > * currently all sockets created by sane sock_create() users
> > are tied to userspace process and exposed via file descriptors
> > * accessed via a file descriptor (and some BPF hooks except
> > for BPF LSM)
> > * counted in the first line of /proc/net/sockstat.
> >
> > kernel socket
> > * created by sock_create_net() or sock_create_net_noref()
> > * the former holds the refcnt of netns, but the latter doesn't
> > * not directly exposed to userspace via a file descriptor nor BPF
> > except for BPF LSM
> >
> > Note that __sock_create(kern=1) skips some LSMs (SELinux, AppArmor)
> > but not all; BPF LSM can enforce security regardless of the argument.
> >
> > I didn't CC maintainers for mechanical changes as the CC list explodes.
> >
> >
> > Changes:
> > v2:
> > * Patch 8
> > * Fix build error for PF_IUCV
> > * Patch 12
> > * Collect Acked-by from MPTCP/RDS maintainers
> >
> > v1: https://lore.kernel.org/netdev/20241206075504.24153-1-kuniyu@amazon.com/
> >
>
> My concern with this huge refactoring is the future backports hassle,
> before going to the review part...
I understand that concern as a distributor, that's why I waited the
sunrpc fix to be merged into net-next, which I believe is the last
problematic caller of sock_create_kern(AF_INET6?, SOCK_STREAM).
Also, I think socket()/accept()/sk_alloc() are unlikely to have
many bugs and get updated so frequently.
$ git log --oneline v5.4.286...v5.4 -- net/socket.c | wc -l
13
$ git log --oneline v5.10.230...v5.10 -- net/socket.c | wc -l
13
$ git log --oneline v5.15.173...v5.15 -- net/socket.c | wc -l
8
$ git log --oneline v6.1.119...v6.1 -- net/socket.c | wc -l
13
b7d22a79ff4e ("net: explicitly clear the sk pointer, when pf->create
fails") is the only patch touching sock_create(). (in 5.15+)
Recently, I backported ef7134c7fc48 ("smb: client: Fix use-after-free
of network namespace.") to our 5.4/5.10/5.15/6.1 trees (and will post
them to upstream stable trees) and it didn't have conflicts in terms
of networking code.
>
> Also the change about counting or not 'kernel sockets' in
> /proc/net/sockstat is orthogonal.
I dug a bit more history.
/proc/net/sockstat was initially a global counter and later
it's namespacified by 648845ab7e200 in 2017.
Before the commit, the global counter included all sockets.
But 26abe14379f8 changed kernel sockets not to hold netns
refcount in 2015, so we can't decrement the namespacified
counter during socket destruction for kernel sockets.
Then, the namespacified sockstat was changed to include sockets
holding netns refcnt, and it was only userspace sockets until the
first kernel socket conversion.
So, the option would be either of
1) keep as is (counting userspace sockets and kernel sockets
with net refcnt)
2) count userspace sockets only
3) count all sockets (and detect callers of sock_create_kern()
who touches kernel sockets after netns is freed, with help
of syzbot and KMSAN/KASAN)
3) is invasive but seems doable considering now we have
ref_tracker_dir_exit() for kernel sockets in net_free().
What do you think ?
Powered by blists - more mailing lists