[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iJYF6S3XcfnxNcsPMjhFXz1naokJ+tLM1jSjjR6uco9bw@mail.gmail.com>
Date: Wed, 12 Oct 2022 09:31:44 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
David Ahern <dsahern@...nel.org>,
Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org,
syzbot <syzkaller@...glegroups.com>
Subject: Re: [PATCH v1 net] tcp: Clean up kernel listener's reqsk in inet_twsk_purge()
On Wed, Oct 12, 2022 at 7:51 AM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
>
> Eric Dumazet reported a use-after-free related to the per-netns ehash
> series. [0]
>
> When we create a TCP socket from userspace, the socket always holds a
> refcnt of the netns. This guarantees that a reqsk timer is always fired
> before netns dismantle. Each reqsk has a refcnt of its listener, so the
> listener is not freed before the reqsk, and the net is not freed before
> the listener as well.
>
> OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
> of its netns. Thus, a reqsk timer can be fired after the netns dismantle
> and access freed per-netns ehash.
Patch seems good, but changelog is incorrect.
1) we have a TCP listener (or more) on a netns
2) We receive SYN packets, creating SYN_RECV request sockets, added in
ehash table.
3) job is killed, TCP listener closed.
4) When a TCP listener is closed, we do not purge all SYN_RECV
requests sockets, because we rely
on normal per-request timer firing, then finding the listener is no
longer in LISTEN state -> drop the request socket.
(We do not maintain a per-listener list of request sockets, and
going through ehash would be quite expensive on busy servers)
5) netns is deleted (and optional TCP ehashinfo freed)
6) request socket timer fire, and wecrash while trying to unlink the
request socket from the freed ehash table.
In short, I think the case could happen with normal TCP sockets,
allocated from user space.
>
> To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
> in inet_twsk_purge() if the netns uses a per-netns ehash.
>
> [0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
>
> BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
> include/net/inet_hashtables.h:181 [inline]
> BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
> net/ipv4/inet_connection_sock.c:913
> Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
>
> CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
> 6.0.0-syzkaller-02757-gaf7d23f9d96a #0
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 09/22/2022
> Call Trace:
> <IRQ>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> print_address_description mm/kasan/report.c:317 [inline]
> print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
> kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
> tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
> reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
> inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
> inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
> reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
> call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
> expire_timers kernel/time/timer.c:1519 [inline]
> __run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
> __run_timers kernel/time/timer.c:1768 [inline]
> run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
> __do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
> invoke_softirq kernel/softirq.c:445 [inline]
> __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
> irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
> sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
> </IRQ>
>
> Fixes: d1e5e6408b30 ("tcp: Introduce optional per-netns ehash.")
> Reported-by: syzbot <syzkaller@...glegroups.com>
> Reported-by: Eric Dumazet <edumazet@...gle.com>
> Suggested-by: Eric Dumazet <edumazet@...gle.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> ---
> net/ipv4/inet_timewait_sock.c | 15 ++++++++++++++-
> net/ipv4/tcp_minisocks.c | 9 +++++----
> 2 files changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 71d3bb0abf6c..66fc940f9521 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -268,8 +268,21 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo, int family)
> rcu_read_lock();
> restart:
> sk_nulls_for_each_rcu(sk, node, &head->chain) {
> - if (sk->sk_state != TCP_TIME_WAIT)
> + if (sk->sk_state != TCP_TIME_WAIT) {
> + /* A kernel listener socket might not hold refcnt for net,
> + * so reqsk_timer_handler() could be fired after net is
> + * freed. Userspace listener and reqsk never exist here.
> + */
> + if (unlikely(sk->sk_state == TCP_NEW_SYN_RECV &&
> + hashinfo->pernet)) {
> + struct request_sock *req = inet_reqsk(sk);
> +
> + inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
> + }
> +
> continue;
> + }
> +
> tw = inet_twsk(sk);
> if ((tw->tw_family != family) ||
> refcount_read(&twsk_net(tw)->ns.count))
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 79f30f026d89..c375f603a16c 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -353,13 +353,14 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family)
> struct net *net;
>
> list_for_each_entry(net, net_exit_list, exit_list) {
> - /* The last refcount is decremented in tcp_sk_exit_batch() */
> - if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> - continue;
> -
> if (net->ipv4.tcp_death_row.hashinfo->pernet) {
> + /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
> } else if (!purged_once) {
> + /* The last refcount is decremented in tcp_sk_exit_batch() */
> + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> + continue;
> +
> inet_twsk_purge(&tcp_hashinfo, family);
> purged_once = true;
> }
> --
> 2.30.2
>
Powered by blists - more mailing lists