lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAL+tcoC0wHy=qRgNd8bpRNff1gA6eZ_Mv0ba9b2Ce9hhVOqN_g@mail.gmail.com>
Date: Thu, 23 May 2024 20:15:52 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com, 
	davem@...emloft.net, kuniyu@...zon.com, netdev@...r.kernel.org, 
	Jason Xing <kernelxing@...cent.com>, 
	syzbot+2eca27bdcb48ed330251@...kaller.appspotmail.com
Subject: Re: [PATCH v2 net] tcp: fix a race when purging the netns and
 allocating tw socket

On Thu, May 23, 2024 at 5:37 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, May 23, 2024 at 11:24 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > From: Jason Xing <kernelxing@...cent.com>
> > > >
> > > > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> > > >
> > > > Race condition:
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > > inet_twsk_purge            tcp_time_wait
> > > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > > >                            mod_timer(tw_timer...
> > > > del_timer_sync
> > > > inet_twsk_kill
> > > > refcount_dec(tw_refcount)[1]
> > > >                            refcount_inc(tw_refcount)[2]
> > > >
> > > > Race case happens because [1] decrements refcount first before [2].
> > > >
> > > > After we reorder the mod_timer() and refcount_inc() in the initialization
> > > > phase, we can use the status of timer as an indicator to test if we want
> > > > to destroy the tw socket in inet_twsk_purge() or postpone it to
> > > > tw_timer_handler().
> > > >
> > > > After this patch applied, we get four possible cases:
> > > > 1) if we can see the armed timer during the initialization phase
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > > inet_twsk_purge            tcp_time_wait
> > > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > >                            mod_timer(tw_timer...
> > > > test if the timer is queued
> > > > //timer is queued
> > > > del_timer_sync
> > > > inet_twsk_kill
> > > > refcount_dec(tw_refcount)
> > > > Note: we finish it up in the purge process.
> > > >
> > > > 2) if we fail to see the armed timer during the initialization phase
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > > inet_twsk_purge            tcp_time_wait
> > > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > > test if the timer is queued
> > > > //timer isn't queued
> > > > postpone
> > > >                            mod_timer(tw_timer...
> > > > Note: later, in another context, expired timer will finish up tw socket
> > > >
> > > > 3) if we're seeing a running timer after the initialization phase
> > > >    CPU 0                      CPU 1                    CPU 2
> > > >    -----                      -----                    -----
> > > >                            tcp_time_wait
> > > >                            __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > >                            mod_timer(tw_timer...
> > > >                            ...(around 1 min)...
> > > > inet_twsk_purge
> > > > inet_twsk_deschedule_put
> > > > test if the timer is queued
> > > > // timer is running
> > > > skip                                              tw_timer_handler
> > > > Note: CPU 2 is destroying the timewait socket
> > > >
> > > > 4) if we're seeing a pending timer after the initialization phase
> > > >    CPU 0                      CPU 1
> > > >    -----                      -----
> > > >                            tcp_time_wait
> > > >                            __inet_twsk_schedule
> > > >                            refcount_inc(tw_refcount)
> > > >                            mod_timer(tw_timer...
> > > >                            ...(< 1 min)...
> > > > inet_twsk_purge
> > > > inet_twsk_deschedule_put
> > > > test if the timer is queued
> > > > // timer is queued
> > > > del_timer_sync
> > > > inet_twsk_kill
> > > >
> > > > Therefore, only making sure that we either call inet_twsk_purge() or
> > > > call tw_timer_handler() to destroy the timewait socket, we can
> > > > handle all the cases as above.
> > > >
> > > > [1]
> > > > refcount_t: decrement hit 0; leaking memory.
> > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > Modules linked in:
> > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > > Call Trace:
> > > >  <TASK>
> > > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > > >  refcount_dec include/linux/refcount.h:351 [inline]
> > > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > RIP: 0033:0x7f56d7c7cee9
> > > >
> > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > > Reported-by: syzbot+2eca27bdcb48ed330251@...kaller.appspotmail.com
> > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > > ---
> > > > v2
> > > > Link: https://lore.kernel.org/all/20240521144930.23805-1-kerneljasonxing@gmail.com/
> > > > 1. Use timer as a flag to test if we can safely destroy the timewait socket
> > > > based on top of the patch Eric wrote.
> > > > 2. change the title and add more explanation into body message.
> > > > ---
> > > >  net/ipv4/inet_timewait_sock.c | 11 +++++++++--
> > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > > index e28075f0006e..b890d1c280a1 100644
> > > > --- a/net/ipv4/inet_timewait_sock.c
> > > > +++ b/net/ipv4/inet_timewait_sock.c
> > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
> > > >
> > > >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > > >                                                      LINUX_MIB_TIMEWAITED);
> > > > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > > > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > >         } else {
> > > >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > > >         }
> > > > @@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
> > > >                         rcu_read_unlock();
> > > >                         local_bh_disable();
> > > >                         if (state == TCP_TIME_WAIT) {
> > > > -                               inet_twsk_deschedule_put(inet_twsk(sk));
> > > > +                               struct inet_timewait_sock *tw = inet_twsk(sk);
> > > > +
> > > > +                               /* If the timer is armed, we can safely destroy
> > > > +                                * it, or else we postpone the process of destruction
> > > > +                                * to tw_timer_handler().
> > > > +                                */
> > > > +                               if (timer_pending(&tw->tw_timer))
> > > > +                                       inet_twsk_deschedule_put(tw);
> > >
> > >
> > > This patch is not needed, and a timer_pending() would be racy anywau.
> > >
> > > As already explained, del_timer_sync() takes care of this with proper locking.
> > >
> > > inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > > {
> > >   if (del_timer_sync(&tw->tw_timer))
> > >      inet_twsk_kill(tw);
> > >   inet_twsk_put(tw);
> > > }
> >
> > Sorry, I'm lost. But now I understand what I wrote in the morning is
> > not correct...
> >
> > After rethinking about this, only reordering the mod_timer() and
> > refcount_inc() in the initialization phase (just like the patch you
> > wrote) can ensure safety.
> >
> > CPU 0 uses del_timer_sync() to detect if the timer is running while
> > CPU 1 mod_timer() and then increments the refcount.
> >
> > 1) If CPU 0 sees the queued timer which means the tw socket has done
> > incrementing refcount, then CPU 0 can kill (inet_twsk_kill()) the
> > socket.
> >
> > 2) If CPU 0 cannot see the queued timer (and will not kill the
> > socket), then CPU 1 will call mod_timer() and expire in one minute.
> > Another cpu will call inet_twsk_kill() to clear tw socket at last.
> >
> > The patch you provided seems to solve the race issue in the right
> > way... I cannot see the problem of it :(
> >
> > So what is the problem to be solved with your patch? Thanks for your
> > patience and help.
>
> I already explained that when we do the mod_timer() and
> refcount_inc(&tw->tw_dr->tw_refcount),
> the tw refcount is 0, and no other cpu can possibly use the tw.
>
> So the order of these operations seems not really relevant.
>
> Another cpu, before trying to use it would have first to acquire a
> reference on tw refcount,
> and this is not allowed/possible.
>
> inet_twsk_purge() would hit this at this failed attempt.
>
> if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>      continue;
>
> I think we should wait for more syzbot occurrences before spending
> more time on this.
>
> It is possible the single report was caused by unrelated memory
> mangling in the kernel,
> or some more fundamental issues caused by kernel listeners and sockets.

Thank you so much, Eric!! I learn a lot.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ