Message-ID: <CANn89i+vQKAwhZQMLMJO_ziR5gcsDTbhf91-k0aYNg4exTDHVg@mail.gmail.com>
Date: Thu, 23 May 2024 11:36:54 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Jason Xing <kerneljasonxing@...il.com>
Cc: dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com, 
	davem@...emloft.net, kuniyu@...zon.com, netdev@...r.kernel.org, 
	Jason Xing <kernelxing@...cent.com>, 
	syzbot+2eca27bdcb48ed330251@...kaller.appspotmail.com
Subject: Re: [PATCH v2 net] tcp: fix a race when purging the netns and
 allocating tw socket

On Thu, May 23, 2024 at 11:24 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Thu, May 23, 2024 at 3:19 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, May 23, 2024 at 7:04 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@...cent.com>
> > >
> > > Syzbot [1] reported a reference-count decrement hitting zero and leaking memory.
> > >
> > > Race condition:
> > >    CPU 0                      CPU 1
> > >    -----                      -----
> > > inet_twsk_purge            tcp_time_wait
> > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > >                            mod_timer(tw_timer...
> > > del_timer_sync
> > > inet_twsk_kill
> > > refcount_dec(tw_refcount)[1]
> > >                            refcount_inc(tw_refcount)[2]
> > >
> > > The race happens because the decrement at [1] runs before the increment at [2].
> > >
> > > After we reorder mod_timer() and refcount_inc() in the initialization
> > > phase, we can use the status of the timer as an indicator of whether to
> > > destroy the tw socket in inet_twsk_purge() or postpone that to
> > > tw_timer_handler().
> > >
> > > After this patch is applied, there are four possible cases:
> > > 1) if we can see the armed timer during the initialization phase
> > >    CPU 0                      CPU 1
> > >    -----                      -----
> > > inet_twsk_purge            tcp_time_wait
> > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > >                            refcount_inc(tw_refcount)
> > >                            mod_timer(tw_timer...
> > > test if the timer is queued
> > > //timer is queued
> > > del_timer_sync
> > > inet_twsk_kill
> > > refcount_dec(tw_refcount)
> > > Note: we finish it up in the purge process.
> > >
> > > 2) if we fail to see the armed timer during the initialization phase
> > >    CPU 0                      CPU 1
> > >    -----                      -----
> > > inet_twsk_purge            tcp_time_wait
> > > inet_twsk_deschedule_put   __inet_twsk_schedule
> > >                            refcount_inc(tw_refcount)
> > > test if the timer is queued
> > > //timer isn't queued
> > > postpone
> > >                            mod_timer(tw_timer...
> > > Note: later, in another context, the expired timer will finish up the tw socket
> > >
> > > 3) if we're seeing a running timer after the initialization phase
> > >    CPU 0                      CPU 1                    CPU 2
> > >    -----                      -----                    -----
> > >                            tcp_time_wait
> > >                            __inet_twsk_schedule
> > >                            refcount_inc(tw_refcount)
> > >                            mod_timer(tw_timer...
> > >                            ...(around 1 min)...
> > > inet_twsk_purge
> > > inet_twsk_deschedule_put
> > > test if the timer is queued
> > > // timer is running
> > > skip                                              tw_timer_handler
> > > Note: CPU 2 is destroying the timewait socket
> > >
> > > 4) if we're seeing a pending timer after the initialization phase
> > >    CPU 0                      CPU 1
> > >    -----                      -----
> > >                            tcp_time_wait
> > >                            __inet_twsk_schedule
> > >                            refcount_inc(tw_refcount)
> > >                            mod_timer(tw_timer...
> > >                            ...(< 1 min)...
> > > inet_twsk_purge
> > > inet_twsk_deschedule_put
> > > test if the timer is queued
> > > // timer is queued
> > > del_timer_sync
> > > inet_twsk_kill
> > >
> > > Therefore, by making sure that either inet_twsk_purge() or
> > > tw_timer_handler() destroys the timewait socket, we can handle all
> > > of the cases above.
> > >
> > > [1]
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > Modules linked in:
> > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > Call Trace:
> > >  <TASK>
> > >  __refcount_dec include/linux/refcount.h:336 [inline]
> > >  refcount_dec include/linux/refcount.h:351 [inline]
> > >  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > >  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > >  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > >  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > >  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > >  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > >  setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > >  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > >  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > >  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > >  ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > >  __do_sys_unshare kernel/fork.c:3394 [inline]
> > >  __se_sys_unshare kernel/fork.c:3392 [inline]
> > >  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > >  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > RIP: 0033:0x7f56d7c7cee9
> > >
> > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > Reported-by: syzbot+2eca27bdcb48ed330251@...kaller.appspotmail.com
> > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > ---
> > > v2
> > > Link: https://lore.kernel.org/all/20240521144930.23805-1-kerneljasonxing@gmail.com/
> > > 1. Use the timer as a flag to test whether we can safely destroy the
> > > timewait socket, based on top of the patch Eric wrote.
> > > 2. Change the title and add more explanation to the body message.
> > > ---
> > >  net/ipv4/inet_timewait_sock.c | 11 +++++++++--
> > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index e28075f0006e..b890d1c280a1 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
> > >
> > >                 __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > >                                                      LINUX_MIB_TIMEWAITED);
> > > -               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > >                 refcount_inc(&tw->tw_dr->tw_refcount);
> > > +               BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > >         } else {
> > >                 mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > >         }
> > > @@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
> > >                         rcu_read_unlock();
> > >                         local_bh_disable();
> > >                         if (state == TCP_TIME_WAIT) {
> > > -                               inet_twsk_deschedule_put(inet_twsk(sk));
> > > +                               struct inet_timewait_sock *tw = inet_twsk(sk);
> > > +
> > > +                               /* If the timer is armed, we can safely destroy
> > > +                                * it, or else we postpone the process of destruction
> > > +                                * to tw_timer_handler().
> > > +                                */
> > > +                               if (timer_pending(&tw->tw_timer))
> > > +                                       inet_twsk_deschedule_put(tw);
> >
> >
> > This patch is not needed, and a timer_pending() check would be racy anyway.
> >
> > As already explained, del_timer_sync() takes care of this with proper locking.
> >
> > inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > {
> >   if (del_timer_sync(&tw->tw_timer))
> >      inet_twsk_kill(tw);
> >   inet_twsk_put(tw);
> > }
>
> Sorry, I'm lost. But now I understand that what I wrote this morning is
> not correct...
>
> After rethinking about this, only reordering the mod_timer() and
> refcount_inc() in the initialization phase (just like the patch you
> wrote) can ensure safety.
>
> CPU 0 uses del_timer_sync() to detect whether the timer is running while
> CPU 1 calls mod_timer() and then increments the refcount.
>
> 1) If CPU 0 sees the queued timer, which means the tw socket has
> already incremented the refcount, then CPU 0 can kill
> (inet_twsk_kill()) the socket.
>
> 2) If CPU 0 cannot see the queued timer (and will not kill the
> socket), then CPU 1 will call mod_timer() and the timer will expire in
> one minute. Another CPU will then call inet_twsk_kill() to clear the
> tw socket.
>
> The patch you provided seems to solve the race issue in the right
> way... I cannot see a problem with it :(
>
> So what problem is your patch meant to solve? Thanks for your
> patience and help.

I already explained that when we do the mod_timer() and
refcount_inc(&tw->tw_dr->tw_refcount), the tw refcount is 0, and no
other CPU can possibly use the tw.

So the order of these operations does not really seem relevant.

Another CPU, before trying to use the tw, would first have to acquire
a reference on the tw refcount, and this is not allowed/possible.

inet_twsk_purge() would hit this failed attempt:

if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
     continue;

I think we should wait for more syzbot occurrences before spending
more time on this.

It is possible the single report was caused by unrelated memory
mangling in the kernel, or by some more fundamental issue involving
kernel listeners and sockets.
