[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAL+tcoC+QhOcZjADhLjhdHoQczLTbruefAevy1+d8Tp+xMvi0Q@mail.gmail.com>
Date: Wed, 22 May 2024 23:37:26 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
davem@...emloft.net, kuniyu@...zon.com, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>,
syzbot+2eca27bdcb48ed330251@...kaller.appspotmail.com
Subject: Re: [PATCH net] Revert "rds: tcp: Fix use-after-free of net in reqsk_timer_handler()."
On Wed, May 22, 2024 at 6:54 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Wed, May 22, 2024 at 12:42 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Wed, May 22, 2024 at 2:51 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > >
> > > On Wed, May 22, 2024 at 8:28 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > Hello Eric,
> > > >
> > > > On Tue, May 21, 2024 at 11:49 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > > > >
> > > > > On Tue, May 21, 2024 at 4:49 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > >
> > > > > > From: Jason Xing <kernelxing@...cent.com>
> > > > > >
> > > > > > This reverts commit 2a750d6a5b365265dbda33330a6188547ddb5c24.
> > > > > >
> > > > > > Syzbot[1] reported the drecrement of reference count hits leaking memory.
> > > > > >
> > > > > > If we failed in setup_net() and try to undo the setup process, the
> > > > > > reference now is 1 which shouldn't be decremented. However, it happened
> > > > > > actually.
> > > > > >
> > > > > > After applying this patch which allows us to check the reference first,
> > > > > > it will not hit zero anymore in tcp_twsk_purge() without calling
> > > > > > inet_twsk_purge() one more time.
> > > > > >
> > > > > > [1]
> > > > > > refcount_t: decrement hit 0; leaking memory.
> > > > > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > > > Modules linked in:
> > > > > > CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> > > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > > > > > RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > > > > > RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> > > > > > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> > > > > > RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> > > > > > RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> > > > > > R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> > > > > > R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> > > > > > FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> > > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > __refcount_dec include/linux/refcount.h:336 [inline]
> > > > > > refcount_dec include/linux/refcount.h:351 [inline]
> > > > > > inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> > > > > > inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> > > > > > inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> > > > > > tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> > > > > > tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> > > > > > ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> > > > > > setup_net+0x714/0xb40 net/core/net_namespace.c:375
> > > > > > copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> > > > > > create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> > > > > > unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> > > > > > ksys_unshare+0x419/0x970 kernel/fork.c:3323
> > > > > > __do_sys_unshare kernel/fork.c:3394 [inline]
> > > > > > __se_sys_unshare kernel/fork.c:3392 [inline]
> > > > > > __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> > > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > > > > > do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> > > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > > > > RIP: 0033:0x7f56d7c7cee9
> > > > > >
> > > > > > Fixes: 2a750d6a5b36 ("rds: tcp: Fix use-after-free of net in reqsk_timer_handler().")
> > > > > > Reported-by: syzbot+2eca27bdcb48ed330251@...kaller.appspotmail.com
> > > > > > Closes: https://syzkaller.appspot.com/bug?extid=2eca27bdcb48ed330251
> > > > > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > > > > ---
> > > > > > The reverted patch trying to solve another issue causes unexpected error as above. I
> > > > > > think that issue can be properly analyzed and handled later. So can we revert it first?
> > > > > > ---
> > > > > > net/ipv4/tcp_minisocks.c | 4 ++++
> > > > > > 1 file changed, 4 insertions(+)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > > index b93619b2384b..46e6f9db4227 100644
> > > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > > @@ -399,6 +399,10 @@ void tcp_twsk_purge(struct list_head *net_exit_list)
> > > > > > /* Even if tw_refcount == 1, we must clean up kernel reqsk */
> > > > > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
> > > > > > } else if (!purged_once) {
> > > > > > + /* The last refcount is decremented in tcp_sk_exit_batch() */
> > > > > > + if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
> > > > > > + continue;
> > > > > > +
> > > > > > inet_twsk_purge(&tcp_hashinfo);
> > > > > > purged_once = true;
> > > > > > }
> > > > > > --
> > > > >
> > > > > This can not be a fix for a race condition.
> > > > >
> > > > > By definition a TW has a refcount on tcp_death_row.tw_refcount only
> > > > > if its timer is armed.
> > > > >
> > > > > And inet_twsk_deschedule_put() does
> > > > >
> > > > > if (del_timer_sync(&tw->tw_timer))
> > > > > inet_twsk_kill(tw);
> > > > >
> > > > > I think you need to provide a full explanation, instead of a shot in the dark.
> > > > >
> > > > > Before releasing this syzbot, I thought that maybe the refcount_inc()
> > > > > was done too late,
> > > > > but considering other invariants, this should not matter.
> > > > >
> > > > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > > > index e28075f0006e333897ad379ebc8c87fc3f9643bd..d8f4d93c45331be64d70af96de33f783870e1dcc
> > > > > 100644
> > > > > --- a/net/ipv4/inet_timewait_sock.c
> > > > > +++ b/net/ipv4/inet_timewait_sock.c
> > > > > @@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
> > > > > inet_timewait_sock *tw, int timeo, bool rearm)
> > > > >
> > > > > __NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
> > > > > LINUX_MIB_TIMEWAITED);
> > > > > - BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > > > refcount_inc(&tw->tw_dr->tw_refcount);
> > > > > + BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
> > > > > } else {
> > > > > mod_timer_pending(&tw->tw_timer, jiffies + timeo);
> > > > > }
> > > >
> > > > Thanks for your information.
> > > >
> > > > What you wrote is right, I think for a while. In
> > > > inet_twsk_deschedule_put(), there are two possible cases:
> > > > 1) if it finds the timer is armed, then it can kill the tw socket by
> > > > decrementing the refcount in the right way. So it's a good/lucky thing
> > > > for us.
> > > > or
> > > > 2) if it misses the point, then the tw socket arms the timer which
> > > > will expire in 60 seconds in the initialization phase. The tw socket
> > > > would have a chance to be destroyed when the timer expires.
> > > >
> > > > It seems that you don't think your code could solve the race issue?
> > >
> > > inet_twsk_schedule(tw) is called while the tw refcount is still 0, so
> > > this tw can be be found in inet_twsk_purge()
> > > at the same time. Look at inet_twsk_hashdance() for details.
> >
> > Yes, after inet_twsk_purge() finds the tw, there are two cases like my
> > previous email said after applying your code.
> >
> > For 1) case, everything is good. inet_twsk_purge() will finish it up
> > because it can decrement the refcount safely.
> >
> > For 2) case, even though inet_twsk_purge() finds it, it's not the time
> > to destroy it until the expired tw timer will finally handle the
> > process of destruction by calling inet_twsk_kill(), right? Let the
> > timer handle the destruction until its end of life, which I think is a
> > normal process for all the timewait sockets.
> >
> > The only difference in 2) case is that inet_twsk_purge() calls
> > inet_twsk_put() twice while tw_timer_handler() only calls it one time.
> >
> > Am I missing something?
>
> You are missing that inet_twsk_deschedule_put() is doing :
>
> if (del_timer_sync(&tw->tw_timer))
> inet_twsk_kill(tw);
>
> You can not have both tw_timer_handler() and
> inet_twsk_deschedule_put() calling inet_twsk_kill(tw);
[...]
>
> Take a look at del_timer_sync()
Oops, I missed this function...Thanks for pointing it out.
I can test if the timer is active or queued before we can delete, like
this on top of your patch:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e28075f0006e..b890d1c280a1 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -255,8 +255,8 @@ void __inet_twsk_schedule(struct
inet_timewait_sock *tw, int timeo, bool rearm)
__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
LINUX_MIB_TIMEWAITED);
- BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
refcount_inc(&tw->tw_dr->tw_refcount);
+ BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
} else {
mod_timer_pending(&tw->tw_timer, jiffies + timeo);
}
@@ -301,7 +301,14 @@ void inet_twsk_purge(struct inet_hashinfo *hashinfo)
rcu_read_unlock();
local_bh_disable();
if (state == TCP_TIME_WAIT) {
- inet_twsk_deschedule_put(inet_twsk(sk));
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+
+ /* If the timer is armed, we can safely destroy
+ * it, or else we postpone the process
of destruction
+ * to tw_timer_handler().
+ */
+ if (timer_pending(&tw->tw_timer))
+ inet_twsk_deschedule_put(tw);
} else {
struct request_sock *req = inet_reqsk(sk);
As above, if we find the timer is armed, then we can destroy it
naturally in inet_twsk_deschedule_put(). If not, we can skip it in the
inet_twsk_purge() and then postpone the destruction process in the
timer handler.
I think I need to add another test about whether the timer is running or not.
If testing the status of timer is not a good way to go, perhaps I
would like to introduce a new flag to indicate whether the tw socket
still exists in the initialization phase.
Sorry, It's __not__ tested because it's too late for me, I will spend
more time taking care of this race condition tomorrow.
Thanks,
Jason
Powered by blists - more mailing lists