netdev - Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <643ff7a7a551f_38347529475@willemb.c.googlers.com.notmuch>
Date:   Wed, 19 Apr 2023 10:16:07 -0400
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Eric Dumazet <edumazet@...gle.com>,
        Kuniyuki Iwashima <kuniyu@...zon.com>
Cc:     davem@...emloft.net, kuba@...nel.org, kuni1840@...il.com,
        netdev@...r.kernel.org, pabeni@...hat.com,
        syzkaller@...glegroups.com, willemb@...gle.com
Subject: Re: [PATCH v2 net] tcp/udp: Fix memleaks of sk and zerocopy skbs with
 TX timestamp.

Eric Dumazet wrote:
> On Tue, Apr 18, 2023 at 9:25 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> >
> > From:   Eric Dumazet <edumazet@...gle.com>
> > Date:   Tue, 18 Apr 2023 21:04:32 +0200
> > > On Tue, Apr 18, 2023 at 8:44 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > > >
> > > > From:   Eric Dumazet <edumazet@...gle.com>
> > > > Date:   Tue, 18 Apr 2023 20:33:44 +0200
> > > > > On Tue, Apr 18, 2023 at 8:09 PM Kuniyuki Iwashima <kuniyu@...zon.com> wrote:
> > > > > >
> > > > > > syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
> > > > > > skbs.  We can reproduce the problem with these sequences:
> > > > > >
> > > > > >   sk = socket(AF_INET, SOCK_DGRAM, 0)
> > > > > >   sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
> > > > > >   sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
> > > > > >   sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
> > > > > >   sk.close()
> > > > > >
> > > > > > sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
> > > > > > skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
> > > > > > ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
> > > > > > skb is sent, __skb_tstamp_tx() clones it and puts the clone into
> > > > > > the socket's error queue with the TX timestamp.
> > > > > >
> > > > > > When the original skb is received locally, skb_copy_ubufs() calls
> > > > > > skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
> > > > > > This additional count is decremented while freeing the skb, but struct
> > > > > > ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
> > > > > > not called.
> > > > > >
> > > > > > The last refcnt is not released unless we retrieve the TX timestamped
> > > > > > skb by recvmsg().  When we close() the socket holding such skb, we
> > > > > > never call sock_put() and leak the count, skb, and sk.
> > > > > >
> > > > > > To avoid this problem, we must (i) call skb_queue_purge() after
> > > > > > flagging SOCK_DEAD during close() and (ii) make sure that TX tstamp
> > > > > > skb is not queued when SOCK_DEAD is flagged.  UDP lacks (i) and (ii),
> > > > > > and TCP lacks (ii).
> > > > > >
> > > > > > Without (ii), a skb queued in a qdisc or device could be put into
> > > > > > the error queue after skb_queue_purge().
> > > > > >
> > > > > >   sendmsg() /* return immediately, but packets
> > > > > >              * are queued in a qdisc or device
> > > > > >              */
> > > > > >                                     close()
> > > > > >                                       skb_queue_purge()
> > > > > >   __skb_tstamp_tx()
> > > > > >     __skb_complete_tx_timestamp()
> > > > > >       sock_queue_err_skb()
> > > > > >         skb_queue_tail()
> > > > > >
> > > > > > Also, we need to check SOCK_DEAD under sk->sk_error_queue.lock
> > > > > > in sock_queue_err_skb() to avoid this race.
> > > > > >
> > > > > >   if (!sock_flag(sk, SOCK_DEAD))
> > > > > >                                     sock_set_flag(sk, SOCK_DEAD)
> > > > > >                                     skb_queue_purge()
> > > > > >
> > > > > >     skb_queue_tail()
> > > > > >
> > > > > > [0]:
> > > > >
> > > > > > Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
> > > > > > Fixes: b5947e5d1e71 ("udp: msg_zerocopy")
> > > > > > Reported-by: syzbot <syzkaller@...glegroups.com>
> > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> > > > > > ---
> > > > > > v2:
> > > > > >   * Move skb_queue_purge() after setting SOCK_DEAD in udp_destroy_sock()
> > > > > >   * Check SOCK_DEAD in sock_queue_err_skb() with sk_error_queue.lock
> > > > > >   * Add Fixes tag for TCP
> > > > > >
> > > > > > v1: https://lore.kernel.org/netdev/20230417171155.22916-1-kuniyu@amazon.com/
> > > > > > ---
> > > > > >  net/core/skbuff.c | 15 ++++++++++++---
> > > > > >  net/ipv4/udp.c    |  5 +++++
> > > > > >  2 files changed, 17 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > > index 4c0879798eb8..287b834df9c8 100644
> > > > > > --- a/net/core/skbuff.c
> > > > > > +++ b/net/core/skbuff.c
> > > > > > @@ -4979,6 +4979,8 @@ static void skb_set_err_queue(struct sk_buff *skb)
> > > > > >   */
> > > > > >  int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > >  {
> > > > > > +       unsigned long flags;
> > > > > > +
> > > > > >         if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
> > > > > >             (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > > >                 return -ENOMEM;
> > > > > > @@ -4992,9 +4994,16 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
> > > > > >         /* before exiting rcu section, make sure dst is refcounted */
> > > > > >         skb_dst_force(skb);
> > > > > >
> > > > > > -       skb_queue_tail(&sk->sk_error_queue, skb);
> > > > > > -       if (!sock_flag(sk, SOCK_DEAD))
> > > > > > -               sk_error_report(sk);
> > > > > > +       spin_lock_irqsave(&sk->sk_error_queue.lock, flags);
> > > > > > +       if (sock_flag(sk, SOCK_DEAD)) {
> > > > >
> > > > > SOCK_DEAD is set without holding sk_error_queue.lock, so I wonder why you
> > > > > want to add a confusing construct.
> > > > >
> > > > > Just bail early ?
> > > > >
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index ef81452759be3fd251faaf76d89cfd002ee79256..fda05cb44f95821e98f8c5c05fba840a9d276abb
> > > > > 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -4983,6 +4983,9 @@ int sock_queue_err_skb(struct sock *sk, struct
> > > > > sk_buff *skb)
> > > > >             (unsigned int)READ_ONCE(sk->sk_rcvbuf))
> > > > >                 return -ENOMEM;
> > > > >
> > > > > +       if (sock_flag(sk, SOCK_DEAD))
> > > > > +               return -EINVAL;
> > > > > +
> > > >
> > > > Isn't it possible that these sequences happen
> > > >
> > > >   close()
> > > >     sock_set_flag(sk, SOCK_DEAD);
> > > >     skb_queue_purge(&sk->sk_error_queue)
> > > >
> > > > between the skb_queue_tail() below ? (2nd race mentioned in changelog)
> > > >
> > > > I thought we can guarantee the ordering by taking the same lock.
> > > >
> > >
> > > This is fragile.
> >
> > Yes, but I didn't have better idea to avoid the race...
> >
> > >
> > > We could very well rewrite skb_queue_purge() to not acquire the lock
> > > in the common case.
> > > I had the following in my tree for a while, to avoid many atomic and
> > > irq masking operations...
> >
> > Cool, and it still works with my patch, no ?
> >
> 
> 
> Really the only thing that ensures a race is not possible is the
> typical sk_refcnt acquisition.
> 
> But I do not see why an skb stored in error_queue should keep the
> refcnt on the socket.
> This seems like a chicken and egg problem, and caused various issues
> in the past,
> see for instance [1]
> 
> We better make sure error queue is purged at socket dismantle (after
> refcnt reached 0)

The problem here is that the timestamp queued on the error queue
holds a reference on a ubuf if MSG_ZEROCOPY and that ubuf holds an
sk_ref.

The timestamped packet may contain packet contents, so the ubuf
ref is not superfluous.

Come to think of it, we've always maintained that zerocopy packets
should not be looped to sockets where they can be queued indefinitely,
including packet sockets.

If we enforce that for these tx timestamps too, then that also
solves this issue.

A process that wants efficient MSG_ZEROCOPY will have to request
timestamping with SOF_TIMESTAMPING_OPT_TSONLY to avoid returning the
data along with the timestamp.

> I suggest we generalize fix that went in this patch :
> 
> dd4f10722aeb10f4f582948839f066bebe44e5fb net: fix socket refcounting
> in skb_complete_wifi_ack()
> 
> Instead of adding yet another rule about a queue lock, and a socket
> flag, this would keep things tied to sk_refcnt.
> 
> Also I do not think TCP has an issue, please take a look at
> 
> [1]
> commit e0c8bccd40fc1c19e1d246c39bcf79e357e1ada3    net: stream: purge
> sk_error_queue in sk_stream_kill_queues()
> 
> (A leak in TCP would be caught in a few hours by syzbot really)