Date: Wed, 03 Mar 2010 07:54:29 +0100
From: Eric Dumazet <eric.dumazet@...il.com>
To: Zhu Yi <yi.zhu@...el.com>
Cc: netdev@...r.kernel.org, David Miller <davem@...emloft.net>,
Arnaldo Carvalho de Melo <acme@...stprotocols.net>,
"Pekka Savola (ipv6)" <pekkas@...core.fi>,
Patrick McHardy <kaber@...sh.net>,
Vlad Yasevich <vladislav.yasevich@...com>,
Sridhar Samudrala <sri@...ibm.com>,
Per Liden <per.liden@...csson.com>,
Jon Maloy <jon.maloy@...csson.com>,
Allan Stephens <allan.stephens@...driver.com>,
Andrew Hendry <andrew.hendry@...il.com>
Subject: Re: [PATCH 1/8] net: add limit for socket backlog
On Wednesday, 03 March 2010 at 14:35 +0800, Zhu Yi wrote:
> We hit a system OOM while running UDP netperf tests on the loopback
> device. In this case, multiple senders stream UDP packets to a single
> receiver via loopback on the local host. Of course, the receiver cannot
> handle all the packets in time. But we were surprised to find that these
> packets were not discarded once the receiver's sk->sk_rcvbuf limit was
> reached. Instead, they kept queuing on sk->sk_backlog and eventually ate
> up all the memory. We believe this is a security hole that lets an
> unprivileged user crash the system.
>
> The root cause of this problem is that when the receiver runs
> __release_sock() (i.e. after a userspace recv, kernel udp_recvmsg ->
> skb_free_datagram_locked -> release_sock), it moves skbs from the backlog
> to sk_receive_queue with softirqs enabled. In the above case, multiple
> busy senders make this an almost endless loop. The skbs in the
> backlog end up eating all the system memory.
>
> The issue is not limited to UDP. Any protocol using the socket backlog is
> potentially affected. The patch adds a limit to the socket backlog so that
> the backlog cannot grow without bound.
>
> Reported-by: Alex Shi <alex.shi@...el.com>
> Cc: David Miller <davem@...emloft.net>
> Cc: Arnaldo Carvalho de Melo <acme@...stprotocols.net>
> Cc: Alexey Kuznetsov <kuznet@....inr.ac.ru>
> Cc: "Pekka Savola (ipv6)" <pekkas@...core.fi>
> Cc: Patrick McHardy <kaber@...sh.net>
> Cc: Vlad Yasevich <vladislav.yasevich@...com>
> Cc: Sridhar Samudrala <sri@...ibm.com>
> Cc: Per Liden <per.liden@...csson.com>
> Cc: Jon Maloy <jon.maloy@...csson.com>
> Cc: Allan Stephens <allan.stephens@...driver.com>
> Cc: Andrew Hendry <andrew.hendry@...il.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@...il.com>
> Signed-off-by: Zhu Yi <yi.zhu@...el.com>
Your Signed-off-by should come before mine: you are the main author of
this patch, and I am a contributor.
> ---
> include/net/sock.h | 17 +++++++++++++++--
> net/core/sock.c | 15 +++++++++++++--
> 2 files changed, 28 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 6cb1676..847119a 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -253,6 +253,8 @@ struct sock {
> struct {
> struct sk_buff *head;
> struct sk_buff *tail;
> + int len;
> + int limit;
This new limit field is really not needed.
> } sk_backlog;
> wait_queue_head_t *sk_sleep;
> struct dst_entry *sk_dst_cache;
> @@ -589,8 +591,8 @@ static inline int sk_stream_memory_free(struct sock *sk)
> return sk->sk_wmem_queued < sk->sk_sndbuf;
> }
>
> -/* The per-socket spinlock must be held here. */
> -static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
> +/* OOB backlog add */
> +static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
> {
> if (!sk->sk_backlog.tail) {
> sk->sk_backlog.head = sk->sk_backlog.tail = skb;
> @@ -601,6 +603,17 @@ static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
> skb->next = NULL;
> }
>
> +/* The per-socket spinlock must be held here. */
> +static inline int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
> +{
> + if (sk->sk_backlog.len >= max(sk->sk_backlog.limit, sk->sk_rcvbuf >> 1))
> + return -ENOBUFS;
> +
> + __sk_add_backlog(sk, skb);
> + sk->sk_backlog.len += skb->truesize;
> + return 0;
> +}
> +
Ouch, this patch breaks bisection: all protocols currently ignore the
-ENOBUFS return value and don't free the skb.
If you split your work across several patches, each resulting kernel
still has to be usable.
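To illustrate the point about callers, here is a minimal userspace sketch, not kernel code. The names model_skb, model_sock, model_add_backlog and model_receive are hypothetical stand-ins, and the limit value is arbitrary. It only shows the contract a bounded add imposes: when the add is refused, the caller still owns the buffer and has to count the drop and free it, otherwise the buffer leaks.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for struct sk_buff and struct sock. */
struct model_skb {
	struct model_skb *next;
	int truesize;
};

struct model_sock {
	struct model_skb *head, *tail;
	int backlog_len;	/* bytes currently queued on the backlog */
	int limit;		/* arbitrary bound chosen for this sketch */
	int drops;		/* models the sk_drops counter */
};

/* Bounded add: refuses the buffer once the backlog has reached the limit. */
static int model_add_backlog(struct model_sock *sk, struct model_skb *skb)
{
	assert(sk && skb);
	if (sk->backlog_len >= sk->limit)
		return -ENOBUFS;

	skb->next = NULL;
	if (!sk->tail)
		sk->head = skb;
	else
		sk->tail->next = skb;
	sk->tail = skb;
	sk->backlog_len += skb->truesize;
	return 0;
}

/*
 * A caller converted to the new contract. On failure the buffer was
 * not queued, so the caller counts the drop and frees it itself.
 */
static int model_receive(struct model_sock *sk, int truesize)
{
	struct model_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return -ENOMEM;
	skb->truesize = truesize;

	if (model_add_backlog(sk, skb)) {
		sk->drops++;
		free(skb);
		return -ENOBUFS;
	}
	return 0;
}
```

A caller that ignored the return value of model_add_backlog() would neither queue nor free the refused buffer, which is the situation described above for the unconverted protocols.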
> static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> {
> return sk->sk_backlog_rcv(sk, skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index fcd397a..fa042bc 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -340,8 +340,12 @@ int sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested)
> rc = sk_backlog_rcv(sk, skb);
>
> mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
> - } else
> - sk_add_backlog(sk, skb);
> + } else if (sk_add_backlog(sk, skb)) {
> + bh_unlock_sock(sk);
> + atomic_inc(&sk->sk_drops);
> + goto discard_and_relse;
> + }
> +
> bh_unlock_sock(sk);
> out:
> sock_put(sk);
> @@ -1139,6 +1142,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
> sock_lock_init(newsk);
> bh_lock_sock(newsk);
> newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
> + newsk->sk_backlog.len = 0;
>
> atomic_set(&newsk->sk_rmem_alloc, 0);
> /*
> @@ -1542,6 +1546,12 @@ static void __release_sock(struct sock *sk)
>
> bh_lock_sock(sk);
> } while ((skb = sk->sk_backlog.head) != NULL);
> +
> + /*
> + * Doing the zeroing here guarantee we can not loop forever
> + * while a wild producer attempts to flood us.
> + */
> + sk->sk_backlog.len = 0;
> }
>
> /**
> @@ -1874,6 +1884,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
> sk->sk_allocation = GFP_KERNEL;
> sk->sk_rcvbuf = sysctl_rmem_default;
> sk->sk_sndbuf = sysctl_wmem_default;
> + sk->sk_backlog.limit = sk->sk_rcvbuf >> 1;
Didn't we agree to use sk->sk_rcvbuf << 1 in the previous round?
> sk->sk_state = TCP_CLOSE;
> sk_set_socket(sk, sock);
>