Message-ID: <1337093776.8512.1089.camel@edumazet-glaptop>
Date: Tue, 15 May 2012 16:56:16 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Kieran Mansley <kmansley@...arflare.com>
Cc: netdev@...r.kernel.org
Subject: Re: TCPBacklogDrops during aggressive bursts of traffic
On Tue, 2012-05-15 at 15:38 +0100, Kieran Mansley wrote:
> I've been investigating an issue with TCPBacklogDrops being reported
> (and relatively poor performance as a result). The problem is most
> easily observed on slightly older kernels (e.g. 3.0.13) but is still
> present in 3.3.6, although harder to reproduce. I've also seen it in
> 2.6 series kernels, so it's not a recent issue.
>
> The problem occurs at the receiver when a TCP sender with a large
> congestion window is sending at a high rate and the receiving
> application has blocked in a recv() or similar call. During the stream,
> ACKs are being returned to the sender, keeping the receive window open
> and so allowing it to carry on sending. The local socket receive buffer
> gets dynamically increased, and the advertised receive window increases
> similarly.
>
> [As an aside, it appears as though the total bytes that the receiver
> commits to receiving - i.e. the point at which it stops advertising new
> sequence space - is around double the receive socket buffer. I'm
> guessing it is committing to receiving the current socket buffer
> (perhaps as there is a pending recv() it knows it will be able to
> immediately empty this) and the next one, but I've not looked into this
> in detail]
>
> As the socket buffer is approaching full the kernel decides to satisfy
> the recv() call and wake the application. It will have to copy the data
> to application address space etc. At this point there is a switch in
> tcp_v4_rcv():
>
> http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726
>
> Before this point, "if (!sock_owned_by_user(sk))" will evaluate to
> true, but once it has decided to wake the application I think it will
> evaluate to false and it will drop through to:
>
> 1739 else if (unlikely(sk_add_backlog(sk, skb))) {
> 1740 bh_unlock_sock(sk);
> 1741 NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
> 1742 goto discard_and_relse;
> 1743 }
>
> In sk_add_backlog() there is a test to see if the socket's receive
> buffer is full, and if so the kernel drops the packet, reporting it
> through netstat as TCPBacklogDrop. This is despite there being
> potentially megabytes of unused advertised receive window space at this
> point.
>
> Very shortly afterwards the socket buffer will be empty again (as its
> contents will have been transferred to the user) so this is essentially
> a race and depends on a fast sender to demonstrate it. It shows up as
> an acute period of drops that are quickly retransmitted and then
> accepted.
>
> There are two ways of thinking about this problem: either the receiver
> should be more conservative about the receive window it advertises
> (limiting it to the available receive socket buffer size); or the
> receiver should be more generous with what it will accept on to the
> backlog (matching it to the advertised receive window). It is the
> discrepancy between advertised receive window and what can be put on the
> backlog that is the root of the problem. I would be tempted by the
> latter and say that as the backlog is likely to soon make it into the
> receive buffer, it should be allowed to contain a full receive buffer of
> bytes on top of what is currently being removed from the receive buffer
> into the application.
>
> It is harder to reproduce on recent kernels because the pending recv()
> call gets satisfied very close to the start of a burst, and at this time
> the receive buffer will be mostly empty and so it is less likely that
> any packets in flight will overflow the backlog. On earlier kernels it
> is easier to reproduce because the pending recv() call didn't return
> until the socket's receive buffer was nearly full, and so it would only
> take a few extra packets to overflow the backlog.
>
> I have a packet capture to illustrate the problem (taken on 3.0.13) if
> that would be of help. As I can easily reproduce it I'm also happy to
> make changes and test to see if they improve matters.
Please try latest kernels; this is probably 'fixed'.
What network driver are you using?