netdev - Fw: [Bug 94991] New: TCP bug creates additional RTO in very specific condition

Message-ID: <20150317083333.65b7af40@urahara>
Date:	Tue, 17 Mar 2015 08:33:33 -0700
From:	Stephen Hemminger <stephen@...workplumber.org>
To:	netdev@...r.kernel.org
Subject: Fw: [Bug 94991] New: TCP bug creates additional RTO in very
 specific condition



Begin forwarded message:

Date: Tue, 17 Mar 2015 13:20:34 +0000
From: "bugzilla-daemon@...zilla.kernel.org" <bugzilla-daemon@...zilla.kernel.org>
To: "shemminger@...ux-foundation.org" <shemminger@...ux-foundation.org>
Subject: [Bug 94991] New: TCP bug creates additional RTO in very specific condition


https://bugzilla.kernel.org/show_bug.cgi?id=94991

            Bug ID: 94991
           Summary: TCP bug creates additional RTO in very specific
                    condition
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.32-504.3.3.el6.x86_64
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
          Assignee: shemminger@...ux-foundation.org
          Reporter: matiasb@...il.com
        Regression: No

Created attachment 170931
  --> https://bugzilla.kernel.org/attachment.cgi?id=170931&action=edit
server tcpdump showing the bug

Hi all,

We found unexpected behavior in the applications we are using that appears to
be a bug in the TCP implementation. Following the tcpdumps, we detected that a
second, unnecessary retransmission timeout (RTO) occurs after a first valid RTO.
This occurs only in a very particular situation: when there are 2 packets
still unacknowledged at the moment the first retransmission occurs (and also in
the context of our application, which operates in a very low latency network
and whose only traffic is receiving a single request message and sending a
~12KB response). This situation occurs very often in our context, and because
the application operates at very low latency, an RTO severely impacts
performance.
More details of the application communication, and tcpdumps with the bug
explanation, are copied below.

Is this a proper bug, or are we missing something about expected TCP
behavior? Also, is there a known way to avoid this unexpected
behaviour?

Thank you very much,
Regards,
Matias

Environment
OS: SLC 6
 $ uname -a
Linux <pcName> 2.6.32-504.3.3.el6.x86_64 #1 SMP Wed Dec 17 09:22:39 CET 2014
x86_64 x86_64 x86_64 GNU/Linux

TCP configuration (the full configuration from /proc/sys/net/ipv4/* is pasted
at the bottom):
tcp_base_mss=512
tcp_congestion_control=cubic
tcp_sack=1


Application communication description: We are using 2 kinds of application that
communicate using TCP: a client and around 200 server applications. The client
sends a request message of 188B (including headers) to all servers and waits
for a response from all of them. The client does not send any other message
until the responses from all servers have been received. Each server, upon
receiving the request, sends a 12KB response (which is obviously split into
several TCP packets). Because there are 200 servers responding at almost the
same moment (with a total of ~2.4MB), some buffers in the network may overflow,
generating drops and retransmissions.
When there are no drops (thanks to a control application that limits the
requests sent), the latency to receive all messages from all servers is ~20ms.
If one or more TCP segments are dropped, the latency goes to ~200ms (this is
because of the minimum RTO of 200ms hardcoded in the kernel). Even though this
is 10 times higher, it is more or less acceptable for the application. The bug
creates a second consecutive retransmission timeout, so the latency when this
occurs goes to 600ms (200ms of the first RTO + 400ms of the second, unexpected
RTO), which is beyond what the application can handle (30 times higher than
the ~20ms baseline).
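
The 200ms and 600ms figures above follow from the retransmission timer doubling
after each timeout. A minimal sketch of that arithmetic, assuming the 200ms
minimum RTO mentioned above and a plain doubling backoff (the function name is
illustrative only, it is not a kernel symbol):

```python
def cumulative_rto_delay_ms(timeouts, min_rto_ms=200):
    """Total time spent waiting across `timeouts` consecutive RTOs.

    Assumption: the first timeout fires after min_rto_ms and every
    following timeout waits twice as long as the previous one.
    """
    total = 0
    rto = min_rto_ms
    for _ in range(timeouts):
        total += rto
        rto *= 2  # exponential backoff after every timeout
    return total


if __name__ == "__main__":
    for n in (1, 2):
        print(n, "RTO(s):", cumulative_rto_delay_ms(n), "ms")
```

With one timeout the added delay is 200ms; with two consecutive timeouts it is
200 + 400 = 600ms.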

Bug detailed description:
The unexpected behavior appears in the server applications when TCP needs to
retransmit dropped packets. It appears in all server applications, at quite a
high frequency.
The bug appears only when the server detects a drop (by an RTO after 200ms) and
at that moment it is still waiting for the ACK of exactly 2 packets. In that
case, 200ms after sending all packets, the RTO triggers the retransmission of
the first packet and the ACK for that packet is received, but the second
packet is not retransmitted at that moment. After another 400ms another RTO is
triggered, and only then is the second packet retransmitted and ACKed. To our
understanding this second retransmission timeout should not occur. The expected
behaviour is that the second packet is retransmitted right after receiving the
ACK for the first retransmitted packet.
Also, this unexpected second RTO occurs only if there are exactly 2 pending
packets at the moment of the first RTO. If there is one packet to retransmit,
or more than 2, the behaviour is as expected: all packets are retransmitted and
ACKed after the first RTO (there is no second RTO).

Below is the explanation, followed by a section of a tcpdump recorded on one of
the server applications showing the unexpected behaviour.

Frame #170: request is received (at T 0).
Frames #171-#173: response is sent, split into several TCP packets, from
seq=204273 to seq=216289.
Frames #171 and #172 are each recorded by tcpdump as a single packet but are
probably several real packets, as the MSS is 1460 bytes and they show a length
higher than that (this is probably because the NIC supports segmentation
offload: the host's TCP stack hands the NIC one large segment and the NIC
splits it into MSS-sized packets, so tcpdump, which captures before the split,
sees a single segment of higher length).
Frames #174-#177: ACKs for some of the sent packets are received. Last seq
acknowledged is seq=442797 (there are still 1796 bytes pending, which is 2
TCP packets).
Frame #178: At T 207ms a packet is retransmitted. This is the first
retransmission, which makes total sense as the ACKs for 2 packets were not
received after 200ms. Because of the RTO, the TCP internal state should be
updated to double the RTO (so it should be 400ms now). Also, the CWND should
be reduced to 1.
Frame #179: ACK for the retransmitted packet is received.
The internal state of TCP should be updated to double the CWND because of
slow start (so it should now be set to 2). The RTO is not updated because the
RTO calculation is based only on packets that were not retransmitted.
At this point we would expect the pending packet to be retransmitted, but this
does not occur. After receiving an ACK the CWND should allow more packets to be
sent, but no data is sent by the server (and consequently it receives
nothing).
Frame #180: at T 613ms (approximately 400ms after the last received ACK) the
last packet is retransmitted.
This is what creates a 600ms latency, which is 30 times the expected latency
and 3 times what it would be if the bug were not present.
Frame #181: ACK for the last packet is received.
Frame #182: a new request is received.
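
The behaviour we expected in the walkthrough above can be written down as a toy
model. This is only a sketch of our expectation, not of the kernel code: it
assumes the window drops to 1 segment at the RTO, grows by one segment per
ACKed segment (slow start), and that ACKs come back after a sub-millisecond
round trip (the 0.2ms default is an arbitrary assumption, as are all names).

```python
def expected_retransmit_times_ms(pending, first_rto_ms=200.0, rtt_ms=0.2):
    """Times (ms) at which each pending packet would be retransmitted
    if every ACK immediately released further retransmissions."""
    times = []
    now = first_rto_ms   # the first timeout fires
    cwnd = 1             # window collapses to one segment after an RTO
    while len(times) < pending:
        burst = min(cwnd, pending - len(times))
        times.extend([now] * burst)
        now += rtt_ms    # ACKs for this burst arrive one round trip later
        cwnd += burst    # slow start: one extra segment per ACKed segment
    return times


if __name__ == "__main__":
    expected = expected_retransmit_times_ms(2, first_rto_ms=207.0)
    observed = [207.0, 613.0]  # frames #178 and #180 in the walkthrough
    print("expected:", expected)
    print("observed:", observed)
    print("extra delay (ms):", observed[-1] - expected[-1])
```

Under these assumptions both pending packets would be retransmitted within the
same millisecond, instead of roughly 400ms apart.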


No.  Time      Source  Destination  Protocol  RTO          Length  Info
170  *REF*     DCM     ROS          TCP                    118     47997 > 41418 [PSH, ACK] Seq=1089 Ack=204273 Win=10757 Len=64
171  0.000073  ROS     DCM          TCP                    5894    41418 > 47997 [ACK] Seq=204273 Ack=1153 Win=58 Len=5840
172  0.000080  ROS     DCM          TCP                    5894    41418 > 47997 [ACK] Seq=210113 Ack=1153 Win=58 Len=5840
173  0.000083  ROS     DCM          TCP                    390     41418 > 47997 [PSH, ACK] Seq=215953 Ack=1153 Win=58 Len=336 [Packet size limited during capture]
174  0.003901  DCM     ROS          TCP                    60      47997 > 41418 [ACK] Seq=1153 Ack=207193 Win=10757 Len=0
175  0.004270  DCM     ROS          TCP                    60      47997 > 41418 [ACK] Seq=1153 Ack=211573 Win=10768 Len=0
176  0.004649  DCM     ROS          TCP                    60      47997 > 41418 [ACK] Seq=1153 Ack=213033 Win=10768 Len=0
177  0.004835  DCM     ROS          TCP                    66      [TCP Dup ACK 176#1] 47997 > 41418 [ACK] Seq=1153 Ack=213033 Win=10768 Len=0 SLE=214493 SRE=215953
178  0.207472  ROS     DCM          TCP       0.207389000  1514    [TCP Retransmission] 41418 > 47997 [ACK] Seq=213033 Ack=1153 Win=58 Len=1460
179  0.207609  DCM     ROS          TCP                    60      47997 > 41418 [ACK] Seq=1153 Ack=215953 Win=10768 Len=0
180  0.613472  ROS     DCM          TCP       0.613389000  390     [TCP Retransmission] 41418 > 47997 [PSH, ACK] Seq=215953 Ack=1153 Win=58 Len=336 [Packet size limited during capture]
181  0.613622  DCM     ROS          TCP                    60      47997 > 41418 [ACK] Seq=1153 Ack=216289 Win=10768 Len=0
182  0.615189  DCM     ROS          TCP                    118     47997 > 41418 [PSH, ACK] Seq=1153 Ack=216289 Win=10768 Len=64
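
The byte accounting behind the capture above can be checked with a few lines of
arithmetic. This is a minimal sketch using only the sequence, ACK and SACK
(SLE/SRE) numbers shown in the capture and the 1460-byte MSS; the helper names
are made up for illustration, and SACK blocks are assumed to lie above the
cumulative ACK without overlapping.

```python
MSS = 1460  # bytes, as stated in the report


def segments(length, mss=MSS):
    """Split a payload length into MSS-sized segment lengths."""
    full, rest = divmod(length, mss)
    return [mss] * full + ([rest] if rest else [])


def unacked_bytes(snd_nxt, cum_ack, sack_blocks=()):
    """Bytes sent but neither cumulatively ACKed nor SACKed."""
    sacked = sum(right - left for left, right in sack_blocks)
    return (snd_nxt - cum_ack) - sacked


if __name__ == "__main__":
    # Frames #171/#172: Len=5840 is four MSS-sized segments.
    print(segments(5840))
    # Whole response: seq 204273 .. 216289.
    print(len(segments(216289 - 204273)), "segments in the response")
    # After frame #177: Ack=213033 with SACK block 214493-215953.
    print(unacked_bytes(216289, 213033, [(214493, 215953)]))
    # After frame #179: Ack=215953, only the last short segment remains.
    print(unacked_bytes(216289, 215953))
```

After frame #177 this leaves 1796 bytes outstanding (one 1460-byte segment plus
the final 336-byte segment), and after frame #179 only the 336-byte segment.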






Full TCP configuration, collected with:

for f in /proc/sys/net/ipv4/*; do
    confName=$(basename "$f")
    echo -n "$confName=" >> /logs/tpu_TCP_config.txt
    cat "$f" >> /logs/tpu_TCP_config.txt
done
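
For anyone who wants to collect the same listing for comparison, a rough Python
equivalent of the loop above is sketched here. The function name is
illustrative; it assumes a Linux /proc layout, skips sub-directories, and skips
entries that cannot be read.

```python
import os


def dump_settings(directory="/proc/sys/net/ipv4"):
    """Return 'name=value' lines for every readable regular file."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories such as conf/ and neigh/
        try:
            with open(path) as handle:
                value = handle.read().strip()
        except OSError:
            continue  # some entries are write-only or need privileges
        lines.append(f"{name}={value}")
    return lines


if __name__ == "__main__":
    # Only attempt the dump where the directory exists (i.e. on Linux).
    if os.path.isdir("/proc/sys/net/ipv4"):
        print("\n".join(dump_settings()))
```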

cipso_cache_bucket_size=10
cipso_cache_enable=1      
cipso_rbm_optfmt=0        
cipso_rbm_strictvalid=1   
icmp_echo_ignore_all=0
icmp_echo_ignore_broadcasts=1
icmp_errors_use_inbound_ifaddr=0
icmp_ignore_bogus_error_responses=1
icmp_ratelimit=1000                
icmp_ratemask=6168                 
igmp_max_memberships=20            
igmp_max_msf=10                    
inet_peer_gc_maxtime=120           
inet_peer_gc_mintime=10            
inet_peer_maxttl=600               
inet_peer_minttl=120               
inet_peer_threshold=65664          
ip_default_ttl=64                  
ip_dynaddr=0                       
ip_forward=0                       
ipfrag_high_thresh=262144          
ipfrag_low_thresh=196608           
ipfrag_max_dist=64                 
ipfrag_secret_interval=600         
ipfrag_time=30                     
ip_local_port_range=32768       61000
ip_local_reserved_ports=             
ip_nonlocal_bind=0                   
ip_no_pmtu_disc=0                    
ping_group_range=1        0    
rt_cache_rebuild_count=4       
tcp_abc=0                            
tcp_abort_on_overflow=0              
tcp_adv_win_scale=2                  
tcp_allowed_congestion_control=cubic reno
tcp_app_win=31                           
tcp_available_congestion_control=cubic reno
tcp_base_mss=512                           
tcp_challenge_ack_limit=100                
tcp_congestion_control=cubic               
tcp_dma_copybreak=262144                   
tcp_dsack=1                                
tcp_ecn=2                                  
tcp_fack=1                                 
tcp_fin_timeout=60                         
tcp_frto=2                                 
tcp_frto_response=0                        
tcp_keepalive_intvl=75                     
tcp_keepalive_probes=9                     
tcp_keepalive_time=7200                    
tcp_limit_output_bytes=131072              
tcp_low_latency=0                          
tcp_max_orphans=262144                     
tcp_max_ssthresh=0                         
tcp_max_syn_backlog=2048                   
tcp_max_tw_buckets=262144                  
tcp_mem=2316864 3089152 4633728            
tcp_min_tso_segs=2                         
tcp_moderate_rcvbuf=1                      
tcp_mtu_probing=0                          
tcp_no_metrics_save=0                      
tcp_orphan_retries=0                       
tcp_reordering=3                           
tcp_retrans_collapse=1                     
tcp_retries1=3                             
tcp_retries2=15                            
tcp_rfc1337=0                              
tcp_rmem=4096   87380   4194304
tcp_sack=1
tcp_slow_start_after_idle=0
tcp_stdurg=0
tcp_synack_retries=5
tcp_syncookies=1
tcp_syn_retries=5
tcp_thin_dupack=0
tcp_thin_linear_timeouts=0
tcp_timestamps=0
tcp_tso_win_divisor=3
tcp_tw_recycle=0
tcp_tw_reuse=0
tcp_window_scaling=1
tcp_wmem=4096   65536   4194304
tcp_workaround_signed_windows=0
udp_mem=2316864 3089152 4633728
udp_rmem_min=4096
udp_wmem_min=4096
xfrm4_gc_thresh=4194304
