Message-ID: <XNM1$2$0$3$$2$8$4$A$2004314U46640bd2@hitachi.com>
Date: Mon, 4 Jun 2007 21:55:55 +0900
From: noboru.obata.ar@...achi.com
To: <netdev@...r.kernel.org>
Subject: [RFC] Failover-friendly TCP retransmission

Hi all,

I would like to hear comments on how TCP retransmission can be done
better on failover-capable network devices, such as an active-backup
bonding device.

Premise
=======

Please note first that I want to address physical failures with
failover-capable network devices, which are becoming increasingly
important as Xen-based VM systems grow popular. Eliminating physical
devices as single points of failure is vital on such VM systems.

The failover discussed here is not meant to address overloaded or
congested networks, which should be solved separately.

Background
==========

When designing a TCP/IP-based network system on failover-capable
network devices, people want to set timeouts hierarchically in three
layers: the network device layer, the TCP layer, and the application
layer (in bottom-up order), such that:

1. The network device layer detects a failure first and switches to a
   backup device (say, in 20sec).

2. The TCP layer's timeout & retransmission comes next, _hopefully_
   before the application layer timeout.

3. The application layer detects a network failure last (by, say, a
   30sec timeout) and may trigger a system-level failover.

It should be noted that the timeouts for #1 and #2 are handled
independently and there is no relationship between them. Also note
that the actual timeout settings (20sec or 30sec in this example) are
often determined by system requirements, so setting them to certain
"safe values" (if any) is usually not possible.

Problem
=======

If TCP retransmission misses the time frame between events #1 and #3
in Background above (between 20 and 30sec after a network failure), a
failure triggers a system-level failover where a network-device-level
failover should have been enough.
The problem in this hierarchical timeout scheme is that the TCP layer
does not guarantee that the next retransmission occurs within a
certain period of time. In the above example, people expect TCP to
retransmit a packet between 20 and 30sec after the network failure,
but that may not happen. Starting from RTO=0.5sec, for example,
retransmissions occur at times 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5, as
indicated by 'o' in the following diagram, and miss the time frame
between time 20 and 30.

    time:      0         10        20        30sec
               |         |         |         |
  App. layer   |---------+---------+---------X   ==> system failover
  TCP layer    oo-o---o--+----o----+---------+o  <== expects retrans. b/w 20~30
  Netdev layer |---------+---------X             ==> network failover

Solution
========

It seems reasonable to me to solve this problem by capping
TCP_RTO_MAX, i.e., making TCP_RTO_MAX a sysctl variable and setting
it to a small value. In this example, setting it to at most
(10 - RTT)[sec] should work, because a retransmission will then take
place between time 20 and 30. My rationale follows.

* This solution is simple and thus less error-prone.

* The solution does not violate RFC 2988 with respect to the maximum
  RTO value, because RFC 2988's requirement on the maximum RTO, "at
  least 60 seconds" (in (2.5)), is OPTIONAL.

* The solution adds a system-wide setting, which is preferable to a
  per-socket setting (via, say, setsockopt or the like), because all
  applications benefit from it.

Before posting patches, I would like to hear comments here. Any
comments or suggestions for making TCP retransmission work better on
failover-capable network devices are welcome.

Regards,

-- 
OBATA Noboru (noboru.obata.ar@...achi.com)