Message-ID: <XNM1$2$0$3$$2$8$4$A$2004314U46640bd2@hitachi.com>
Date: Mon, 4 Jun 2007 21:55:55 +0900
From: noboru.obata.ar@...achi.com
To: <netdev@...r.kernel.org>
Subject: [RFC] Failover-friendly TCP retransmission
Hi all,
I would like to hear comments on how TCP retransmission can be
done better on failover-capable network devices, such as an
active-backup bonding device.
Premise
=======
Please note first that I want to address physical failures handled
by failover-capable network devices, which are becoming increasingly
important as Xen-based VM systems gain popularity.
Eliminating single points of failure (physical devices) is vital on
such VM systems.
The failover discussed here is not meant to address overloaded or
congested networks, which should be solved separately.
Background
==========
When designing a TCP/IP based network system on failover-capable
network devices, people want to set timeouts hierarchically in
three layers: network device layer, TCP layer, and application
layer (bottom-up order), such that:
1. Network device layer detects a failure first and switches to a
   backup device (say, in 20sec).
2. TCP layer timeout & retransmission comes next, _hopefully_
   before the application layer timeout.
3. Application layer detects a network failure last (by, say,
   30sec timeout) and may trigger a system-level failover.
It should be noted that the timeouts for #1 and #2 are handled
independently and there is no relationship between them.
Also note that the actual timeout settings (20sec or 30sec in
this example) are often determined by system requirements, so
setting them to certain "safe values" (if any exist) is usually
not possible.
Problem
=======
If TCP retransmission misses the time frame between events #1 and
#3 in Background above (between 20 and 30sec after the network
failure), the failure triggers a system-level failover where a
network-device-level failover should have been enough.
The problem in this hierarchical timeout scheme is that the TCP
layer does not guarantee that the next retransmission occurs
within a certain period of time. In the above example, people
expect TCP to retransmit a packet between 20 and 30sec after the
network failure, but that may not happen.
Starting from RTO=0.5sec, for example, the RTO doubles after each
timeout, so retransmissions will occur at times 0.5, 1.5, 3.5,
7.5, 15.5, and 31.5, as indicated by 'o' in the following
diagram, and miss the time frame between time 20 and 30.
time:         0         10        20        30sec
              |         |         |         |
App. layer    |---------+---------+---------X  ==> system failover
TCP layer     oo-o---o--+----o----+---------+o <== expects retrans. b/w 20~30
Netdev layer  |---------+---------X            ==> network failover
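
The timeline above can be reproduced with a short sketch. It assumes
a simplified model in which the RTO simply doubles after every
timeout and nothing else (RTT samples, clock granularity) affects it.
The function name and the 35sec horizon are only for illustration.

```python
def retransmit_times(initial_rto, horizon):
    """Return retransmission times (sec since the first send).

    Simplified model (an assumption, not the kernel's code):
    the RTO doubles after each timeout, with no upper bound.
    """
    times = []
    t, rto = 0.0, initial_rto
    while True:
        t += rto              # next timer expiry
        if t > horizon:
            break
        times.append(t)
        rto *= 2              # exponential backoff
    return times


times = retransmit_times(0.5, 35.0)
print(times)                                      # [0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
print([t for t in times if 20.0 <= t <= 30.0])    # []
```

The second line of output is empty: no retransmission falls in the
window between the network failover (20sec) and the application
timeout (30sec).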
Solution
========
It seems reasonable to me to solve this problem by capping
TCP_RTO_MAX, i.e., making TCP_RTO_MAX a sysctl variable and
setting it to a small number.
In this example, setting it to at most (10 - RTT)[sec] should
work: consecutive retransmissions are then never more than 10sec
apart, so at least one takes place between time 20 and 30.
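
As a rough check of this claim, here is a sketch that searches for
the first retransmission inside a given window, with the RTO clamped
to a cap. It uses the same simplified doubling model as the example
in Problem. The RTT of 0.2sec and the function name are assumptions
for illustration, and 120sec stands in for an uncapped setup (it is
the current compile-time TCP_RTO_MAX in Linux).

```python
def retransmit_in_window(initial_rto, rto_max, start, end):
    """Return the first retransmission time in [start, end], or None.

    Simplified model (an assumption): the RTO doubles after each
    timeout and is clamped to rto_max.
    """
    t, rto = 0.0, initial_rto
    while t <= end:
        t += rto                      # next timer expiry
        if start <= t <= end:
            return t
        rto = min(rto * 2, rto_max)   # capped exponential backoff
    return None


RTT = 0.2  # hypothetical round-trip time in seconds

# With a 120sec cap the 20..30sec window is missed.
print(retransmit_in_window(0.5, 120.0, 20.0, 30.0))        # None

# With the cap at (10 - RTT) a retransmission lands in the window.
print(retransmit_in_window(0.5, 10.0 - RTT, 20.0, 30.0))
```

With the cap in place, the step after 15.5sec is limited to 9.8sec
instead of 16sec, which puts the next retransmission inside the
window.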
My rationale follows.
* This solution is simple and therefore less error-prone.
* The solution does not violate RFC 2988 on the maximum RTO value,
  because RFC 2988's requirement on the maximum RTO value, "at
  least 60 seconds" (in (2.5)), is OPTIONAL.
* The solution adds a system-wide setting, which is preferable
  to a per-socket setting (by, say, setsockopt or something),
  because all applications benefit from the solution.
Before posting patches, I would like to hear comments here.
Any comments or suggestions, to make TCP retransmission work
better on failover-capable network devices, are welcome.
Regards,
--
OBATA Noboru (noboru.obata.ar@...achi.com)