[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1391865273.10160.76.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Sat, 08 Feb 2014 05:14:33 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Thomas Glanzmann <thomas@...nzmann.de>,
John Ogness <john.ogness@...utronix.de>
Cc: Eric Dumazet <edumazet@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
"Nicholas A. Bellinger" <nab@...ux-iscsi.org>,
target-devel <target-devel@...r.kernel.org>,
Linux Network Development <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto
corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB
VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 10:38 +0100, Thomas Glanzmann wrote:
> Hello Eric,
>
> [RESEND: the time it took the VMFS was created was switched between
> on/off so with on it took over 2 minutes with off it took less than 4
> seconds]
>
> [RESEND 2: The throughput graphs were switched as well ;-(]
>
> > * Thomas Glanzmann <thomas@...nzmann.de> [2014-02-07 08:55]:
> > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12
> > > and 15 minutes on 3.14.0-rc2+.
>
> * Nicholas A. Bellinger <nab@...ux-iscsi.org> [2014-02-07 20:30]:
> > Would it be possible to try a couple of different stable kernel
> > versions to help track this down?
>
> I bisected[1] it and found the offending commit f54b311 tcp auto corking
> [2] 'if we have a small send and a previous packet is already in the
> qdisc or device queue, defer until TX completion or we get more data.'
> - Description by David S. Miller
>
> I gathered a pcap with tcp_autocorking on and off.
>
> On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png
>
> Off: - took 4 seconds to create a 500 GB VMFS file system
> sysctl net.ipv4.tcp_autocorking=0
> https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png
> https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png
>
> First graph can be generated by opening bunziping the file, opening it
> in wireshark and select Statistics > IO Grap and change the unit to
> Bytes/Tick. The second graph can be generated by selecting Statistics >
> TCP Stream Graph > Round Trip Time.
>
> You can also see that the round trip time increases by factor 25 at
> least.
>
> I once saw a similar problem with dealyed ACK packets of the
> paravirtulized network driver in xen it caused that the tcp window
> filled up and slowed down the throughput from 30 MB/s to less than 100
> KB/s the symptom was that the login to a Windows desktop took more than
> 10 minutes while it used to be below 30 seconds because the profile of
> the user was loaded slowly from a CIFS server. At that time the culprit
> were also delayed small packets: ACK packets in the CIFS case. However I
> only proofed iSCSI regression so far for tcp auto corking but assume we
> will see many others if we leave it enabled.
>
> I found the problem by doing the following:
> - I compiled kernel by executing the following commands:
> yes '' | make oldconfig
> time make -j 24
> / make modules_install
> / mkinitramfs -o /boot/initrd.img-bisect <version>
>
> - I cleaned the iSCSI configuration after each test by issuing:
> /etc/init.d/target stop
> rm /iscsi?/* /etc/target/*
>
> - I configured iSCSI after each reboot
> cat > lio-v101.conf <<EOF
> set global auto_cd_after_create=false
> /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true
>
> /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10
> /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
>
> saveconfig
> yes
> EOF
> targetcli < lio-v101.conf
> And configured a fresh booted ESXi 5.5.0 1331820 via autodeploy
> to the iSCSI target, configured the portal, rescanned and
> created a 500 GB VMFS 5 filesystem and noticed the time if it
> was longer than 2 minutes it was bad if it was below 10 seconds
> it was good.
> git bisect good/bad
>
> My network config is:
>
> auto bond0
> iface bond0 inet static
> address 10.100.4.62
> netmask 255.255.0.0
> gateway 10.100.0.1
> slaves eth0 eth1
> bond-mode 802.3ad
> bond-miimon 100
>
> auto bond0.101
> iface bond0.101 inet static
> address 10.101.99.4
> netmask 255.255.0.0
>
> auto bond1
> iface bond1 inet static
> address 10.100.5.62
> netmask 255.255.0.0
> slaves eth2 eth3
> bond-mode 802.3ad
> bond-miimon 100
>
> auto bond1.101
> iface bond1.101 inet static
> address 10.101.99.5
> netmask 255.255.0.0
>
> I propose to disable tcp_autocorking by default because it obviously degrades
> iSCSI performance and probably many other protocols. Also the commit mentions
> that applications can explicitly disable auto corking we probably should do
> that for the iSCSI target, but I don't know how. Anyone?
>
> [1] http://pbot.rmdir.de/a65q6MjgV36tZnn5jS-DUQ
> [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f54b311142a92ea2e42598e347b84e1655caf8e3
>
> Cheers,
> Thomas
Hi Thomas, thanks a lot for this very detailed bug report.
I think you are hit by other bug(s), lets try to fix it/them instead of
disabling this feature.
John Ogness started a thread yesterday about TCP_NODELAY being hit by
the TCP Small Queue mechanism, which is the base of TCP auto corking.
Two RFC patches were discussed.
One dealing with the TCP_NODELAY flag that John posted, and I'll adapt
it to the current kernel.
One dealing with a possible race, that I suggested (I doubt this could
trigger at every write, but lets fix it anyway)
Here is the combined patch, could you test it ?
Thanks !
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 10435b3b9d0f..3be16727f058 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk)
if ((1 << sk->sk_state) &
(TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
- tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC);
+ tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+ 0, GFP_ATOMIC);
}
/*
* One tasklet per cpu tries to send more skbs.
@@ -1904,7 +1905,15 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
if (atomic_read(&sk->sk_wmem_alloc) > limit) {
set_bit(TSQ_THROTTLED, &tp->tsq_flags);
- break;
+ /* It is possible TX completion already happened
+ * before we set TSQ_THROTTLED, so we must
+ * test again the condition.
+ * We abuse smp_mb__after_clear_bit() because
+ * there is no smp_mb__after_set_bit() yet
+ */
+ smp_mb__after_clear_bit();
+ if (atomic_read(&sk->sk_wmem_alloc) > limit)
+ break;
}
limit = mss_now;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists