lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 03 Oct 2018 11:27:56 +0100
From:   David Howells <dhowells@...hat.com>
To:     Realtek linux nic maintainers <nic_swsd@...ltek.com>
cc:     dhowells@...hat.com, netdev@...r.kernel.org
Subject: r8169 tx batching(?) causing performance problems

Hi,

Can someone help me figure out a performance issue that seems to be caused by
an RTL8168g/8111g NIC that seems to be batching up transmissions - or, at
least, not starting immediately that it's given something to transmit?

The setup that I'm dealing with is an AFS filesystem client (the test machine
with the troublesome NIC) and an AFS server machine connected by GigE through
a switch by 2m cables.  The network is pretty much uncontended and the server
is only serving files to the test machine.

AFS uses the RxRPC protocol which runs over UDP, one RxRPC packet per UDP
packet.  RxRPC has two packet types that are of relevance to this issue: DATA
and ACK.  Linux contains an implementation of RxRPC as the AF_RXRPC socket
family and an AFS filesystem that can be found in fs/afs/.

The symptoms I'm seeing are that whilst the client is doing a multi-megabyte
direct-I/O reads over AFS, the server is occasionally stalling in its sending
of DATA packets because the ACK packets being sent by the client are appearing
in batches on the server.

I stuck some tracepoints into the r8169 driver to investigate the issue (see
attached patch).  Here's an excerpt from the trace (note that I didn't enable
the rx tracepoint):

 dd-3179 [000] .N.3    37.095116: net_rtl8169_tx: enp3s0 p=178-179 skb=ba987d36
 nd-2961 [001] d.h1    37.095250: net_rtl8169_interrupt: enp3s0 st=81
 nd-2961 [001] ..s1    37.095253: net_rtl8169_poll: enp3s0 st=81
 dd-3179 [000] ...3    37.095286: net_rtl8169_tx: enp3s0 p=178-180 skb=44d352cf
 dd-3179 [000] .N.3    37.095307: net_rtl8169_tx: enp3s0 p=178-181 skb=53ac88f1
 dd-3179 [000] .N.3    37.095315: net_rtl8169_tx: enp3s0 p=178-182 skb=4f3eedd9
 dd-3179 [000] .N.3    37.095328: net_rtl8169_tx: enp3s0 p=178-183 skb=38c81784
 dd-3179 [000] .N.3    37.095338: net_rtl8169_tx: enp3s0 p=178-184 skb=1f0a8fc3
 dd-3179 [000] .N.3    37.095345: net_rtl8169_tx: enp3s0 p=178-185 skb=281a484d
 dd-3179 [000] .N.3    37.095362: net_rtl8169_tx: enp3s0 p=178-186 skb=13d8a01a
 dd-3179 [000] .N.3    37.095370: net_rtl8169_tx: enp3s0 p=178-187 skb=68fe3d70
 dd-3179 [000] .N.3    37.095382: net_rtl8169_tx: enp3s0 p=178-188 skb=3cf64dd8
 dd-3179 [000] .N.3    37.095390: net_rtl8169_tx: enp3s0 p=178-189 skb=da35591a
 dd-3179 [000] .N.3    37.095403: net_rtl8169_tx: enp3s0 p=178-190 skb=3013974a
 dd-3179 [000] .N.3    37.095410: net_rtl8169_tx: enp3s0 p=178-191 skb=7ecddd8e
 dd-3179 [000] .N.3    37.095427: net_rtl8169_tx: enp3s0 p=178-192 skb=aa30c686
 nd-2961 [001] d.h1    37.095433: net_rtl8169_interrupt: enp3s0 st=85
 nd-2961 [001] ..s1    37.095434: net_rtl8169_poll: enp3s0 st=85
 dd-3179 [000] .N.3    37.095435: net_rtl8169_tx: enp3s0 p=178-193 skb=006b0947
 nd-2961 [001] ..s1    37.095439: net_rtl8169_tx_done: enp3s0 p=178 skb=ba987d36
 nd-2961 [001] ..s1    37.095440: net_rtl8169_tx_done: enp3s0 p=179 skb=44d352cf
 nd-2961 [001] ..s1    37.095441: net_rtl8169_tx_done: enp3s0 p=180 skb=53ac88f1
 nd-2961 [001] ..s1    37.095441: net_rtl8169_tx_done: enp3s0 p=181 skb=4f3eedd9
 nd-2961 [001] ..s1    37.095442: net_rtl8169_tx_done: enp3s0 p=182 skb=38c81784
 nd-2961 [001] ..s1    37.095443: net_rtl8169_tx_done: enp3s0 p=183 skb=1f0a8fc3
 nd-2961 [001] ..s1    37.095443: net_rtl8169_tx_done: enp3s0 p=184 skb=281a484d
 nd-2961 [001] ..s1    37.095444: net_rtl8169_tx_done: enp3s0 p=185 skb=13d8a01a
 nd-2961 [001] ..s1    37.095445: net_rtl8169_tx_done: enp3s0 p=186 skb=68fe3d70
 nd-2961 [001] ..s1    37.095445: net_rtl8169_tx_done: enp3s0 p=187 skb=3cf64dd8
 nd-2961 [001] ..s1    37.095446: net_rtl8169_tx_done: enp3s0 p=188 skb=da35591a
 nd-2961 [001] ..s1    37.095446: net_rtl8169_tx_done: enp3s0 p=189 skb=3013974a
 nd-2961 [001] ..s1    37.095447: net_rtl8169_tx_done: enp3s0 p=190 skb=7ecddd8e

As can be seen, there are a number of packets being transmitted by AF_RXRPC
from the dd command.  Each of these is an ACK packet.  ACK packets are around
850-900 bits in size on the wire according to wireshark.

The first Tx interrupt (status 0x85: TxOk is 0x04) occurs 317uS after the
packet 178 is added to an empty buffer and is deleted 6uS after that.  Packet
190 is deleted 37uS after being added.

What surprises me is that it would seem that an ACK packet should take about
~1uS to transmit and the average interval between ACK packets being added to
the queue is ~12uS, but the ring is not obviously making progress in being
transmitted.

Now, I understand that the dirty_tx and cur_tx ring indices are private to the
driver and not seen by the device, but I added an extra bit of code to count
up the number of descriptors with DescOwn still set, and it always appears to
match the number of packets in the buffer.

So I'm guessing that the NIC does one of a number of things:

 (1) It delays starting transmission until it's got a large enough batch.

 (2) It delays starting transmission until some time has passed since being
     poked.

 (3) It takes time to get the transmitter going for some reason.

 (4) It's batching the update of the Tx ring descriptors and does this right
     before interrupting the CPU.

 (5) Reading/writing the Tx descriptor ring from the device is slow.

But again note that tshark on the server shows the ACKs arriving as a batch -
I've also instrumented the NIC driver for that machine and observed the NIC
apparently receiving ACK packets in batches.

Also, what does the TxDescUnavail status flag signify?  The NIC is setting it,
but nothing in the driver takes any notice of it or clears it.

Thanks,
David
---
commit 4ce09fa309d8fb4b1b910091607c2290c45541f1
Author: David Howells <dhowells@...hat.com>
Date:   Tue Oct 2 21:14:04 2018 +0100

    Add tracepoints for Realtek r8169

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index ab30aaeac6d3..8d3a86a674b7 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -31,6 +31,9 @@
 #include <linux/ipv6.h>
 #include <net/ip6_checksum.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/net_rtl8169.h>
+
 #define MODULENAME "r8169"
 
 #define FIRMWARE_8168D_1	"rtl_nic/rtl8168d-1.fw"
@@ -6197,6 +6200,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	RTL_W8(tp, TxPoll, NPQ);
 
 	mmiowb();
+	trace_net_rtl8169_tx(tp->dev, tp->dirty_tx, tp->cur_tx, skb);
 
 	if (!TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS)) {
 		/* Avoid wrongly optimistic queue wake-up: rtl_tx thread must
@@ -6302,6 +6306,7 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		rtl8169_unmap_tx_skb(tp_to_dev(tp), tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
+			trace_net_rtl8169_tx_done(dev, dirty_tx, tx_skb->skb);
 			u64_stats_update_begin(&tp->tx_stats.syncp);
 			tp->tx_stats.packets++;
 			tp->tx_stats.bytes += tx_skb->skb->len;
@@ -6453,6 +6458,7 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
 				dev->stats.multicast++;
 
 			napi_gro_receive(&tp->napi, skb);
+			trace_net_rtl8169_napi_rx(dev, skb);
 
 			u64_stats_update_begin(&tp->rx_stats.syncp);
 			tp->rx_stats.packets++;
@@ -6466,7 +6472,6 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
 
 	count = cur_rx - tp->cur_rx;
 	tp->cur_rx = cur_rx;
-
 	return count;
 }
 
@@ -6478,6 +6483,8 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 	if (status == 0xffff || !(status & (RTL_EVENT_NAPI | tp->event_slow)))
 		return IRQ_NONE;
 
+	trace_net_rtl8169_interrupt(tp->napi.dev, status);
+
 	rtl_irq_disable(tp);
 	napi_schedule_irqoff(&tp->napi);
 
@@ -6559,6 +6566,7 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 
 	status = rtl_get_events(tp);
 	rtl_ack_events(tp, status & ~tp->event_slow);
+	trace_net_rtl8169_poll(dev, status);
 
 	if (status & RTL_EVENT_NAPI_RX)
 		work_done = rtl_rx(dev, tp, (u32) budget);
diff --git a/include/trace/events/net_rtl8169.h b/include/trace/events/net_rtl8169.h
new file mode 100644
index 000000000000..92049078b3fa
--- /dev/null
+++ b/include/trace/events/net_rtl8169.h
@@ -0,0 +1,125 @@
+/* Realtek RTL8169 tracepoints
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@...hat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM net_rtl8169
+
+#if !defined(_TRACE_NET_RTL8169_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NET_RTL8169_H
+
+#include <linux/tracepoint.h>
+#include <linux/errqueue.h>
+
+
+TRACE_EVENT(net_rtl8169_interrupt,
+	    TP_PROTO(struct net_device *netdev, u16 status),
+
+	    TP_ARGS(netdev, status),
+
+	    TP_STRUCT__entry(
+		    __field(u16,			status		)
+		    __array(char,			name, IFNAMSIZ	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->status = status;
+		    memcpy(__entry->name, netdev->name, IFNAMSIZ);
+			   ),
+
+	    TP_printk("%s st=%x", __entry->name, __entry->status)
+	    );
+
+TRACE_EVENT(net_rtl8169_poll,
+	    TP_PROTO(struct net_device *netdev, u16 status),
+
+	    TP_ARGS(netdev, status),
+
+	    TP_STRUCT__entry(
+		    __field(u16,			status		)
+		    __array(char,			name, IFNAMSIZ	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->status = status;
+		    memcpy(__entry->name, netdev->name, IFNAMSIZ);
+			   ),
+
+	    TP_printk("%s st=%x", __entry->name, __entry->status)
+	    );
+
+TRACE_EVENT(net_rtl8169_napi_rx,
+	    TP_PROTO(struct net_device *netdev, struct sk_buff *skb),
+
+	    TP_ARGS(netdev, skb),
+
+	    TP_STRUCT__entry(
+		    __field(struct sk_buff *,		skb		)
+		    __array(char,			name, IFNAMSIZ	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->skb = skb;
+		    memcpy(__entry->name, netdev->name, IFNAMSIZ);
+			   ),
+
+	    TP_printk("%s skb=%p", __entry->name, __entry->skb)
+	    );
+
+TRACE_EVENT(net_rtl8169_tx_done,
+	    TP_PROTO(struct net_device *netdev, unsigned int dirty_tx,
+		     struct sk_buff *skb),
+
+	    TP_ARGS(netdev, dirty_tx, skb),
+
+	    TP_STRUCT__entry(
+		    __field(struct sk_buff *,		skb)
+		    __field(unsigned int,		dirty_tx)
+		    __array(char,			name, IFNAMSIZ	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->skb = skb;
+		    __entry->dirty_tx = dirty_tx;
+		    memcpy(__entry->name, netdev->name, IFNAMSIZ);
+			   ),
+
+	    TP_printk("%s p=%u skb=%p",
+		      __entry->name, __entry->dirty_tx, __entry->skb)
+	    );
+
+TRACE_EVENT(net_rtl8169_tx,
+	    TP_PROTO(struct net_device *netdev, unsigned int dirty_tx,
+		     unsigned int cur_tx, struct sk_buff *skb),
+
+	    TP_ARGS(netdev, dirty_tx, cur_tx, skb),
+
+	    TP_STRUCT__entry(
+		    __field(struct sk_buff *,		skb		)
+		    __field(unsigned int,		dirty_tx	)
+		    __field(unsigned int,		cur_tx		)
+		    __array(char,			name, IFNAMSIZ	)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->skb = skb;
+		    __entry->dirty_tx = dirty_tx;
+		    __entry->cur_tx = cur_tx;
+		    memcpy(__entry->name, netdev->name, IFNAMSIZ);
+			   ),
+
+	    TP_printk("%s p=%u-%u skb=%p",
+		      __entry->name, __entry->dirty_tx, __entry->cur_tx,
+		      __entry->skb)
+	    );
+
+#endif /* _TRACE_NET_RTL8169_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ