Message-ID: <3ceebc43-6baf-2b4c-af10-70522e97385e@googlemail.com>
Date: Mon, 22 Jan 2018 23:55:58 +0100
From: Oliver Freyermuth <o.freyermuth@...glemail.com>
To: Francois Romieu <romieu@...zoreil.com>
Cc: netdev@...r.kernel.org
Subject: Re: Memory corruption with r8169 across several device revisions and
kernels
Dear Francois and other r8169 experts,
On 22.01.2018 at 01:09, Francois Romieu wrote:
> Are you able to retrieve the layout ? That is, does it appear to match:
>
> - r8169 hardware stats DMA buffer ?
> TxOk, RxOk, TxErr, RxErr, ...
>
> - rtnl_link_stats ?
> rx_packets, tx_packets, rx_bytes, tx_bytes, ...
>
> or something else ?
>
It took me a while; the bug does not seem to occur every time, so there may also be a race involved.
Reproducing it on an Ubuntu 17.10 system, I found the following:
Address in virtual memory || value
0x7f87bb4c6000 || 0x00000217
0x7f87bb4c6008 || 0x000003ab
0x7f87bb4c6018 || 0x00000000
0x7f87bb4c6028 || 0x00000279
0x7f87bb4c6030 || 0x000000e1
0x7f87bb4c6038 || 0x00000051
At almost the same time, I found the following counters for the device in /proc/self/net/dev:
                 decimal || hex
RX bytes:         870820 || 0x000d49a4
   packets:          945 || 0x000003b1
   errs:               0 ||
   drop:               0 ||
   fifo:               0 ||
   frame:              0 ||
   compressed:         0 ||
   multicast:         83 || 0x00000053
TX bytes:          58505 || 0x0000e489
   packets:          535 || 0x00000217
   errs:               0 ||
   drop:               0 ||
   fifo:               0 ||
   colls:              0 ||
   carrier:            0 ||
   compressed:         0 ||
Since there was a small delay (I read /proc/self/net/dev a few seconds after taking the dump),
these values are off by a few packets compared to the memory dump.
From this, I deduce the following layout:
0x7f87bb4c6000 TX Packets
0x7f87bb4c6008 RX Packets
0x7f87bb4c6010 * corruption not seen by memtester for whatever reason *
0x7f87bb4c6018 ???
0x7f87bb4c6020 * corruption not seen by memtester for whatever reason *
0x7f87bb4c6028 ???
0x7f87bb4c6030 ???
0x7f87bb4c6038 RX multicast (?)
So the only thing that is fully clear is that TX packets come first, followed by RX packets.
Checking the driver sources, I found that rtnl_link_stats64 cannot be the culprit, since it has rx_packets first and tx_packets only after that.
struct rtl8169_counters {
	__le64	tx_packets;
	__le64	rx_packets;
	__le64	tx_errors;
	__le32	rx_errors;
	__le16	rx_missed;
	__le16	align_errors;
	__le32	tx_one_collision;
	__le32	tx_multi_collision;
	__le64	rx_unicast;
	__le64	rx_broadcast;
	__le32	rx_multicast;
	__le16	tx_aborted;
	__le16	tx_underun;
};
This looks like it could very well match the structure found in memory, so something related to rtl8169_do_counters, i.e. the DMA transfer of the hardware counters, would be broken.
Does this help? Can I provide more info? I have the feeling this affects many tens of thousands of systems and has just stayed hidden because
network stats are rarely read...
Cheers,
Oliver