Message-ID: <1fd93860-47a9-268c-318f-03d5d70e721b@gmail.com>
Date: Thu, 31 Jan 2019 19:28:20 +0100
From: Heiner Kallweit <hkallweit1@...il.com>
To: Peter Ceiley <peter@...ley.net>, David Chang <dchang@...e.com>
Cc: Realtek linux nic maintainers <nic_swsd@...ltek.com>,
netdev@...r.kernel.org
Subject: Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
Thanks for testing, Peter!
So we have an ASPM-related issue indeed. I'm aware that there are certain
incompatibilities between board chipsets and network chip versions
(although it's not known which combinations are affected).
And we don't know whether it's a hardware or BIOS issue.
Older driver versions dealt with this by simply disabling ASPM in general.
As a result all systems with a supported Realtek chip didn't reach higher
package power-saving states, resulting in significantly reduced battery
lifetime on notebooks.
The network driver has no stake in dealing with the ASPM policy; this
is handled by the lower PCI layers.
Unfortunately we can't detect ASPM incompatibilities at runtime. Maybe
we could build some heuristics based on rx_missed percentage, but it's
not clear that ASPM issues always show the same symptoms.
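
A minimal sketch of such an rx_missed-percentage heuristic, assuming the counters are taken from "ethtool -S <if>" output in the "name: value" form Peter posts later in this thread, and assuming missed frames are not already counted in rx_packets. The function names and the 1% threshold are hypothetical and not part of r8169. As noted, this can miss affected systems: Peter's statistics show rx_missed: 0 even while the issue occurs.

```python
from typing import Dict

# Hypothetical threshold for illustration only; r8169 uses no such value.
SUSPICIOUS_RX_MISSED_PERCENT = 1.0


def parse_ethtool_stats(text: str) -> Dict[str, int]:
    """Parse `ethtool -S <if>` output ("name: value" lines) into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skips the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_missed_percent(stats: Dict[str, int]) -> float:
    """Missed frames as a percentage of all frames seen on RX.

    Assumes rx_missed is not already included in rx_packets.
    """
    received = stats.get("rx_packets", 0)
    missed = stats.get("rx_missed", 0)
    total = received + missed
    return 100.0 * missed / total if total else 0.0


def looks_like_aspm_issue(stats: Dict[str, int]) -> bool:
    """Heuristic only: a high miss rate *may* point at an ASPM problem."""
    return rx_missed_percent(stats) > SUSPICIOUS_RX_MISSED_PERCENT
```

With Peter's counters (rx_missed: 0) the heuristic reports nothing, which illustrates the limitation mentioned above.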
So for now people with affected systems have to set a proper
pcie_aspm.policy parameter.
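
To check which policy is actually in effect, a sketch that reads the pcie_aspm module parameter from sysfs (the /sys/module/pcie_aspm location mentioned earlier in this thread) and picks out the bracketed, active entry. It assumes the usual format "[default] performance powersave powersupersave"; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# sysfs file exposing the pcie_aspm.policy module parameter.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text: str) -> Optional[str]:
    """Return the bracketed (active) entry, e.g. "default" from
    "[default] performance powersave powersupersave"."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_active_aspm_policy(path: Path = POLICY_PATH) -> Optional[str]:
    """Read the active policy, or None if the file is missing/unreadable."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None
```

At boot the policy is chosen on the kernel command line, e.g. pcie_aspm.policy=performance, as in Peter's tests.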
What is not clear to me is why pcie_aspm=off doesn't help.
@David:
I assume you'll check with the affected user to test the ASPM policy
parameter.
Heiner
On 31.01.2019 13:09, Peter Ceiley wrote:
> Hi Heiner,
>
> A quick update on my testing with different pcie_aspm settings:
>
> pcie_aspm=off | no change
> pcie_aspm.policy=default | no change
> pcie_aspm.policy=performance | issue resolved
> pcie_aspm.policy=powersave | issue resolved
> pcie_aspm.policy=powersupersave | issue resolved
>
> It seems the new driver does not play nicely with the default ASPM policy.
>
> As requested, I've included an output of ethtool below when experiencing
> the issue - note that no errors are recorded.
>
> # ethtool -S enp3s0
> NIC statistics:
> tx_packets: 2749
> rx_packets: 4089
> tx_errors: 0
> rx_errors: 0
> rx_missed: 0
> align_errors: 0
> tx_single_collisions: 0
> tx_multi_collisions: 0
> unicast: 4078
> broadcast: 9
> multicast: 2
> tx_aborted: 0
> tx_underrun: 0
>
> David, I hope this helps for your user as well. I appreciate you sharing
> the bug ticket - thanks.
>
> Heiner, thanks very much for your help to date.
>
> Regards,
>
> Peter.
>
> On Thu, 31 Jan 2019 at 18:23, David Chang <dchang@...e.com> wrote:
>>
>> Hi Heiner,
>>
>> On Jan 31, 2019 at 07:35:30 +0100, Heiner Kallweit wrote:
>>> Hi David, two more things:
>>>
>>> 1. Could you please test a recent linux-next kernel?
>>> 2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
>>> and compare them.
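
A minimal sketch of that comparison, assuming each dump has been saved as text (e.g. "ethtool -d enp3s0 > regs-4.18.txt" on 4.18 and likewise on 4.19; the file names are hypothetical). It makes no assumption about the dump layout and simply reports the lines that differ, using Python's difflib.

```python
import difflib
import sys
from pathlib import Path
from typing import List


def diff_register_dumps(old: str, new: str,
                        old_name: str = "4.18",
                        new_name: str = "4.19") -> List[str]:
    """Return unified-diff lines between two saved `ethtool -d` dumps."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm=""))


# Usage: python diff_regs.py regs-4.18.txt regs-4.19.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    old_path, new_path = Path(sys.argv[1]), Path(sys.argv[2])
    for line in diff_register_dumps(old_path.read_text(), new_path.read_text(),
                                    old_path.name, new_path.name):
        print(line)
```

An empty result means the two dumps are identical line for line.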
>>
>> I'm sorry, I don't have the affected machine at hand. I will ask
>> our user to do the test. Thanks!
>>
>> Regards,
>> David
>>
>>>
>>> Heiner
>>>
>>>
>>> On 31.01.2019 07:21, Heiner Kallweit wrote:
>>>> David, thanks for the link to the bug ticket.
>>>> I think only a proper bisect can help to find the offending commit.
>>>>
>>>> Heiner
>>>>
>>>>
>>>> On 31.01.2019 03:32, David Chang wrote:
>>>>> Hi,
>>>>>
>>>>> We had a similar case here.
>>>>> - Realtek r8169 receive performance regression in kernel 4.19
>>>>> https://bugzilla.suse.com/show_bug.cgi?id=1119649
>>>>>
>>>>> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
>>>>> The major symptom is there are many rx_missed count.
>>>>>
>>>>>
>>>>> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> recently I had a case where pcie_aspm=off, for whatever reason, didn't
>>>>>> do the trick. Can you also check with pcie_aspm.policy=performance?
>>>>>
>>>>> We will give it a try later.
>>>>>
>>>>>> And please check with "ethtool -S <if>" whether the chip statistics
>>>>>> show a significant number of errors.
>>>>>>
>>>>>> If this doesn't help you may have to bisect to find the offending commit.
>>>>>
>>>>> We tried falling the driver back to each of the following earlier
>>>>> commits, but with no luck.
>>>>>
>>>>> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
>>>>> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
>>>>> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
>>>>> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
>>>>> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
>>>>>
>>>>> Thanks,
>>>>> David Chang
>>>>>
>>>>>>
>>>>>> Heiner
>>>>>>
>>>>>>
>>>>>> On 30.01.2019 10:59, Peter Ceiley wrote:
>>>>>>> Hi Heiner,
>>>>>>>
>>>>>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
>>>>>>> and this made no difference.
>>>>>>>
>>>>>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
>>>>>>> subsequently loaded the module in the running 4.19.18 kernel. I can
>>>>>>> confirm that this immediately resolved the issue and access to the NFS
>>>>>>> shares operated as expected.
>>>>>>>
>>>>>>> I presume this means it is an issue with the r8169 driver included in
>>>>>>> 4.19 onwards?
>>>>>>>
>>>>>>> To answer your last questions:
>>>>>>>
>>>>>>> Base Board Information
>>>>>>> Manufacturer: Alienware
>>>>>>> Product Name: 0PGRP5
>>>>>>> Version: A02
>>>>>>>
>>>>>>> ... and yes, the RTL8168 is the onboard network chip.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Peter.
>>>>>>>
>>>>>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> I think the vendor driver doesn't enable ASPM per default.
>>>>>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
>>>>>>>> A few older systems seem to have issues with ASPM. What kind of
>>>>>>>> system / mainboard are you using? Is the RTL8168 the onboard
>>>>>>>> network chip?
>>>>>>>>
>>>>>>>> Rgds, Heiner
>>>>>>>>
>>>>>>>>
>>>>>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
>>>>>>>>> Hi Heiner,
>>>>>>>>>
>>>>>>>>> Thanks, I'll do some more testing. It might not be the driver - I
>>>>>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
>>>>>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
>>>>>>>>> a good idea.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Peter.
>>>>>>>>>
>>>>>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> at a first glance it doesn't look like a typical driver issue.
>>>>>>>>>> What you could do:
>>>>>>>>>>
>>>>>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
>>>>>>>>>>
>>>>>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
>>>>>>>>>>
>>>>>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
>>>>>>>>>>
>>>>>>>>>> Any specific reason why you think root cause is in the driver and not
>>>>>>>>>> elsewhere in the network subsystem?
>>>>>>>>>>
>>>>>>>>>> Heiner
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
>>>>>>>>>>> Hi Heiner,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for getting back to me.
>>>>>>>>>>>
>>>>>>>>>>> No, I don't use jumbo packets.
>>>>>>>>>>>
>>>>>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
>>>>>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
>>>>>>>>>>> establishing a connection and is most notable, for example, on my
>>>>>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
>>>>>>>>>>> larger directories) to list the contents of each directory. Once a
>>>>>>>>>>> transfer begins on a file, I appear to get good bandwidth.
>>>>>>>>>>>
>>>>>>>>>>> I'm unsure of the best scientific data to provide you in order to
>>>>>>>>>>> troubleshoot this issue. Running the following
>>>>>>>>>>>
>>>>>>>>>>> netstat -s |grep retransmitted
>>>>>>>>>>>
>>>>>>>>>>> shows a steady increase in retransmitted segments each time I list the
>>>>>>>>>>> contents of a remote directory. For example, running 'ls' on a
>>>>>>>>>>> directory containing 345 media files with kernel 4.19.18 increased
>>>>>>>>>>> the retransmitted segments by 21, and the 'time' command showed
>>>>>>>>>>> the following:
>>>>>>>>>>> real 0m19.867s
>>>>>>>>>>> user 0m0.012s
>>>>>>>>>>> sys 0m0.036s
>>>>>>>>>>>
>>>>>>>>>>> The same command shows no retransmitted segments running kernel
>>>>>>>>>>> 4.18.16 and 'time' showed:
>>>>>>>>>>> real 0m0.300s
>>>>>>>>>>> user 0m0.004s
>>>>>>>>>>> sys 0m0.007s
>>>>>>>>>>>
>>>>>>>>>>> ifconfig does not show any RX/TX errors or dropped packets in either case.
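
A sketch that automates the measurement described above (retransmit delta plus wall-clock time around one command), assuming net-tools' "netstat -s" prints a line of the form "<N> segments retransmitted", which is what the grep above relies on. The helper names are hypothetical. Note the counter is system-wide, so other traffic during the run also ends up in the delta.

```python
import re
import subprocess
import time
from typing import List, Tuple

# Matches e.g. "    21 segments retransmitted" in `netstat -s` output.
RETRANS_RE = re.compile(r"^[ \t]*(\d+)[ \t]+segments retransmitted", re.MULTILINE)


def retransmitted_segments(netstat_output: str) -> int:
    """Extract the TCP 'segments retransmitted' counter from netstat -s text."""
    match = RETRANS_RE.search(netstat_output)
    if match is None:
        raise ValueError("no 'segments retransmitted' line found")
    return int(match.group(1))


def snapshot() -> int:
    """Current system-wide retransmitted-segment count."""
    out = subprocess.run(["netstat", "-s"], capture_output=True,
                         text=True, check=True).stdout
    return retransmitted_segments(out)


def measure(command: List[str]) -> Tuple[int, float]:
    """Run command; return (retransmitted-segment delta, wall-clock seconds)."""
    before = snapshot()
    start = time.monotonic()
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.monotonic() - start
    return snapshot() - before, elapsed
```

Usage would be something like measure(["ls", "/mnt/nas/media"]), where the mount point is a placeholder for the NFS share being listed.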
>>>>>>>>>>>
>>>>>>>>>>> dmesg XID:
>>>>>>>>>>> [ 2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
>>>>>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
>>>>>>>>>>>
>>>>>>>>>>> # lspci -vv
>>>>>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
>>>>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
>>>>>>>>>>> Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>>>>>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>>>>>>>>>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>>>>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>>>>>>> Latency: 0, Cache Line Size: 64 bytes
>>>>>>>>>>> Interrupt: pin A routed to IRQ 19
>>>>>>>>>>> Region 0: I/O ports at d000 [size=256]
>>>>>>>>>>> Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
>>>>>>>>>>> Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
>>>>>>>>>>> Capabilities: [40] Power Management version 3
>>>>>>>>>>> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
>>>>>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
>>>>>>>>>>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>>>>>>>>>> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>>>>>>>> Address: 0000000000000000 Data: 0000
>>>>>>>>>>> Capabilities: [70] Express (v2) Endpoint, MSI 01
>>>>>>>>>>> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
>>>>>>>>>>> <512ns, L1 <64us
>>>>>>>>>>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>>>>>>>> SlotPowerLimit 10.000W
>>>>>>>>>>> DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>>>>>>>>>>> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>>>>>>>>>>> MaxPayload 128 bytes, MaxReadReq 4096 bytes
>>>>>>>>>>> DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>>>>>>>>>>> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
>>>>>>>>>>> Latency L0s unlimited, L1 <64us
>>>>>>>>>>> ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>>>>>>>>>>> LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>>>>>>>>>>> ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>>>>>>>>>>> LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
>>>>>>>>>>> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>>>>>>>> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
>>>>>>>>>>> OBFF Via message/WAKE#
>>>>>>>>>>> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>>>>>>>>>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
>>>>>>>>>>> OBFF Disabled
>>>>>>>>>>> AtomicOpsCtl: ReqEn-
>>>>>>>>>>> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>>>>>>>>>>> Transmit Margin: Normal Operating Range,
>>>>>>>>>>> EnterModifiedCompliance- ComplianceSOS-
>>>>>>>>>>> Compliance De-emphasis: -6dB
>>>>>>>>>>> LnkSta2: Current De-emphasis Level: -6dB,
>>>>>>>>>>> EqualizationComplete-, EqualizationPhase1-
>>>>>>>>>>> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>>>>>>>>>> Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>>>>>>>>>>> Vector table: BAR=4 offset=00000000
>>>>>>>>>>> PBA: BAR=4 offset=00000800
>>>>>>>>>>> Capabilities: [d0] Vital Product Data
>>>>>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
>>>>>>>>>>> Not readable
>>>>>>>>>>> Capabilities: [100 v1] Advanced Error Reporting
>>>>>>>>>>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>>>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>>>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>>>>>>>> CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
>>>>>>>>>>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>>>>>>>>>> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
>>>>>>>>>>> ECRCChkCap+ ECRCChkEn-
>>>>>>>>>>> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>>>>>>>>>> HeaderLog: 00000000 00000000 00000000 00000000
>>>>>>>>>>> Capabilities: [140 v1] Virtual Channel
>>>>>>>>>>> Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
>>>>>>>>>>> Arb: Fixed- WRR32- WRR64- WRR128-
>>>>>>>>>>> Ctrl: ArbSelect=Fixed
>>>>>>>>>>> Status: InProgress-
>>>>>>>>>>> VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
>>>>>>>>>>> Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
>>>>>>>>>>> Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
>>>>>>>>>>> Status: NegoPending- InProgress-
>>>>>>>>>>> Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
>>>>>>>>>>> Capabilities: [170 v1] Latency Tolerance Reporting
>>>>>>>>>>> Max snoop latency: 71680ns
>>>>>>>>>>> Max no snoop latency: 71680ns
>>>>>>>>>>> Kernel driver in use: r8169
>>>>>>>>>>> Kernel modules: r8169
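
A sketch for reading the link's ASPM state out of "lspci -vv" output such as the above (here the LnkCtl line reads "ASPM L1 Enabled"), for example to compare boots with different pcie_aspm settings. It assumes the "LnkCtl: ASPM <state>;" layout shown above; the helper names are hypothetical, and lspci generally needs root to display the capability registers.

```python
import re
from typing import Optional

# Matches the ASPM part of e.g. "LnkCtl: ASPM L1 Enabled; RCB 64 bytes ...".
# "LnkCtl2:" lines do not match because of the "2" before the colon.
LNKCTL_ASPM_RE = re.compile(r"LnkCtl:\s+ASPM\s+([^;]+);")


def link_aspm_state(lspci_vv: str) -> Optional[str]:
    """Return the ASPM field of the first LnkCtl line, e.g. "L1 Enabled"."""
    match = LNKCTL_ASPM_RE.search(lspci_vv)
    return match.group(1).strip() if match else None


def aspm_enabled(lspci_vv: str) -> Optional[bool]:
    """True if any ASPM state is enabled on the link, None if not found."""
    state = link_aspm_state(lspci_vv)
    if state is None:
        return None
    return state != "Disabled"
```

For this NIC the input would be the output of "lspci -vv -s 03:00.0", run as root.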
>>>>>>>>>>>
>>>>>>>>>>> Please let me know if you have any other ideas in terms of testing.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> Peter.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been experiencing very poor network performance since Kernel
>>>>>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
>>>>>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
>>>>>>>>>>>>> 4.20.4 & 4.19.18).
>>>>>>>>>>>>>
>>>>>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
>>>>>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
>>>>>>>>>>>>> issue related to loading of the PHY driver; however, my symptoms
>>>>>>>>>>>>> differ in that I still have a network connection. I have attempted to
>>>>>>>>>>>>> reload the driver on a running system, but this does not improve the
>>>>>>>>>>>>> situation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
>>>>>>>>>>>>>
>>>>>>>>>>>>> lshw shows:
>>>>>>>>>>>>> description: Ethernet interface
>>>>>>>>>>>>> product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>>>>> vendor: Realtek Semiconductor Co., Ltd.
>>>>>>>>>>>>> physical id: 0
>>>>>>>>>>>>> bus info: pci@...0:03:00.0
>>>>>>>>>>>>> logical name: enp3s0
>>>>>>>>>>>>> version: 0c
>>>>>>>>>>>>> serial:
>>>>>>>>>>>>> size: 1Gbit/s
>>>>>>>>>>>>> capacity: 1Gbit/s
>>>>>>>>>>>>> width: 64 bits
>>>>>>>>>>>>> clock: 33MHz
>>>>>>>>>>>>> capabilities: pm msi pciexpress msix vpd bus_master cap_list
>>>>>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
>>>>>>>>>>>>> 1000bt-fd autonegotiation
>>>>>>>>>>>>> configuration: autonegotiation=on broadcast=yes driver=r8169
>>>>>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
>>>>>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
>>>>>>>>>>>>> resources: irq:19 ioport:d000(size=256)
>>>>>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter.
>>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> the description "poor network performance" is quite vague, therefore:
>>>>>>>>>>>>
>>>>>>>>>>>> - Can you provide any measurements?
>>>>>>>>>>>> - iperf results before and after
>>>>>>>>>>>> - statistics about dropped packets (rx and/or tx)
>>>>>>>>>>>> - Do you use jumbo packets?
>>>>>>>>>>>>
>>>>>>>>>>>> Also helpful would be the "lspci -vv" output for the network card and
>>>>>>>>>>>> the dmesg line with the chip XID.
>>>>>>>>>>>>
>>>>>>>>>>>> Heiner
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>