Date:   Tue, 18 Apr 2017 22:56:00 -0700
From:   Stefan Agner <stefan@...er.ch>
To:     Andy Duan <fugang.duan@....com>
Cc:     fugang.duan@...escale.com, festevam@...il.com,
        netdev@...r.kernel.org, netdev-owner@...r.kernel.org
Subject: RE: FEC on i.MX 7 transmit queue timeout

On 2017-04-18 22:28, Andy Duan wrote:
> From: Stefan Agner <stefan@...er.ch> Sent: Wednesday, April 19, 2017 1:02 PM
>>To: Andy Duan <fugang.duan@....com>
>>Cc: fugang.duan@...escale.com; festevam@...il.com;
>>netdev@...r.kernel.org; netdev-owner@...r.kernel.org
>>Subject: Re: FEC on i.MX 7 transmit queue timeout
>>
>>Hi Andy,
>>
>>On 2017-04-18 19:24, Andy Duan wrote:
>>> On 2017-04-19 03:46, Stefan Agner wrote:
>>>> Hi,
>>>>
>>>> I noticed last week on upstream (v4.11-rc6) on a Colibri iMX7 board
>>>> that after a while (~10 minutes) the netdev watchdog prints a
>>>> stacktrace and the driver then continuously dumps the TX ring. I then
>>>> did a quick test with 4.10, and realized it actually suffers the same
>>>> issue, so it seems not to be a regression. I use a rootfs mounted over NFS...
>>>>
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x240/0x244
>>>> NETDEV WATCHDOG: eth0 (fec): transmit queue 2 timed out
>>>> Modules linked in:
>>>> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc7-00030-g2c4e6bd0c4f0-dirty #330
>>>> Hardware name: Freescale i.MX7 Dual (Device Tree)
>>>> [<c02293f0>] (unwind_backtrace) from [<c0225820>] (show_stack+0x10/0x14)
>>>> [<c0225820>] (show_stack) from [<c050db6c>] (dump_stack+0x90/0xa0)
>>>> [<c050db6c>] (dump_stack) from [<c023ae68>] (__warn+0xac/0x11c)
>>>> [<c023ae68>] (__warn) from [<c023af10>] (warn_slowpath_fmt+0x38/0x48)
>>>> [<c023af10>] (warn_slowpath_fmt) from [<c088bb8c>] (dev_watchdog+0x240/0x244)
>>>> [<c088bb8c>] (dev_watchdog) from [<c0294798>] (run_timer_softirq+0x24c/0x708)
>>>> [<c0294798>] (run_timer_softirq) from [<c023f584>] (__do_softirq+0x12c/0x2a8)
>>>> [<c023f584>] (__do_softirq) from [<c023f8c4>] (irq_exit+0xdc/0x13c)
>>>> [<c023f8c4>] (irq_exit) from [<c02818ac>] (__handle_domain_irq+0xa4/0xf8)
>>>> [<c02818ac>] (__handle_domain_irq) from [<c0201624>] (gic_handle_irq+0x34/0xa4)
>>>> [<c0201624>] (gic_handle_irq) from [<c0226338>] (__irq_svc+0x58/0x8c)
>>>> Exception stack(0xc1201f30 to 0xc1201f78)
>>>> 1f20:                                     c0233320 00000000 00000000 01400000
>>>> 1f40: c1203d80 ffffe000 00000000 00000000 c107bf10 c0e055b5 c1203d34 00000001
>>>> 1f60: c07d2324 c1201f80 c0222ac8 c0222acc 60000013 ffffffff
>>>> [<c0226338>] (__irq_svc) from [<c0222acc>] (arch_cpu_idle+0x38/0x3c)
>>>> [<c0222acc>] (arch_cpu_idle) from [<c0275f24>] (do_idle+0xa8/0x250)
>>>> [<c0275f24>] (do_idle) from [<c02760e4>] (cpu_startup_entry+0x18/0x1c)
>>>> [<c02760e4>] (cpu_startup_entry) from [<c1000aa0>] (start_kernel+0x3fc/0x45c)
>>>> ---[ end trace 5b0c6dc3466a7918 ]---
>>>> fec 30be0000.ethernet eth0: TX ring dump
>>>> Nr     SC     addr       len  SKB
>>>>    0    0x1c00 0x00000000  590   (null)
>>>>    1    0x1c00 0x00000000  590   (null)
>>>>    2    0x1c00 0x00000000   42   (null)
>>>>    3  H 0x1c00 0x00000000   42   (null)
>>>>    4 S  0x0000 0x00000000    0   (null)
>>>>    5    0x0000 0x00000000    0   (null)
>>>>    6    0x0000 0x00000000    0   (null)
>>>>    7    0x0000 0x00000000    0   (null)
>>>>    8    0x0000 0x00000000    0   (null)
>>>>    9    0x0000 0x00000000    0   (null)
>>>>   10    0x0000 0x00000000    0   (null)
>>>>   11    0x0000 0x00000000    0   (null)
>>>>   12    0x0000 0x00000000    0   (null)
>>>>   13    0x0000 0x00000000    0   (null)
>>>>   14    0x0000 0x00000000    0   (null)
>>>>   15    0x0000 0x00000000    0   (null)
>>>>   16    0x0000 0x00000000    0   (null)
>>>>   17    0x0000 0x00000000    0   (null)
>>>>   18    0x0000 0x00000000    0   (null)
>>>> ...
>>>>
>>>>
>>>> A second TX ring dump from 4.10:
>>>> fec 30be0000.ethernet eth0: TX ring dump
>>>> Nr     SC     addr       len  SKB
>>>>    0    0x1c00 0x00000000   42   (null)
>>>>    1    0x1c00 0x00000000   42   (null)
>>>>    2    0x1c00 0x00000000   90   (null)
>>>>    3    0x1c00 0x00000000   90   (null)
>>>>    4    0x1c00 0x00000000   90   (null)
>>>>    5    0x1c00 0x00000000  218   (null)
>>>>    6    0x1c00 0x00000000  218   (null)
>>>>    7    0x1c00 0x00000000  218   (null)
>>>>    8    0x1c00 0x00000000   90   (null)
>>>>    9    0x1c00 0x00000000  206   (null)
>>>>   10    0x1c00 0x00000000  216   (null)
>>>>   11    0x1c00 0x00000000  216   (null)
>>>>   12    0x1c00 0x00000000  216   (null)
>>>>   13    0x1c00 0x00000000  311   (null)
>>>>   14    0x1c00 0x00000000  178   (null)
>>>>   15    0x1c00 0x00000000  311   (null)
>>>>   16    0x1c00 0x00000000  206   (null)
>>>>   17  H 0x1c00 0x00000000  311   (null)
>>>>   18 S  0x0000 0x00000000    0   (null)
>>>>   19    0x0000 0x00000000    0   (null)
>>> The dump shows the tx ring is fine.
>>>
>>>>
>>>> The ring dump prints continuously, but I can access console every now
>>>> and then. I noticed that the second interrupt seems static (66441, TX
>>>> interrupt?):
>>>>   58:         18     GIC-0 150 Level     30be0000.ethernet
>>>>   59:      66441     GIC-0 151 Level     30be0000.ethernet
>>>>   60:      70477     GIC-0 152 Level     30be0000.ethernet
>>> IRQ 150 is for tx/rx queue 1 receive/transmit buffer/frame done.
>>> IRQ 151 is for tx/rx queue 2 receive/transmit buffer/frame done.
>>> IRQ 152 is for tx/rx queue 0 receive/transmit buffer/frame
>>> done, the MII interrupt and others.
>>>
>>> The i.MX7D enet has three queues each for tx and rx.
>>> It seems __netdev_pick_tx() picks tx queue 1 only very rarely.
>>
>>Oh ok I see, and it seems to choose queue 2 fairly often...
>>
>>>> Anybody else seen this? Any idea?
>>>>
>>>> In 4.10 as well as 4.11-rc6 the interrupt counts were just over 65536...
>>>> pure chance?
>>>>
>>>>
>>> You can use ethtool to set the irq coalescing like:
>>> ethtool -C eth0 rx-frames 80
>>> ethtool -C eth0 rx-usecs 600
>>> ethtool -C eth0 tx-frames 64
>>> ethtool -C eth0 tx-usecs 700
>>>
>>>
>>> You didn't run any test case, just mounted the rootfs over NFS?
>>> I will set up an imx7d sdb board to run it.
>>
>>I noticed it without doing anything, just booting via NFS. There was always a little
>>bit of activity, at least according to the link (blinks every ~5s).
>>
>>It seemed that it happened a bit earlier when using iperf to exacerbate the
>>problem...
>>
>>I noticed that erratum 7885 is not mentioned in the i.MX 7 errata, so I created a
>>new devtype:
>>
>>        }, {
>>                .name = "imx7d-fec",
>>                .driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
>>                                FEC_QUIRK_HAS_BUFDESC_EX | FEC_QUIRK_HAS_CSUM |
>>                                FEC_QUIRK_HAS_VLAN | FEC_QUIRK_BUG_CAPTURE |
>>                                FEC_QUIRK_HAS_RACC | FEC_QUIRK_HAS_COALESCE,
>>        }, {
>>
> 
> The upstream driver doesn't have a platform_device_id for
> "imx7d-fec"; imx7d enet still uses the imx6sx-fec device id.
> Your entry is missing the FEC_QUIRK_ERR007885 and FEC_QUIRK_HAS_AVB quirk flags.

Downstream also uses imx6sx-fec, at least in the 4.1.15 GA 2.0.0 release:
http://git.freescale.com/git/cgit.cgi/imx/linux-imx.git/tree/arch/arm/boot/dts/imx7d.dtsi?h=imx_4.1.15_2.0.0_ga#n1380

However, with downstream Linux 4.1 the kernel seems to only use queue 0:
292:          0     GPCV2 118 Edge      30be0000.ethernet
293:          0     GPCV2 119 Edge      30be0000.ethernet
294:     204929     GPCV2 120 Edge      30be0000.ethernet
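
To compare these counters between kernels, the rows can be parsed and mapped to queues. This is only a minimal sketch, not driver code: `struct irq_row`, `parse_irq_row()` and `fec_queue_for_hwirq()` are hypothetical names, the mapping is the one Andy gave for GIC lines 150 to 152, and it assumes (unverified here) that the GPCV2 numbers 118 to 120 are the same three lines offset by 32.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One row of /proc/interrupts as pasted in this thread (single CPU column). */
struct irq_row {
	int linux_irq;        /* first column, e.g. 59 */
	unsigned long count;  /* interrupt count on CPU0 */
	char chip[16];        /* "GIC-0" or "GPCV2" */
	int hwirq;            /* 150..152 upstream, 118..120 downstream */
};

/* Parse e.g. "  59:  66441  GIC-0 151 Level  30be0000.ethernet".
 * Returns 0 on success, -1 if the line does not have that shape. */
static int parse_irq_row(const char *line, struct irq_row *row)
{
	assert(line != NULL && row != NULL);
	if (sscanf(line, " %d: %lu %15s %d", &row->linux_irq, &row->count,
		   row->chip, &row->hwirq) != 4)
		return -1;
	return 0;
}

/* Queue mapping as described by Andy: 150 -> queue 1, 151 -> queue 2,
 * 152 -> queue 0. ASSUMPTION: GPCV2 118..120 are the same lines minus 32.
 * Returns -1 for any other hwirq. */
static int fec_queue_for_hwirq(const struct irq_row *row)
{
	int base = strcmp(row->chip, "GPCV2") == 0 ? 118 : 150;

	switch (row->hwirq - base) {
	case 0:
		return 1;
	case 1:
		return 2;
	case 2:
		return 0;
	default:
		return -1;
	}
}
```

Under that assumption, the downstream row with hwirq 120 maps to queue 0, which matches the observation that only queue 0 is used there.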


> 
> You can add these.

I guess if i.MX 7 does not suffer from ERR007885 it would be good to add a
new devtype, correct? This also needs a device tree change, since
imx6sx-fec is still in the compatible list... I saw that you sent a
patch to add ERR007885 for imx6ul as well ("net: fec: add ERR007885 for
i.MX6ul enet IP").

My earlier run which showed the stack trace again actually still had
imx6sx-fec in the device tree compatible string, and hence used
ERR007885! So I need to test again...
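
Why the compatible string decides which quirks are applied can be sketched roughly as follows. This is a simplified model, not the kernel's actual matching code: `sketch_match`, `sketch_table` and `sketch_lookup()` are hypothetical names, the bit value of `SKETCH_QUIRK_ERR007885` is made up for illustration, and the names are written as in this thread (real device tree strings carry a vendor prefix).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit value, for illustration only; the real flag is
 * FEC_QUIRK_ERR007885 in the driver. */
#define SKETCH_QUIRK_ERR007885 (1u << 0)

struct sketch_match {
	const char *compatible;
	unsigned int quirks;
};

/* Driver-side table: the existing imx6sx entry plus the proposed imx7d
 * entry without the erratum flag. */
static const struct sketch_match sketch_table[] = {
	{ "imx6sx-fec", SKETCH_QUIRK_ERR007885 },
	{ "imx7d-fec",  0 },
};

/* Walk the node's compatible list from most to least specific; the first
 * string the driver knows wins. Returns NULL if nothing matches. */
static const struct sketch_match *sketch_lookup(const char *const *compat,
						size_t n)
{
	assert(compat != NULL);
	for (size_t i = 0; i < n; i++)
		for (size_t j = 0;
		     j < sizeof(sketch_table) / sizeof(sketch_table[0]); j++)
			if (strcmp(compat[i], sketch_table[j].compatible) == 0)
				return &sketch_table[j];
	return NULL;
}
```

With only imx6sx-fec in the node's list, the lookup returns the entry carrying the erratum flag, regardless of what the new imx7d entry contains; with imx7d-fec listed first, the new entry is selected.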


> I validated an imx7d sdb board with 4.11.0-rc6: no such problem after
> more than 3.5 hours with the rootfs mounted over NFS.
> 

Hm, the Colibri iMX7 uses a different PHY and only supports Fast
Ethernet. Also, I am actually testing on an i.MX 7Solo, but I can test
on an i.MX 7Dual tomorrow. But again, with downstream, which only uses
queue 0, the issue never appeared.

--
Stefan
