Date: Mon, 12 Aug 2019 12:10:21 -0300
From: Fabio Estevam <festevam@...il.com>
To: Russell King - ARM Linux admin <linux@...linux.org.uk>
Cc: "moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE" <linux-arm-kernel@...ts.infradead.org>,
    netdev <netdev@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>,
    Florian Fainelli <f.fainelli@...il.com>,
    Heiner Kallweit <hkallweit1@...il.com>
Subject: Re: [BUG] fec mdio times out under system stress

Hi Russell,

On Sun, Aug 11, 2019 at 10:37 AM Russell King - ARM Linux admin
<linux@...linux.org.uk> wrote:
>
> Hi Fabio,
>
> When I woke up this morning, I found that one of the Hummingboards
> had gone offline (as in, lost network link) during the night.
> Investigating, I find that the system had gone into OOM, and at
> that time, triggered an unrelated:
>
> [4111697.698776] fec 2188000.ethernet eth0: MDIO read timeout
> [4111697.712996] MII_DATA: 0x6006796d
> [4111697.729415] MII_SPEED: 0x0000001a
> [4111697.745232] IEVENT: 0x00000000
> [4111697.745242] IMASK: 0x0a8000aa
> [4111698.002233] Atheros 8035 ethernet 2188000.ethernet-1:00: PHY state change RUNNING -> HALTED
> [4111698.009882] fec 2188000.ethernet eth0: Link is Down
>
> This is on a dual-core iMX6.
>
> It looks like the read actually completed (since MII_DATA contains
> the register data) but we somehow lost the interrupt (or maybe
> received the interrupt after wait_for_completion_timeout() timed
> out.)
>
> From what I can see, the OOM events happened on CPU1, CPU1 was
> allocated the FEC interrupt, and the PHY polling that suffered the
> MDIO timeout was on CPU0.
>
> Given that IEVENT is zero, it seems that CPU1 had read serviced the
> interrupt, but it is not clear how far through processing that it
> was - it may be that fec_enet_interrupt() had been delayed by the
> OOM condition.
>
> This seems rather fragile - as the system slowing down due to OOM
> triggers the network to completely collapse by phylib taking the
> PHY offline, making the system inaccessible except through the
> console.
>
> In my case, even serial console wasn't operational (except for
> magic sysrq). Not sure what agetty was playing at... so the only
> way I could recover any information from the system was to connect
> the HDMI and plug in a USB keyboard.
>
> Any thoughts on how FEC MDIO accesses could be made more robust?

Sorry for the delay. I am currently on vacation with limited e-mail access.

I think it is worth trying Andrew's suggestion to increase FEC_MII_TIMEOUT.

Thanks