Message-ID: <533DA12B.8090904@ahsoftware.de>
Date:	Thu, 03 Apr 2014 19:58:03 +0200
From:	Alexander Holler <holler@...oftware.de>
To:	Sebastian Hesselbarth <sebastian.hesselbarth@...il.com>
CC:	Florian Fainelli <f.fainelli@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	netdev <netdev@...r.kernel.org>,
	Michal Simek <michal.simek@...inx.com>,
	David Miller <davem@...emloft.net>
Subject: Bug(s) with netconsole (using mv643xx_eth on Kirkwood)

(I've changed the topic and removed stable@ from the cc-list to reflect 
the current status)

(Long mail, but hopefully a good problem description)

I have known about problems with netconsole and mv643xx_eth for
4 years, but didn't care much because everything else worked flawlessly;
I had even forgotten that I had enabled netconsole. (But the bug I
experienced 4 years ago, seeing no messages remotely from netconsole,
seems to have disappeared.)

But now, using 3.14, I hit a bug which killed the ethernet with a 100%
success rate, and, after digging a bit, I've come to the conclusion
that netconsole (together with a maybe broken initialization of the PHY) 
is the source of the problem.

The kernel is 3.14 (mainline) with one reverted patch (7cd1463). This
patch changed the initialization of the PHY in such a way that the
ethernet dies, 100% reproducibly, on a Kirkwood 88F6281 based machine.
Reverting that patch gives me a one-line bug-enabler:

------
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index e891b48..246f065 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -2095,7 +2095,8 @@ static void port_start(struct mv643xx_eth_private *mp)
                 struct ethtool_cmd cmd;

                 mv643xx_eth_get_settings(mp->dev, &cmd);
-               phy_reset(mp);
+               //phy_reset(mp);
+               phy_init_hw(mp->phy);
                 mv643xx_eth_set_settings(mp->dev, &cmd);
                 phy_start(mp->phy);
         }
------

First, I'll describe what happens at boot:

- Bootloader (U-Boot) enables (somehow) the network such that it is
usable as a console for the bootloader,
- Kernel is loaded and started with netconsole enabled through the 
kernel command line (netconsole=...),
- eth driver probe => PHY reset
- netconsole initializes the network (netpoll_setup) => PHY reset,
- userland starts,
- userland configures network (ip addr add fixedIP ..., a hack used for
a very early ntpdate before the rootfs becomes rw); I'm not sure whether
that ends up in a PHY reset again,
- userland starts network by using dhcpcd => PHY reset

Now several use cases:

Case 1:
Using plain 3.14, the last step fails with no carrier because the PHY
ends up in a never-ending reset (BMCR_RESET always set) in
m88e1111_config_init(), called by phy_init_hw() in port_start() in
mv643xx_eth.

Case 2:
Without enabling netconsole through the kernel command line, I see no 
problems.

Case 3:
If I enable the old phy_reset() in mv643xx_eth, I see no problems.

Case 4:
If I reduce the time the newly used reset in phy_init_hw() spends in
m88e1111_config_init(), from calling mdelay(500) twice to a few
milliseconds, by polling for a cleared BMCR_RESET instead, I see no
problems.
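
The polling idea from Case 4 can be sketched in user space as follows.
This is not the patch used above, only an illustration of the pattern
under stated assumptions: poll_reset_done(), the two callback types and
the fake_phy simulation are hypothetical names, and the callbacks stand
in for whatever register read and delay primitives the driver would
really use. MII_BMCR and BMCR_RESET carry the values from the kernel's
mii.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same values as in the kernel's include/uapi/linux/mii.h. */
#define MII_BMCR   0x00    /* basic mode control register */
#define BMCR_RESET 0x8000  /* software reset bit, cleared by the PHY */

/* Hypothetical hooks standing in for a register read and a delay. */
typedef int (*mii_read_fn)(void *ctx, int regnum);
typedef void (*delay_ms_fn)(void *ctx, unsigned int ms);

/*
 * Poll BMCR once per millisecond until BMCR_RESET is cleared.
 * Returns the number of milliseconds waited, a negative value passed
 * through from a failed read, or -ETIMEDOUT if the bit never clears.
 */
int poll_reset_done(mii_read_fn read_reg, delay_ms_fn delay_ms,
                    void *ctx, unsigned int timeout_ms)
{
    unsigned int waited;

    assert(read_reg != NULL && delay_ms != NULL);

    for (waited = 0; waited <= timeout_ms; waited++) {
        int bmcr = read_reg(ctx, MII_BMCR);

        if (bmcr < 0)
            return bmcr;            /* read error */
        if (!(bmcr & BMCR_RESET))
            return (int)waited;     /* reset finished */
        delay_ms(ctx, 1);
    }
    return -ETIMEDOUT;              /* the "never-ending reset" case */
}

/* Simulated PHY: the reset bit stays set for reset_ms milliseconds. */
struct fake_phy {
    unsigned int reset_ms;
    unsigned int elapsed_ms;
};

int fake_phy_read(void *ctx, int regnum)
{
    struct fake_phy *phy = ctx;

    if (regnum != MII_BMCR)
        return 0;
    return phy->elapsed_ms < phy->reset_ms ? BMCR_RESET : 0;
}

void fake_delay_ms(void *ctx, unsigned int ms)
{
    struct fake_phy *phy = ctx;

    phy->elapsed_ms += ms;          /* advance simulated time only */
}
```

With a simulated PHY that needs 3 ms, the helper returns after 3 ms
instead of a fixed second; with one that never clears the bit, it gives
up with -ETIMEDOUT, which would make a stuck reset visible instead of
silently continuing.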

Case 5:
If I disable the initialization of the network in the bootloader, 
netconsole even worked 4 years ago. But I haven't looked into that case 
further, because I always want to use the network as a console for the 
bootloader.


Current assumption:

So, after having spent too much time diagnosing the above (so
I was right to ignore the non-working netconsole for 4 years), I've
come to the conclusion that some synchronization between
netconsole/netpoll and the normal network stack or mv643xx_eth is
missing. That would explain why the PHY ends up in a never-ending reset
and why this only happens reproducibly if the PHY reset takes a whole
second by using mdelay(500) twice (which likely leaves room for a task
switch to netconsole in between). It might be a hw problem too (I
haven't read the datasheet or looked for any errata).

I hope everyone who was missing information is happy now; otherwise
I have (again) wasted time typing up a problem description (not to
mention the time already spent trying to diagnose the problem).

So go on and try to take the almost low hanging fruit. I'm not sure if I
will spend more time on that topic as I already have a working 
patch/workaround and the discussion has become a bit tiresome. Sorry.

Regards,

Alexander Holler