linux-kernel - net: never suspend the ethernet PHY on certain boards?

Message-ID: <4693980.Yko7hG0E1C@ada>
Date:   Wed, 26 Jun 2019 13:23:24 +0200
From:   Alexander Dahl <ada@...rsis.com>
To:     netdev@...r.kernel.org
Cc:     linux-kernel@...r.kernel.org, Thomas Pfahl <tpf@...rsis.com>
Subject: net: never suspend the ethernet PHY on certain boards?

Hei hei,

tl;dr: is there a way to prevent an Ethernet PHY from ever powering down,
preferably with some dt configuration rather than a hack such as patching out
the suspend functions?

With the bugfix 0da70f808029476001109b6cb076737bc04cea2e ("net: macb: do not
disable MDIO bus at open/close time"), which came with kernel v4.19 and was
backported to v4.18.7, a problem arises for us. It had been masked for ages
and results from a special combination of SoC, Ethernet PHY and other chips on
the same board, together with the Linux drivers for them.

The boards use either an at91sam9g20 or a sama5d27 SoC, both using cadence/macb
as Ethernet driver. Both boards have an SMSC LAN8720A Ethernet PHY attached.
The RMII clock is generated by the PHY, which uses a 25 MHz crystal for that.
This clock line is of course fed into the SoC/MAC, but it is also used (you
might say hijacked) by other chips on the board, which depend on that clock
being _always_ on (at least after initial init on boot). The hardware cannot
be changed; we are talking about several hundred boards already sold over the
last years. O:-)

The symptom: when calling `ip link set down dev eth0`, that clock goes off and
the other chips depending on it (neither SoC nor PHY) freeze.

I could bisect this behaviour change on a vanilla kernel to the commit 
mentioned above (actually to the backport commit v4.18.7-4-g716fc5ce90cf, 
because I bisected from v4.17.19 to v4.18.20).

What I tracked down so far: before the bugfix, macb_close() cleared the MPE
bit in the MAC Network Control Register, which probably prevents the MAC from
sending MDIO telegrams to the PHY? After the bugfix, that bit is not cleared
anymore (so that other PHYs on the same MDIO bus can still be reached; we
don't have that case). I assume communicating with the PHY is still possible
then.

macb_close() also calls phy_stop(), which sets the state of the PHY driver
state machine to PHY_HALTED. With the next run of that state machine,
phy_suspend() is called.

The smsc PHY driver has no special suspend/resume functions, but uses
genphy_suspend(), which sets BMCR_PDOWN in the MII_BMCR register of that
(standard-compliant) PHY. I suspect that after that the PHY powers down and
the clock goes off.

I assume that before the bugfix this power-down bit could not be set, because
the MDIO interface in the MAC had been disabled, so the PHY stayed on. (However,
there's a possible race, because in macb_close() phy_stop() is called before
macb_reset_hw(), right?)
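
The hypothesis above can be sketched as a small user-space model. Everything
named fake_* is hypothetical, and the model bakes in two assumptions: an MDIO
write only reaches the PHY while MPE is set, and the deferred suspend runs
after the close path has touched MPE (so the possible race is not modelled).
The MPE bit position (bit 4) is how I read drivers/net/ethernet/cadence/macb.h
and should be checked against the datasheet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NCR_MPE    (1u << 4)  /* Management port enable (assumed bit 4) */
#define BMCR_PDOWN 0x0800     /* Power-down bit in MII_BMCR */

/* Hypothetical model of MAC plus PHY state on one board. */
struct fake_board {
    uint32_t ncr;       /* MAC Network Control Register */
    uint16_t phy_bmcr;  /* BMCR as stored inside the PHY */
};

/* An MDIO write is only delivered while MPE is set in the MAC. */
bool fake_mdio_write_bmcr(struct fake_board *b, uint16_t val)
{
    assert(b != NULL);
    if (!(b->ncr & NCR_MPE))
        return false;   /* management port off, PHY never sees it */
    b->phy_bmcr = val;
    return true;
}

/* What the PHY state machine does after phy_stop(): try to power down. */
void fake_deferred_suspend(struct fake_board *b)
{
    assert(b != NULL);
    (void)fake_mdio_write_bmcr(b, (uint16_t)(b->phy_bmcr | BMCR_PDOWN));
}

/*
 * Close path. keep_mpe == false models the behaviour before the bugfix
 * (MPE cleared on close), keep_mpe == true the behaviour after it.
 */
void fake_close(struct fake_board *b, bool keep_mpe)
{
    assert(b != NULL);
    if (!keep_mpe)
        b->ncr &= ~NCR_MPE;
    fake_deferred_suspend(b);  /* assumed to run after the line above */
}

/* In this model the RMII clock runs as long as the PHY is not powered down. */
bool fake_rmii_clock_running(const struct fake_board *b)
{
    assert(b != NULL);
    return !(b->phy_bmcr & BMCR_PDOWN);
}
```

With keep_mpe == false the power-down write is dropped and the clock keeps
running. With keep_mpe == true the write goes through and the clock stops,
which matches the observed symptom. This says nothing about the real hardware.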

So far, these are mostly assumptions. I did not use gdb on the drivers or a
logic analyzer on the MDIO lines. I could do that to prove them, however.

What I could do:

1) Revert that change on my tree, which would mean reverting a generic bugfix
2) Patch the smsc PHY driver to not suspend anymore
3) Invent some new way to prevent suspend on a configuration basis (dt?)
4) Anything I did not think of yet

I know 1) or 2) are hacks without a chance to make it to mainline. What would 
be your suggestions for 3) and 4)?
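
To illustrate what 3) could look like in principle, here is a user-space
sketch of a suspend path that honours a per-board "keep the clock on" setting.
This is purely hypothetical: neither the keep_clock_on field nor
fake_board_aware_suspend() exist in the kernel, and how such a flag would be
filled (for example from a device tree property) is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BMCR_PDOWN 0x0800  /* Power-down bit in MII_BMCR */

/* Hypothetical per-PHY state, not the kernel's struct phy_device. */
struct fake_phydev {
    uint16_t bmcr;       /* modelled content of MII_BMCR */
    bool keep_clock_on;  /* hypothetical flag from board configuration */
};

/*
 * Suspend callback sketch: skip the power-down when the board says that
 * other chips depend on the clock generated by the PHY.
 */
int fake_board_aware_suspend(struct fake_phydev *phy)
{
    assert(phy != NULL);
    if (phy->keep_clock_on)
        return 0;            /* report success, leave the PHY powered */
    phy->bmcr |= BMCR_PDOWN; /* default behaviour: power down */
    return 0;
}
```

The design choice in this sketch is to keep the default behaviour untouched
and make the exception opt-in per board, so boards without the shared clock
would still save power on link down.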

Greets
Alex