lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <217ae37d-f2b0-1805-5696-11644b058819@redhat.com>
Date:   Tue, 29 Sep 2020 22:08:56 +0200
From:   Hans de Goede <hdegoede@...hat.com>
To:     Petr Tesarik <ptesarik@...e.cz>,
        Heiner Kallweit <hkallweit1@...il.com>
Cc:     Realtek linux nic maintainers <nic_swsd@...ltek.com>,
        netdev@...r.kernel.org, linux-clk@...r.kernel.org
Subject: Re: RTL8402 stops working after hibernate/resume

Hi,

On 9/29/20 9:07 PM, Petr Tesarik wrote:
> Hi Heiner (and now also Hans)!
> 
> @Hans: I'm adding you to this conversation, because you're the author
> of commit b1e3454d39f99, which seems to break the r8169 driver on a
> laptop of mine.

Erm, no, as you bi-sected yourself already commit 9f0b54cd167219
("r8169: move switching optional clock on/off to pll power functions")

Broke your laptop, commit b1e3454d39f99 ("clk: x86: add "ether_clk" alias
for Bay Trail / Cherry Trail") is about 18 months older.

> On Fri, 25 Sep 2020 16:47:54 +0200
> Heiner Kallweit <hkallweit1@...il.com> wrote:
> 
>> On 25.09.2020 14:56, Petr Tesarik wrote:
>>> On Fri, 25 Sep 2020 11:52:41 +0200
>>> Petr Tesarik <ptesarik@...e.cz> wrote:
>>>    
>>>> On Fri, 25 Sep 2020 11:44:09 +0200
>>>> Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>>   
>>>>> On 25.09.2020 10:54, Petr Tesarik wrote:
>>>> [...]
>>>>>> Does it make sense to bisect the change that broke the driver for me, or should I rather dispose of this waste^Wlaptop in an environmentally friendly manner? I mean, would you eventually accept a workaround for a few machines with a broken BIOS?
>>>>>>        
>>>>> If the workaround is small and there's little chance to break other stuff: then usually yes.
>>>>> If you can spend the effort to bisect the issue, this would be appreciated.
>>>>
>>>> OK, then I'm going to give it a try.
>>>
>>> Done. The system freezes when this commit is applied:
>>>
>>> commit 9f0b54cd167219266bd3864570ae8f4987b57520
>>> Author: Heiner Kallweit <hkallweit1@...il.com>
>>> Date:   Wed Jun 17 22:55:40 2020 +0200
>>>
>>>      r8169: move switching optional clock on/off to pll power functions
>>>    
>> This sounds weird. On your system tp->clk should be NULL,

Heiner, why do you say that tp->clk should be NULL on Petr's
system? Because it is an x86 based system?

Some X86 SoCs, specifically, the more tablet oriented Bay and Cherry
Trail SoCs, which are much more ARM SoC like then other X86 SoCs do
also use the clock framework and the SoC has a number of external clk
pins which are typically used by audio codecs and by ethernet chips.

>> making
>> clk_prepare_enable() et al no-ops. Please check whether tp->clk
>> is NULL after the call to rtl_get_ether_clk().
> 
> This might be part of the issue. On my system tp->clk is definitely not
> NULL:
> 
> crash> *rtl8169_private.clk 0xffff9277aca58940
>    clk = 0xffff9277ac2c82a0
> 
> crash> *clk 0xffff9277ac2c82a0
> struct clk {
>    core = 0xffff9277aef65d00,
>    dev = 0xffff9277aed000b0,
>    dev_id = 0xffff9277aec60c00 "0000:03:00.2",
>    con_id = 0xffff9277ad04b080 "ether_clk",
>    min_rate = 0,
>    max_rate = 18446744073709551615,
>    exclusive_count = 0,
>    clks_node = {
>      next = 0xffff9277ad2428d8,
>      pprev = 0xffff9277aef65dc8
>    }
> }
> 
> The dev_id corresponds to the Ethernet controller:
> 
> 03:00.2 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL810xE PCI Express Fast Ethernet controller (rev 06)
> 
> Looking at clk_find(), it matches this entry in clocks:
> 
> struct clk_lookup {
>    node = {
>      next = 0xffffffffbc702f40,
>      prev = 0xffff9277bf7190c0
>    },
>    dev_id = 0x0,
>    con_id = 0xffff9277bf719524 "ether_clk",
>    clk = 0x0,
>    clk_hw = 0xffff9277ad2427f8
> }
> 
> That's because this kernel is built with CONFIG_PMC_ATOM=y, and looking
> at the platform initialization code, the "ether_clk" alias is created
> unconditionally. Hans already added.

Petr, unconditionally is not really correct here, just as claiming
above that my commit broke things was not really correct either.

I guess this is mostly semantics, but I don't appreciate
the accusatory tone here.

The code in question binds to a clk-pmc-atom platform_device which
gets instantiated by drivers/platform/x86/pmc_atom.c. Which in turn
binds to a PCI device which is only present on Bay Trail and Cherry
Trail SoCs.

IOW the commit operates as advertised in its Subject:
"clk: x86: add "ether_clk" alias for Bay Trail / Cherry Trail"

So with that all clarified lets try to see if we can figure out
*why* this is actually happening.

Petr, can you describe your hardware in some more detail,
in the bits quoted when you first Cc-ed me there is not that
much detail. What is the vendor and model of your laptop?

Looking closer at commit 9f0b54cd167219
("r8169: move switching optional clock on/off to pll power functions")
I notice that the functions which now enable/disable the clock:
rtl_pll_power_up() and rtl_pll_power_down()

Only run when the interface is up during suspend/resume.
Petr, I guess the laptop is not connected to ethernet when you
hibernate it?

That means that on resume the clock will not be re-enabled.

This is a subtle but important change and I believe that
this is what is breaking things. I guess that the PLL which
rtl_pll_power_up() / rtl_pll_power_down() controls is only
used for ethernet-timing.  But the external clock controlled
through pt->clk is a replacement for using an external
crystal with the r8169 chip. So with it disabled, the entire
chip does not have a clock and is essentially dead.
It can then e.g. not respond to any pci-e reads/writes done
to it.

So I believe that the proper fix for this is to revert
commit 9f0b54cd167219
("r8169: move switching optional clock on/off to pll power functions")

As that caused the whole chip's clock to be left off after
a suspend/resume while the interface is down.

Also some remarks about this while I'm being a bit grumpy about
all this anyways (sorry):

1. 9f0b54cd167219 ("r8169: move switching optional clock on/off
to pll power functions") commit's message does not seem to really
explain why this change was made...

2. If a git blame would have been done to find the commit adding
the clk support: commit c2f6f3ee7f22 ("r8169: Get and enable optional ether_clk clock")
then you could have known that the clk in question is an external
clock for the entire chip, the commit message pretty clearly states
this (although "the entire" part is implied only) :

"On some boards a platform clock is used as clock for the r8169 chip,
this commit adds support for getting and enabling this clock (assuming
it has an "ether_clk" alias set on it).

This is related to commit d31fd43c0f9a ("clk: x86: Do not gate clocks
enabled by the firmware") which is a previous attempt to fix this for some
x86 boards, but this causes all Cherry Trail SoC using boards to not reach
there lowest power states when suspending.

This commit (together with an atom-pmc-clk driver commit adding the alias)
fixes things properly by making the r8169 get the clock and enable it when
it needs it."

Regards,

Hans

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ