linux-kernel - Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e4864046-e52f-63b6-a490-74c3cd8045f4@nvidia.com>
Date:   Tue, 13 Apr 2021 09:41:18 +0100
From:   Jon Hunter <jonathanh@...dia.com>
To:     Joakim Zhang <qiangqing.zhang@....com>,
        Giuseppe Cavallaro <peppe.cavallaro@...com>,
        Alexandre Torgue <alexandre.torgue@...com>,
        Jose Abreu <joabreu@...opsys.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-tegra <linux-tegra@...r.kernel.org>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
 resume back


On 01/04/2021 17:28, Jon Hunter wrote:
> 
> On 31/03/2021 12:41, Joakim Zhang wrote:
> 
> ...
> 
>>> In answer to your question, resuming from suspend does work on this board
>>> without your change. We have been testing suspend/resume now on this board
>>> since Linux v5.8 and so we have the ability to bisect such regressions. So it is
>>> clear to me that this is the change that caused this, but I am not sure why.
>>
>> Yes, I know this issue is regression caused by my patch. I just want to analyze the potential reasons. Due to the code change only related to the page recycle and reallocate.
>> So I guess if this page operate need IOMMU works when IOMMU is enabled. Could you help check if IOMMU driver resume before STMMAC? Our common desire is to find the root cause, right?
> 
> 
> Yes of course that is the desire here indeed. I had assumed that the
> suspend/resume order was good because we have never seen any problems,
> but nonetheless it is always good to check. Using ftrace I enabled
> tracing of the appropriate suspend/resume functions and this is what
> I see ...
> 
> # tracer: function
> #
> # entries-in-buffer/entries-written: 4/4   #P:6
> #
> #                                _-----=> irqs-off
> #                               / _----=> need-resched
> #                              | / _---=> hardirq/softirq
> #                              || / _--=> preempt-depth
> #                              ||| /     delay
> #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
> #              | |         |   ||||      |         |
>          rtcwake-748     [000] ...1   536.700777: stmmac_pltfr_suspend <-platform_pm_suspend
>          rtcwake-748     [000] ...1   536.735532: arm_smmu_pm_suspend <-platform_pm_suspend
>          rtcwake-748     [000] ...1   536.757290: arm_smmu_pm_resume <-platform_pm_resume
>          rtcwake-748     [003] ...1   536.856771: stmmac_pltfr_resume <-platform_pm_resume
> 
> 
> So I don't see any ordering issues that could be causing this. 


Another thing I have found is that for our platform, if the driver for
the ethernet PHY (in this case broadcom PHY) is enabled, then it fails
to resume but if I disable the PHY in the kernel configuration, then
resume works. I have found that if I move the reinit of the RX buffers
to before the startup of the phy, then it can resume OK with the PHY
enabled.

Does the following work for you? Does your platform use a specific
ethernet PHY driver?

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 208cae344ffa..071d15d86dbe 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5416,19 +5416,20 @@ int stmmac_resume(struct device *dev)
                        return ret;
        }
+       rtnl_lock();
+       mutex_lock(&priv->lock);
+       stmmac_reinit_rx_buffers(priv);
+       mutex_unlock(&priv->lock);
+
        if (!device_may_wakeup(priv->device) || !priv->plat->pmt) {
-               rtnl_lock();
                phylink_start(priv->phylink);
                /* We may have called phylink_speed_down before */
                phylink_speed_up(priv->phylink);
-               rtnl_unlock();
        }
-       rtnl_lock();
        mutex_lock(&priv->lock);
        stmmac_reset_queues_param(priv);
-       stmmac_reinit_rx_buffers(priv);
        stmmac_free_tx_skbufs(priv);
        stmmac_clear_descriptors(priv);


It is still not clear to us why the existing call to
stmmac_clear_descriptors() is not sufficient to fix your problem.

How often does the issue you see occur?

Thanks
Jon

-- 
nvpublic