linux-kernel - Re: [PATCH 3/3] drivers/net/ftgmac100: fix DHCP potential failure with systemd

Message-ID: <3b14ad6f-f8a5-bc8c-f0be-d0fda8e908a1@gmail.com>
Date:   Wed, 23 Feb 2022 10:05:39 -0800
From:   Florian Fainelli <f.fainelli@...il.com>
To:     Heyi Guo <guoheyi@...ux.alibaba.com>, linux-kernel@...r.kernel.org
Cc:     Andrew Lunn <andrew@...n.ch>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Joel Stanley <joel@....id.au>,
        Guangbin Huang <huangguangbin2@...wei.com>,
        Hao Chen <chenhao288@...ilicon.com>,
        Arnd Bergmann <arnd@...db.de>,
        Dylan Hung <dylan_hung@...eedtech.com>, netdev@...r.kernel.org
Subject: Re: [PATCH 3/3] drivers/net/ftgmac100: fix DHCP potential failure
 with systemd



On 2/23/2022 9:55 AM, Florian Fainelli wrote:
> 
> 
> On 2/23/2022 3:39 AM, Heyi Guo wrote:
>> Hi Florian,
>>
>> On 2022/2/23 1:00 PM, Florian Fainelli wrote:
>>>
>>>
>>> On 2/22/2022 7:14 PM, Heyi Guo wrote:
>>>> DHCP failures were observed with systemd 247.6. The issue could be
>>>> reproduced by rebooting Aspeed 2600 and then running ifconfig ethX
>>>> down/up.
>>>>
>>>> It is caused by the following sequence in the driver:
>>>>
>>>> 1. ftgmac100_open() enables the net interface and calls phy_start()
>>>> 2. When the PHY link comes up, it calls netif_carrier_on() and then
>>>> the adjust_link callback
>>>> 3. ftgmac100_adjust_link() schedules the reset task
>>>> 4. ftgmac100_reset_task() then resets the MAC from a separate work item
>>>>
>>>> After step 2, systemd is notified and sends a DHCP discover packet,
>>>> which may be corrupted by the MAC reset in step 4.
>>>>
>>>> Call ftgmac100_reset() directly instead of scheduling a task to fix
>>>> the issue.
>>>>
>>>> Signed-off-by: Heyi Guo <guoheyi@...ux.alibaba.com>
>>>> ---
>>>> Cc: Andrew Lunn <andrew@...n.ch>
>>>> Cc: "David S. Miller" <davem@...emloft.net>
>>>> Cc: Jakub Kicinski <kuba@...nel.org>
>>>> Cc: Joel Stanley <joel@....id.au>
>>>> Cc: Guangbin Huang <huangguangbin2@...wei.com>
>>>> Cc: Hao Chen <chenhao288@...ilicon.com>
>>>> Cc: Arnd Bergmann <arnd@...db.de>
>>>> Cc: Dylan Hung <dylan_hung@...eedtech.com>
>>>> Cc: netdev@...r.kernel.org
>>>>
>>>>
>>>> ---
>>>>   drivers/net/ethernet/faraday/ftgmac100.c | 13 +++++++++++--
>>>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
>>>> b/drivers/net/ethernet/faraday/ftgmac100.c
>>>> index c1deb6e5d26c5..d5356db7539a4 100644
>>>> --- a/drivers/net/ethernet/faraday/ftgmac100.c
>>>> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
>>>> @@ -1402,8 +1402,17 @@ static void ftgmac100_adjust_link(struct 
>>>> net_device *netdev)
>>>>       /* Disable all interrupts */
>>>>       iowrite32(0, priv->base + FTGMAC100_OFFSET_IER);
>>>>   -    /* Reset the adapter asynchronously */
>>>> -    schedule_work(&priv->reset_task);
>>>> +    /* Release phy lock to allow ftgmac100_reset to aquire it, 
>>>> keeping lock
>>>
>>> typo: acquire
>>>
>> Thanks for the catch :)
>>>> +     * order consistent to prevent dead lock.
>>>> +     */
>>>> +    if (netdev->phydev)
>>>> +        mutex_unlock(&netdev->phydev->lock);
>>>> +
>>>> +    ftgmac100_reset(priv);
>>>> +
>>>> +    if (netdev->phydev)
>>>> +        mutex_lock(&netdev->phydev->lock);
>>>
>>> Do you really need to perform a full MAC reset whenever the link goes 
>>> up or down? Instead cannot you just extract the maccr configuration 
>>> which adjusts the speed and be done with it?
>>
>> This is the original behavior and not changed in this patch set, and 
>> I'm not familiar with the hardware design of ftgmac100, so I'd like to 
>> limit the changes to the code which really causes practical issues.
> 
> This unlocking and re-locking seems superfluous when you could introduce 
> a version of ftgmac100_reset() which does not acquire the PHY device 
> mutex, and have that version called from ftgmac100_adjust_link(). For 
> every other call site, you would acquire it. Something like this for 
> instance:
> 
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 691605c15265..98179c3fd9ee 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -1038,7 +1038,7 @@ static void ftgmac100_adjust_link(struct 
> net_device *netdev)
>          iowrite32(0, priv->base + FTGMAC100_OFFSET_IER);
> 
>          /* Reset the adapter asynchronously */
> -       schedule_work(&priv->reset_task);
> +       ftgmac100_reset(priv, false);
>   }
> 
>   static int ftgmac100_mii_probe(struct net_device *netdev)
> @@ -1410,10 +1410,8 @@ static int ftgmac100_init_all(struct ftgmac100 
> *priv, bool ignore_alloc_err)
>          return err;
>   }
> 
> -static void ftgmac100_reset_task(struct work_struct *work)
> +static void ftgmac100_reset(struct ftgmac100 *priv, bool lock_phy)
>   {
> -       struct ftgmac100 *priv = container_of(work, struct ftgmac100,
> -                                             reset_task);
>          struct net_device *netdev = priv->netdev;
>          int err;
> 
> @@ -1421,7 +1419,7 @@ static void ftgmac100_reset_task(struct 
> work_struct *work)
> 
>          /* Lock the world */
>          rtnl_lock();
> -       if (netdev->phydev)
> +       if (netdev->phydev && lock_phy)
>                  mutex_lock(&netdev->phydev->lock);
>          if (priv->mii_bus)
>                  mutex_lock(&priv->mii_bus->mdio_lock);
> @@ -1454,11 +1452,19 @@ static void ftgmac100_reset_task(struct 
> work_struct *work)
>    bail:
>          if (priv->mii_bus)
>                  mutex_unlock(&priv->mii_bus->mdio_lock);
> -       if (netdev->phydev)
> +       if (netdev->phydev && lock_phy)
>                  mutex_unlock(&netdev->phydev->lock);
>          rtnl_unlock();
>   }
> 
> +static void ftgmac100_reset_task(struct work_struct *work)
> +{
> +       struct ftgmac100 *priv = container_of(work, struct ftgmac100,
> +                                             reset_task);
> +
> +       ftgmac100_reset(priv, true);
> +}
> +
>   static int ftgmac100_open(struct net_device *netdev)
>   {
>          struct ftgmac100 *priv = netdev_priv(netdev);

Well, this whole patch series has already been applied, so I guess those 
comments are partially or totally moot now.
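
For reference, the idea behind the quoted diff (one reset routine that 
takes the PHY lock only when the caller does not already hold it) can be 
sketched in userspace C. This is a minimal illustration, not driver code: 
struct fake_mac and all of the fake_* functions are hypothetical stand-ins, 
a pthread mutex plays the role of phydev->lock, and a counter stands in 
for the hardware reset.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's private state (not struct ftgmac100). */
struct fake_mac {
	pthread_mutex_t phy_lock;	/* plays the role of phydev->lock */
	int reset_count;		/* counts resets instead of touching hardware */
};

static void fake_mac_init(struct fake_mac *mac)
{
	assert(mac != NULL);
	pthread_mutex_init(&mac->phy_lock, NULL);
	mac->reset_count = 0;
}

/*
 * Reset the MAC. When lock_phy is false the caller must already hold
 * phy_lock, so the lock is neither taken nor released here.
 */
static void fake_mac_reset(struct fake_mac *mac, bool lock_phy)
{
	assert(mac != NULL);

	if (lock_phy)
		pthread_mutex_lock(&mac->phy_lock);

	mac->reset_count++;	/* stand-in for re-initialising the hardware */

	if (lock_phy)
		pthread_mutex_unlock(&mac->phy_lock);
}

/* Analogue of the adjust_link callback: entered with phy_lock held. */
static void fake_adjust_link(struct fake_mac *mac)
{
	fake_mac_reset(mac, false);
}

/* Analogue of the deferred reset work item: entered with no lock held. */
static void fake_reset_task(struct fake_mac *mac)
{
	fake_mac_reset(mac, true);
}
```

A caller that already holds phy_lock goes through fake_adjust_link(), 
while every other call site goes through fake_reset_task(), so the lock is 
never dropped and re-taken in the middle of the link-change path.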

I did not receive a notification that these patches had been applied, 
which I only seem to get when Jakub applies them. So either this is 
another vger/gmail lag, which is absolutely unnerving, or it is a 
difference in process between David and Jakub, in which case it really 
ought to be fixed so that it is consistent.
-- 
Florian