netdev - Re: NET: r8168/r8169 identifying fix

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56ab5e00-a26e-de3e-3002-3e5b2dfb2009@gmail.com>
Date:   Wed, 11 Aug 2021 16:17:39 +0300
From:   "Late @ Gmail" <ljakku77@...il.com>
To:     Heiner Kallweit <hkallweit1@...il.com>
Cc:     Leon Romanovsky <leon@...nel.org>, netdev@...r.kernel.org,
        nic_swsd@...ltek.com
Subject: Re: NET: r8168/r8169 identifying fix

Hi All,

Ok, I will not send patch with the hacky-stuff anymore. I do realize the hacky stuff is not acceptable, but it is a starting point

.. you gotta start from somewhere, and get some comments about the patch.


That means that I need some help, what i'm trying to grasp is what the patch does, so that the driver

works with my HW (witch needs the patch (hacky yeah) to work out of the boot ok).


The modprobe reload does something to driver/hw/??? that it is ok, when the module's probe return zero at all times.

.. that is hacky partly reason: I'm not familiar with kernel's code base yet -> It may seem hacky for many, but i'm

trying to get picture in my head what and why works like it works.


The hacky-bit works for me -> in principle the way to fix in concept is ok, now just to make it non-hacky and align

with current architechture and design of overall kernel codebase.


hmm, i'm thinking of debugging the module's reloading steps with stacktraces to figure out more and

possibly make quirk bit(s) for re-probe within initial module load & handle unloading properly in

between.


So basicly what i'm trying to reproduce in boot-time is:


    modprobe -r r8169

    modprobe r8169


Sequence.


--Lauri

On 11.8.2021 9.09, Heiner Kallweit wrote:
> On 10.08.2021 23:50, Late @ Gmail wrote:
>> Hi,
>>
>> Now I solved the reloading issue, and r8169 driver works line charm.
>>
> This patch is a complete hack and and in parts simply wrong.
> In addition it misses the basics of how to submit a patch.
> Quality hasn't improved since your first attempt, so better stop
> trying to submit this to mainline.
>
>> Patch:
>>
>> From: Lauri Jakku <lja@....fi>
>> Date: Mon, 9 Aug 2021 21:44:53 +0300
>> Subject: [PATCH] net:realtek:r8169 driver load fix
>>
>>    net:realtek:r8169 driver load fix
>>
>>      Problem:
>>
>>        The problem is that (1st load) fails, but there is valid
>>        HW found (the ID is known) and this patch is applied, the second
>>        time of loading module works ok, and network is connected ok
>>        etc.
>>
>>      Solution:
>>
>>        The driver will trust the HW that reports valid ID and then make
>>        re-loading of the module as it would being reloaded manually.
>>
>>        I do check that if the HW id is read ok from the HW, then pass
>>        -EAGAIN ja try to load 5 times, sleeping 250ms in between.
>>
>> Signed-off-by: Lauri Jakku <lja@....fi>
>> diff --git a/linux-5.14-rc4/drivers/net/ethernet/realtek/r8169_main.c b/linux-5.14-rc4/drivers/net/ethernet/realtek/r8169_main.c
>> index c7af5bc3b..d8e602527 100644
>> --- a/linux-5.14-rc4/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/linux-5.14-rc4/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -634,6 +634,8 @@ struct rtl8169_private {
>>      struct rtl_fw *rtl_fw;
>>  
>>      u32 ocp_base;
>> +
>> +    int retry_probeing;
>>  };
>>  
>>  typedef void (*rtl_generic_fct)(struct rtl8169_private *tp);
>> @@ -5097,13 +5099,16 @@ static int r8169_mdio_register(struct rtl8169_private *tp)
>>      tp->phydev = mdiobus_get_phy(new_bus, 0);
>>      if (!tp->phydev) {
>>          return -ENODEV;
>> -    } else if (!tp->phydev->drv) {
>> -        /* Most chip versions fail with the genphy driver.
>> -         * Therefore ensure that the dedicated PHY driver is loaded.
>> +    } else if (tp->phydev->phy_id != RTL_GIGA_MAC_NONE) {
> You compare two completely different things here. The phy_id has nothing
> to do with the chip version enum.
>
>> +        /* Most chip versions fail with the genphy driver, BUT do rerport valid IW
>> +         * ID. Re-starting the module seem to fix the issue of non-functional driver.
>>           */
>> -        dev_err(&pdev->dev, "no dedicated PHY driver found for PHY ID 0x%08x, maybe realtek.ko needs to be added to initramfs?\n",
>> +        dev_err(&pdev->dev,
>> +            "no dedicated driver, but HW found: PHY PHY ID 0x%08x\n",
>>              tp->phydev->phy_id);
>> -        return -EUNATCH;
>> +
>> +        dev_err(&pdev->dev, "trying re-probe few times..\n");
>> +
>>      }
>>  
>>      tp->phydev->mac_managed_pm = 1;
>> @@ -5250,6 +5255,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>      enum mac_version chipset;
>>      struct net_device *dev;
>>      u16 xid;
>> +    int savederr = 0;
>>  
>>      dev = devm_alloc_etherdev(&pdev->dev, sizeof (*tp));
>>      if (!dev)
>> @@ -5261,6 +5267,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>      tp->dev = dev;
>>      tp->pci_dev = pdev;
>>      tp->supports_gmii = ent->driver_data == RTL_CFG_NO_GBIT ? 0 : 1;
>> +    tp->retry_probeing = 0;
>>      tp->eee_adv = -1;
>>      tp->ocp_base = OCP_STD_PHY_BASE;
>>  
>> @@ -5410,7 +5417,15 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>  
>>      pci_set_drvdata(pdev, tp);
>>  
>> -    rc = r8169_mdio_register(tp);
>> +    savederr = r8169_mdio_register(tp);
>> +
>> +    if (
>> +        (tp->retry_probeing > 0) &&
>> +        (savederr == -EAGAIN)
>> +       ) {
>> +        netdev_info(dev, " retry of probe requested..............");
>> +    }
>> +
>>      if (rc)
>>          return rc;
>>  
>> @@ -5435,6 +5450,14 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>      if (pci_dev_run_wake(pdev))
>>          pm_runtime_put_sync(&pdev->dev);
>>  
>> +    if (
>> +        (tp->retry_probeing > 0) &&
>> +        (savederr == -EAGAIN)
>> +       ) {
>> +        netdev_info(dev, " retry of probe requested..............");
>> +        return savederr;
> You can not simply return here. You have to clean up.
>
>> +    }
>> +
>>      return 0;
>>  }
>>  
>> diff --git a/linux-5.14-rc4/drivers/net/phy/phy_device.c b/linux-5.14-rc4/drivers/net/phy/phy_device.c
>> index 5d5f9a9ee..59c6ac031 100644
> No mixing of changes in phylib and drivers.
>
>> --- a/linux-5.14-rc4/drivers/net/phy/phy_device.c
>> +++ b/linux-5.14-rc4/drivers/net/phy/phy_device.c
>> @@ -2980,6 +2980,9 @@ struct fwnode_handle *fwnode_get_phy_node(struct fwnode_handle *fwnode)
>>  }
>>  EXPORT_SYMBOL_GPL(fwnode_get_phy_node);
>>  
>> +
>> +static int phy_remove(struct device *dev);
>> +
> No forward declarations.
>
>>  /**
>>   * phy_probe - probe and init a PHY device
>>   * @dev: device to probe and init
>> @@ -2988,13 +2991,22 @@ EXPORT_SYMBOL_GPL(fwnode_get_phy_node);
>>   *   set the state to READY (the driver's init function should
>>   *   set it to STARTING if needed).
>>   */
>> +#define REDO_PROBE_TIMES    5
>>  static int phy_probe(struct device *dev)
>>  {
>>      struct phy_device *phydev = to_phy_device(dev);
>>      struct device_driver *drv = phydev->mdio.dev.driver;
>>      struct phy_driver *phydrv = to_phy_driver(drv);
>> +    int again = 0;
>> +    int savederr = 0;
>> +again_retry:
>>      int err = 0;
>>  
>> +    if (again > 0) {
>> +        pr_err("%s: Re-probe %d of driver.....\n",
>> +               phydrv->name, again);
>> +    }
>> +
>>      phydev->drv = phydrv;
>>  
>>      /* Disable the interrupt if the PHY doesn't support it
>> @@ -3013,6 +3025,17 @@ static int phy_probe(struct device *dev)
>>  
>>      if (phydev->drv->probe) {
>>          err = phydev->drv->probe(phydev);
>> +
>> +        /* If again requested. */
>> +        if (err == -EAGAIN) {
> This doesn't make sense. You check the PHY driver probe return code,
> mixing it up with the MAC driver return code.
>
>> +            again++;
>> +            savederr = err;
>> +            err = 0;
>> +
>> +            pr_info("%s: EAGAIN: %d ...\n",
>> +                phydrv->name, again);
>> +        }
>> +
>>          if (err)
>>              goto out;
>>      }
>> @@ -3081,6 +3104,20 @@ static int phy_probe(struct device *dev)
>>  
>>      mutex_unlock(&phydev->lock);
>>  
>> +    if ((savederr == -EAGAIN) &&
>> +        ((again > 0) && (again < REDO_PROBE_TIMES))
>> +       ) {
>> +        pr_err("%s: Retry removal driver..\n",
>> +            phydrv->name);
>> +
>> +        phy_remove(dev);
>> +
>> +        pr_err("%s: Re-probe driver..\n",
>> +            phydrv->name);
>> +        savederr = 0;
>> +        goto again_retry;
>> +    }
>> +
>>      return err;
>>  }
>>  
>> @@ -3108,6 +3145,7 @@ static int phy_remove(struct device *dev)
>>      return 0;
>>  }
>>  
>> +
>>  static void phy_shutdown(struct device *dev)
>>  {
>>      struct phy_device *phydev = to_phy_device(dev);
>>
>>
>>
>> On 11.3.2021 18.43, gmail wrote:
>>> Heiner Kallweit kirjoitti 11.3.2021 klo 18.23:
>>>> On 11.03.2021 17:00, gmail wrote:
>>>>> 15. huhtik. 2020, 19.18, Heiner Kallweit <hkallweit1@...il.com <mailto:hkallweit1@...il.com>> kirjoitti:
>>>>>
>>>>>     On 15.04.2020 16:39, Lauri Jakku wrote:
>>>>>
>>>>>         Hi, There seems to he Something odd problem, maybe timing
>>>>>         related. Stripped version not workingas expected. I get back to
>>>>>         you, when  i have it working.
>>>>>
>>>>>     There's no point in working on your patch. W/o proper justification it
>>>>>     isn't acceptable anyway. And so far we still don't know which problem
>>>>>     you actually have.
>>>>>     FIRST please provide the requested logs and explain the actual problem
>>>>>     (incl. the commit that caused the regression).
>>>>>
>>>>>
>>>>>              13. huhtik. 2020, 14.46, Lauri Jakku <ljakku77@...il.com
>>>>>         <mailto:ljakku77@...il.com>> kirjoitti: Hi, Fair enough, i'll
>>>>>         strip them. -lja On 2020-04-13 14:34, Leon Romanovsky wrote: On
>>>>>         Mon, Apr 13, 2020 at 02:02:01PM +0300, Lauri Jakku wrote: Hi,
>>>>>         Comments inline. On 2020-04-13 13:58, Leon Romanovsky wrote: On
>>>>>         Mon, Apr 13, 2020 at 01:30:13PM +0300, Lauri Jakku wrote: From
>>>>>         2d41edd4e6455187094f3a13d58c46eeee35aa31 Mon Sep 17 00:00:00
>>>>>         2001 From: Lauri Jakku <lja@....fi> Date: Mon, 13 Apr 2020
>>>>>         13:18:35 +0300 Subject: [PATCH] NET: r8168/r8169 identifying fix
>>>>>         The driver installation determination made properly by checking
>>>>>         PHY vs DRIVER id's. ---
>>>>>         drivers/net/ethernet/realtek/r8169_main.c | 70
>>>>>         ++++++++++++++++++++--- drivers/net/phy/mdio_bus.c | 11 +++- 2
>>>>>         files changed, 72 insertions(+), 9 deletions(-) I would say that
>>>>>         most of the code is debug prints. I tought that they are helpful
>>>>>         to keep, they are using the debug calls, so they are not visible
>>>>>         if user does not like those. You are missing the point of who
>>>>>         are your users. Users want to have working device and the code.
>>>>>         They don't need or like to debug their kernel. Thanks
>>>>>
>>>>>     Hi, now i got time to tackle with this again :) .. I know the proposed fix is quite hack, BUT it does give a clue what is wrong.
>>>>>
>>>>>     Something in subsystem is not working at the first time, but it needs to be reloaded to work ok (second time). So what I will do
>>>>>     is that I try out re-do the module load within the module, if there is known HW id available but driver is not available, that
>>>>>     would be much nicer and user friendly way.
>>>>>
>>>>>
>>>>>     When the module setup it self nicely on first load, then can be the hunt for late-init of subsystem be checked out. Is the HW
>>>>>     not brought up correct way during first time, or does the HW need time to brough up, or what is the cause.
>>>>>
>>>>>     The justification is the same as all HW driver bugs, the improvement is always better to take in. Or do this patch have some-
>>>>>     thing what other patches do not?
>>>>>
>>>>>     Is there legit reason why NOT to improve something, that is clearly issue for others also than just me ? I will take on the
>>>>>     task to fiddle with the module to get it more-less hacky and fully working version. Without the need for user to do something
>>>>>     for the module to work.
>>>>>
>>>>>         --Lauri J.
>>>>>
>>>>>
>>>> I have no clue what you're trying to say. The last patch wasn't acceptable at all.
>>>> If you want to submit a patch:
>>>>
>>>> - Follow kernel code style
>>>> - Explain what the exact problem is, what the root cause is, and how your patch fixes it
>>>> - Explain why you're sure that it doesn't break processing on other chip versions
>>>>    and systems.
>>>>
>>> Ok, i'll make nice patch that has in comment what is the problem and how does the patch help the case at hand.
>>>
>>> I don't know the rootcause, but something in subsystem that possibly is initializing bit slowly, cause the reloading
>>>
>>> of the module provides working network connection, when done via insmod cycle. I'm not sure is it just a timing
>>>
>>> issue or what. I'd like to check where is the driver pointer populated, and put some debugs to see if the issue is just
>>>
>>> timing, let's see.
>>>
>>>
>>> The problem is that (1st load) fails, but there is valid HW found (the ID is known), when the hacky patch of mine
>>>
>>> is applied, the second time of loading module works ok, and network is connected ok etc.
>>>
>>>
>>> I make the change so that when the current HEAD code is going to return failure, i do check that if the HW id is read ok
>>>
>>> from the HW, then pass -EAGAIN ja try to load 5 times, sleeping 250ms in between.
>>>
>>>
>>> --Lauri J.
>>>
>>>
>>>
>>>