[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7e8a9f73-7b30-4bcb-0cf5-bd124e7a147e@bang-olufsen.dk>
Date: Wed, 25 Aug 2021 13:40:06 +0000
From: Alvin Šipraga <ALSI@...g-olufsen.dk>
To: Saravana Kannan <saravanak@...gle.com>
CC: Vladimir Oltean <olteanv@...il.com>,
Vladimir Oltean <vladimir.oltean@....com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Jakub Kicinski <kuba@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Florian Fainelli <f.fainelli@...il.com>,
Andrew Lunn <andrew@...n.ch>,
Vivien Didelot <vivien.didelot@...il.com>,
Frank Rowand <frowand.list@...il.com>,
Rob Herring <robh+dt@...nel.org>
Subject: Re: [PATCH net] net: dsa: sja1105: fix use-after-free after calling
of_find_compatible_node, or worse
Hi Saravana,
Sorry for the delayed response.
On 8/23/21 8:50 PM, Saravana Kannan wrote:
> On Sun, Aug 22, 2021 at 7:19 AM Alvin Šipraga <ALSI@...g-olufsen.dk> wrote:
>>
>> Hi Saravana,
>>
>> Thanks for the follow-up. I tested your change and it does the trick:
>> there is no deferral and the PHY driver gets probed first-try during the
>> mdiobus registration during the call to dsa_register_switch().
>
> I'm fairly certain the mdiobus registration happens before
> dsa_register_switch(). It's in the probe call path of the DSA. The
> connecting of the PHYs with the DSA is what happens when
> dsa_register_switch() is called.
Hm, are you sure about this? My understanding is that mdiobus
registration happens as follows:
dsa_register_switch()
-> ...
-> rtl836{6,5mb}_setup() # see [1] for RFC patch with rtl8365mb
-> realtek_smi_setup_mdio()
-> of_mdiobus_register()
As it stands, dsa_register_switch() is currently called from
realtek_smi_probe(). Your patch just moves this call to
realtek_smi_sync_state(), but per the above, the mdiobus registration
happens inside dsa_register_switch(), meaning it doesn't happen in the
probe call path. Or am I missing something? I'm happy to be wrong :-)
[1] https://lore.kernel.org/netdev/20210822193145.1312668-1-alvin@pqrs.dk/T/
>
>> I tested
>> with the switch, PHY, and tagging drivers all builtin, or all modules,
>> and it worked in both cases.
>>
>> On 8/20/21 6:52 PM, Saravana Kannan wrote:
>>> Hi Alvin,
>>>
>>> Can you give this a shot to see if it fixes your issue? It basically
>>> delays the registration of dsa_register_switch() until all the
>>> consumers of this switch have probed. So it has a couple of caveats:
>>
>> Hm, weren't the only consumers the PHYs themselves? It seems like the
>> main effect of your change is that - by doing the actual
>> dsa_register_switch() call after the switch driver probe - the
>> ethernet-switch (provider) is already probed, thereby allowing the PHY
>> (consumer) to probe immediately.
>
> Correct-ish -- if you modify this to account for what I said above.
>
>>
>>> 1. I'm hoping the PHYs are the only consumers of this switch.
>>
>> In my case that is true, if you count the mdio_bus as well:
>>
>> /sys/devices/platform/ethernet-switch# ls -l consumer\:*
>> lrwxrwxrwx 1 root root 0 Aug 22 16:00
>> consumer:mdio_bus:SMI-0 ->
>> ../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0
>> lrwxrwxrwx 1 root root 0 Aug 22 16:00
>> consumer:mdio_bus:SMI-0:00 ->
>> ../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:00
>> lrwxrwxrwx 1 root root 0 Aug 22 16:00
>> consumer:mdio_bus:SMI-0:01 ->
>> ../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:01
>> lrwxrwxrwx 1 root root 0 Aug 22 16:00
>> consumer:mdio_bus:SMI-0:02 ->
>> ../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:02
>> lrwxrwxrwx 1 root root 0 Aug 22 16:00
>> consumer:mdio_bus:SMI-0:03 ->
>> ../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:03
>
> Hmm... mdio_bus being a consumer should prevent the sync_state() from
> being called on "ethernet-switch". What's the value of the "status"
> and "sync_state_only" files inside that mdio_bus folder?
Without your patch:
/sys/devices/platform/ethernet-switch# ls -l consumer\:*
lrwxrwxrwx 1 root root 0 Aug 25 13:42
consumer:mdio_bus:SMI-0 ->
../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0
/sys/devices/platform/ethernet-switch# cat consumer\:*/status
available
/sys/devices/platform/ethernet-switch# cat consumer\:*/sync_state_only
1
With your patch:
0.0.0.0@...4-:/sys/devices/platform/ethernet-switch# ls -l consumer\:*
lrwxrwxrwx 1 root root 0 Aug 25 15:03
consumer:mdio_bus:SMI-0 ->
../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0
lrwxrwxrwx 1 root root 0 Aug 25 15:03
consumer:mdio_bus:SMI-0:00 ->
../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:00
lrwxrwxrwx 1 root root 0 Aug 25 15:03
consumer:mdio_bus:SMI-0:01 ->
../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:01
lrwxrwxrwx 1 root root 0 Aug 25 15:03
consumer:mdio_bus:SMI-0:02 ->
../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:02
lrwxrwxrwx 1 root root 0 Aug 25 15:03
consumer:mdio_bus:SMI-0:03 ->
../../virtual/devlink/platform:ethernet-switch--mdio_bus:SMI-0:03
0.0.0.0@...4-:/sys/devices/platform/ethernet-switch#
0.0.0.0@...4-:/sys/devices/platform/ethernet-switch#
0.0.0.0@...4-:/sys/devices/platform/ethernet-switch# cat
consumer\:*/status
available
active
active
active
active
0.0.0.0@...4-:/sys/devices/platform/ethernet-switch# cat
consumer\:*/sync_state_only
1
0
0
0
0
Hope that helps you understand what's going on better.
BTW, I noticed that when I build realtek-smi as a module with your
patch, my kernel crashes if I unload the module. I haven't debugged this
because it was just a test, nor did I get a stacktrace. LMK if you want
more info.
>
>>> 2. All of them have to probe successfully before the switch will
>>> register itself.
>>
>> Yes.
>
> Right, it's a yes in your case. But will it be a yes for all instances
> of "realtek,rtl8366rb"?
>
>>> 3. If dsa_register_switch() fails, we can't defer the probe (because
>>> it already succeeded). But I'm not sure if it's a likely error code.
>>
>> It's of course possible that dsa_register_switch() fails. Assuming
>> fw_devlink is doing its job properly, I think the reason is most likely
>> going to be something specific to the driver, such as a communication
>> timeout with the switch hardware itself.
>
> But what if someone sets fw_devlink=permissive? Is it okay to break
> this? There are ways to make this work for fw_devlink=permissive and
> =on -- you check for each and decide where to call
> dsa_register_switch() based on that.
I am new to DSA myself so I think I am unqualified to answer whether
it's OK to break things or not.
>
>> I get the impression that you don't necessarily regard this change as a
>> proper fix, so I'm happy to do further tests if you choose to
>> investigate further.
>
> I thought about this in the background the past few days. I think
> there are a couple of options:
> 1. We (community/Andrew) agree that this driver would only work with
> fw_devlink=on and we can confirm that the other upstream uses of
> "realtek,rtl8366rb" won't have any unprobed consumers problem and
> switch to using my patch. Benefit is that it's a trivial and quick
> change that gets things working again.
> 2. The "realtek,rtl8366rb" driver needs to be fixed to use a
> "component device". A component device is a logical device that
> represents a group of other devices. It's only initialized after all
> these devices have probed successfully. The actual switch should be a
> component device and it should call dsa_register_switch() in it's
> "bind" (equivalent of probe). That way you can explicitly control what
> devices need to be probed instead of depending on sync_state() that
> have a bunch of caveats.
>
> Alvin, do you want to take up (2)?
I can give it a shot, but first:
- It seems Andrew may also need some convincing that this is the
right approach.
- Are you sure that this will solve the problem? See what I wrote
upstairs in this email.
- I have never written - nor can I recall reading - a component
driver, so I would appreciate if you could point me to an upstream
example that you think is illustrative.
There's one more issue: I do not have an RTL8366 to test on - rather, I
encountered this problem in realtek-smi while writing a new subdriver
for it to support the RTL8365MB. So any proposed fix may be perceived as
speculative if I cannot test on the '66 hardware. In that case this may
have to wait until the '65MB subdriver is accepted. Many ifs and maybes,
but don't take it to mean I'm not interested in helping. On the contrary
- I would like to fix this bug since I am affected by it!
Kind regards,
Alvin
>
> -Saravana
>
Powered by blists - more mailing lists