Message-ID: <65a1215d-dcef-4de8-9c57-065bf32ebb53@ideasonboard.com>
Date: Wed, 30 Oct 2024 16:04:26 +0200
From: Tomi Valkeinen <tomi.valkeinen@...asonboard.com>
To: Saravana Kannan <saravanak@...gle.com>,
Francesco Dolcini <francesco.dolcini@...adex.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Aradhya Bhatia <aradhya.bhatia@...ux.dev>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
Devarsh Thakkar <devarsht@...com>,
Dmitry Baryshkov <dmitry.baryshkov@...aro.org>, kernel-team@...roid.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] driver core: fw_devlink: Stop trying to optimize cycle
 detection logic

Hi,

On 30/10/2024 06:51, Saravana Kannan wrote:
> On Tue, Oct 29, 2024 at 4:21 AM Tomi Valkeinen
> <tomi.valkeinen@...asonboard.com> wrote:
>>
>> Hi,
>>
>> On 28/10/2024 22:39, Saravana Kannan wrote:
>>> On Mon, Oct 28, 2024 at 1:06 AM Tomi Valkeinen
>>> <tomi.valkeinen@...asonboard.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 26/10/2024 07:52, Saravana Kannan wrote:
>>>>> In attempting to optimize fw_devlink runtime, I introduced numerous cycle
>>>>> detection bugs by foregoing cycle detection logic under specific
>>>>> conditions. Each fix has further narrowed the conditions for optimization.
>>>>>
>>>>> It's time to give up on these optimization attempts and just run the cycle
>>>>> detection logic every time fw_devlink tries to create a device link.
>>>>>
>>>>> The specific bug report that triggered this fix involved a supplier fwnode
>>>>> that never gets a device created for it. Instead, the supplier fwnode is
>>>>> represented by the device that corresponds to an ancestor fwnode.
>>>>>
>>>>> In this case, fw_devlink didn't do any cycle detection because the cycle
>>>>> detection logic is only run when a device link is created between the
>>>>> devices that correspond to the actual consumer and supplier fwnodes.
>>>>>
>>>>> With this change, fw_devlink will run cycle detection logic even when
>>>>> creating SYNC_STATE_ONLY proxy device links from a device that is an
>>>>> ancestor of a consumer fwnode.
>>>>>
>>>>> Reported-by: Tomi Valkeinen <tomi.valkeinen@...asonboard.com>
>>>>> Closes: https://lore.kernel.org/all/1a1ab663-d068-40fb-8c94-f0715403d276@ideasonboard.com/
>>>>> Fixes: 6442d79d880c ("driver core: fw_devlink: Improve detection of overlapping cycles")
>>>>> Signed-off-by: Saravana Kannan <saravanak@...gle.com>
>>>>> ---
>>>>> Greg,
>>>>>
>>>>> I've tested this on my end and it looks ok and nothing fishy is going
>>>>> on. You can pick this up once Tomi gives a Tested-by.
>>>>
>>>> I tested this on TI AM62 SK board. It has an LVDS (OLDI) display and a
>>>> HDMI output, and both displays are connected to the same display
>>>> subsystem. I tested with OLDI single and dual link cases, with and
>>>> without HDMI, and in all cases probing works fine.
>>>>
>>>> Looks good on that front, so:
>>>>
>>>> Tested-by: Tomi Valkeinen <tomi.valkeinen@...asonboard.com>
>>>
>>> Great! Thanks!
>>>
>>>> You also asked for a diff of the devlinks. That part doesn't look so
>>>> good to me, but probably you can tell if it's normal or not.
>>>
>>> TL;DR: All device links in a cycle get marked as
>>> DL_FLAG_SYNC_STATE_ONLY and DL_FLAG_CYCLE (in addition to other
>>> flags). All DL_FLAG_SYNC_STATE_ONLY (not all of them are cycles) will
>>> get deleted after the consumer probes (since they are no longer needed
>>> after that). My guess on what's going on is that with the patch,
>>> fw_devlink found and marked more device links as cycles. Ones that in
>>> the past weren't detected as being part of a cycle but coincidentally
>>> the "post-init" dependency was the one that was getting ignored/not
>>> enforced. So the actual links in a cycle getting deleted after all the
>>> devices have probed is not a problem.
>>
>> Ok. Yep, it might all be fine. I still don't understand all that's going
>> on here, so maybe look at one more case below.
>>
>>> You can enable the "cycle" logs in drivers/base/core.c so it's easier
>>> to follow the cycles fw_devlink detected. But the logs are a bit
>>> cryptic because it tries to print all the multiple cycles that were
>>> detected using a recursive search.
>>>
>>> The non-cycle use for DL_FLAG_SYNC_STATE_ONLY is for parent devices to
>>> put a "proxy-vote" (Hey supplier, you still have a consumer that
>>> hasn't bound to a driver yet) for descendant (children, grand
>>> children) devices that haven't been created yet. So, without the fix
>>> it's possible some descendant child never got to probe and the
>>> DL_FLAG_SYNC_STATE_ONLY device link stuck around.
>>>
>>> If you can confirm all the deleted device links fall into one of these
>>> two categories, then there's no issue here. If you find cases that
>>> aren't clear, then let me know which one and point to specific nodes
>>> in an upstream DTS file and I can take a look.
>>>
>>> Every device link folder has a "sync_state_only" file that says if it
>>> has the DL_FLAG_SYNC_STATE_ONLY set. But to check for the cycle flag,
>>> you'll have to extend the debug log in device_link_add() that goes:
>>> "Linked as a sync state only consumer to......" and print the "flags" param.
>>
>> I added this print.
>>
>> I thought I'd test without any non-upstream patches, so this is booting
>> with the upstream k3-am625-sk.dtb, no overlays. I've attached boot log
>> (with this patch applied), and devlink lists, without and with this patch.
>>
>> As the OLDI stuff is not upstream, I did expect to see less diff, and
>> that is the case. It's still a somewhat interesting diff:
>>
>> $ diff devlink-broken.txt devlink-fixed.txt
>> 1d0
>> < i2c:1-0022--i2c:1-003b
>>
>> So that's the gpio expander (exp1: gpio@22 in k3-am625-sk.dts) and the
>> hdmi bridge (sii9022: bridge-hdmi@3b in k3-am62x-sk-common.dtsi). And,
>> indeed, in the log I can see:
>>
>> i2c 1-003b: Linked as a sync state only consumer to 1-0022 (flags 0x3c0)
>> /bus@...00/i2c@...10000/bridge-hdmi@3b Dropping the fwnode link to
>> /bus@...00/i2c@...10000/gpio@22
>>
>> If I'm not mistaken, the above means that the framework has decided
>> there's a (possible) probe time cyclic dependency between the gpio
>> expander and the hdmi bridge, right?
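
For reading the "(flags 0x3c0)" value in the log line above, here is a small user-space sketch that names the bits that are set. The bit positions are an assumption taken from the DL_FLAG_* definitions in include/linux/device.h as I remember them, so check them against the tree you are running. The toy_* names are made up for this example and are not kernel symbols.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed DL_FLAG_* bit layout (verify against include/linux/device.h). */
struct toy_flag_name {
	unsigned int bit;
	const char *name;
};

static const struct toy_flag_name toy_dl_flags[] = {
	{ 1u << 0, "STATELESS" },
	{ 1u << 1, "AUTOREMOVE_CONSUMER" },
	{ 1u << 2, "PM_RUNTIME" },
	{ 1u << 3, "RPM_ACTIVE" },
	{ 1u << 4, "AUTOREMOVE_SUPPLIER" },
	{ 1u << 5, "AUTOPROBE_CONSUMER" },
	{ 1u << 6, "MANAGED" },
	{ 1u << 7, "SYNC_STATE_ONLY" },
	{ 1u << 8, "INFERRED" },
	{ 1u << 9, "CYCLE" },
};

/* Write the '|'-separated names of the set bits into buf and return buf. */
static const char *toy_decode_dl_flags(unsigned int flags, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(toy_dl_flags) / sizeof(toy_dl_flags[0]); i++) {
		int n;

		if (!(flags & toy_dl_flags[i].bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", toy_dl_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output truncated, stop here */
		used += (size_t)n;
	}
	return buf;
}
```

Under that assumed layout, 0x3c0 has bits 6 to 9 set, so the link in the log carries the cycle flag in addition to being sync-state-only.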
>>
>> I don't think that makes sense, and I was trying to understand why the
>> framework has arrived at such a conclusion, but it's not clear to me.
>>
>> Also, I can see, e.g.:
>>
>> /bus@...00/i2c@...10000: cycle: depends on /bus@...00/dss@...00000
>>
>> So somehow the i2c bus has a dependency on the DSS? The DSS does not
>> depend on the i2c, but the HDMI does, so I can understand that the DSS
>> would have a dependency to i2c. But the other way around?
>
> Thanks for being persistent! :) I think you found a real issue in this patch.
> I'm squeezing these fixes late at night and between my regular work.
> So, apologies in advance for untested patches and me going with
> hunches.
>
> This part probably won't make sense, but I'm "explaining" it here just
> to record it somewhere if I or someone else comes back and looks at
> this after a few months.
>
> What's happening is that when 30200000.dss was added, 20010000.i2c was
> creating a sync state only proxy link for bridge-hdmi@3b (child of
> i2c). But when creating the proxy link, fw_devlink detected the cycle
> between bridge-hdmi@3b and 30200000.dss. That cycle is valid, but this
> patch also results in marking the "proxy" link as part of a cycle
> (when it wasn't). So it incorrectly marked i2c as being a consumer of
> dss and part of a cycle.
>
> Later on when running cycle detection logic between bridge-hdmi@3b and
> gpio@22, the cycle detection logic follows the i2c to dss link because
> it thinks the i2c really depends on dss but is part of a cycle.
>
> Try this fix on top of this patch and it should allow probing for all
> the previous broken scenarios AND should avoid dropping some links
> incorrectly.
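
The failure described above can be reproduced in a toy user-space model. This is only a sketch of the idea, not the code in drivers/base/core.c: the node names, the two flags and the three edges are simplified assumptions based on the explanation in this thread. The walk starts at the supplier (gpio@22) and follows dependencies, skipping links that are sync-state-only unless they are also marked as part of a cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the nodes named in the thread; not kernel types. */
enum toy_node { TOY_HDMI, TOY_GPIO, TOY_I2C, TOY_DSS, TOY_NUM_NODES };

#define TOY_SYNC_STATE_ONLY 0x1u	/* hypothetical, mirrors the idea only */
#define TOY_CYCLE           0x2u	/* hypothetical, mirrors the idea only */

/* "from depends on to": a supplier link, a parent link or a device link. */
struct toy_edge {
	enum toy_node from, to;
	unsigned int flags;
};

/* Depth-first walk from cur towards its suppliers, looking for target. */
static bool toy_reaches(const struct toy_edge *edges, size_t n,
			enum toy_node cur, enum toy_node target,
			bool visited[TOY_NUM_NODES])
{
	size_t i;

	assert(cur < TOY_NUM_NODES && target < TOY_NUM_NODES);
	if (cur == target)
		return true;
	if (visited[cur])
		return false;
	visited[cur] = true;

	for (i = 0; i < n; i++) {
		if (edges[i].from != cur)
			continue;
		/* A pure sync-state-only link is not a real dependency. */
		if ((edges[i].flags & TOY_SYNC_STATE_ONLY) &&
		    !(edges[i].flags & TOY_CYCLE))
			continue;
		if (toy_reaches(edges, n, edges[i].to, target, visited))
			return true;
	}
	return false;
}

/*
 * Would linking consumer bridge-hdmi@3b to supplier gpio@22 look like a
 * cycle, given the flags on the i2c -> dss proxy link?
 */
static bool toy_hdmi_gpio_cycle(unsigned int proxy_flags)
{
	const struct toy_edge edges[] = {
		{ TOY_GPIO, TOY_I2C, 0 },		/* gpio@22 is a child of i2c */
		{ TOY_I2C, TOY_DSS, proxy_flags },	/* proxy link for bridge-hdmi@3b */
		{ TOY_DSS, TOY_HDMI, 0 },		/* dss side of the valid cycle */
	};
	bool visited[TOY_NUM_NODES] = { false };

	return toy_reaches(edges, sizeof(edges) / sizeof(edges[0]),
			   TOY_GPIO, TOY_HDMI, visited);
}
```

With the proxy link flagged only as sync-state-only, the walk stops at i2c and no cycle is reported. With the cycle flag wrongly added to it, the walk continues through dss back to the bridge and reports a cycle between the bridge and the gpio expander, which matches the dropped link seen in the log.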

Thanks! With this change on top, the behavior I see is:

- "ls -1 /sys/class/devlink" is identical between a) no driver core
  fixes and b) this patch and the change below, when I test all three
  cases I have: 1) no OLDI at all (i.e. plain upstream AM62 SK), 2) OLDI
  single-link, 3) OLDI dual link.

- The OLDI single-link probes with these changes (and it didn't probe
  before these changes).

In other words, with the patch and the change below, everything seems to
work fine without any devlinks disappearing. So (this time for real):

Tested-by: Tomi Valkeinen <tomi.valkeinen@...asonboard.com>

That said, these kinds of changes scare me a bit, and I wouldn't mind
someone else testing on some other platform =).

Francesco, I think you said you have an OLDI single-link platform.
That's still a TI platform, but not the same as I have, so maybe worth
testing out.

  Tomi

> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2115,11 +2115,6 @@ static int fw_devlink_create_devlink(struct device *con,
> if (link->flags & FWLINK_FLAG_IGNORE)
> return 0;
>
> - if (con->fwnode == link->consumer)
> - flags = fw_devlink_get_flags(link->flags);
> - else
> - flags = FW_DEVLINK_FLAGS_PERMISSIVE;
> -
> /*
> * In some cases, a device P might also be a supplier to its child node
> * C. However, this would defer the probe of C until the probe of P
> @@ -2147,13 +2142,17 @@ static int fw_devlink_create_devlink(struct device *con,
> device_links_write_lock();
> if (__fw_devlink_relax_cycles(link->consumer, sup_handle)) {
> __fwnode_link_cycle(link);
> - flags = fw_devlink_get_flags(link->flags);
> pr_debug("----- cycle: end -----\n");
> pr_info("%pfwf: Fixed dependency cycle(s) with %pfwf\n",
> link->consumer, sup_handle);
> }
> device_links_write_unlock();
>
> + if (con->fwnode == link->consumer)
> + flags = fw_devlink_get_flags(link->flags);
> + else
> + flags = FW_DEVLINK_FLAGS_PERMISSIVE;
> +
> if (sup_handle->flags & FWNODE_FLAG_NOT_DEVICE)
> sup_dev = fwnode_get_next_parent_dev(sup_handle);
> else
>
> Thanks,
> Saravana
>
>>
>> Tomi
>>
>>>>
>>>> $ diff devlink-single-broken.txt devlink-single-fixed.txt
>>>
>>> I was hoping you'd give me some line count diff too to get a sense of
>>> if it's wreaking havoc or not. But based on my local testing on
>>> different hardware, I'm expecting a very small number of device links
>>> are getting affected.
>>>
>>>> 2d1
>>>> < i2c:1-0022--i2c:1-003b
>>>> 11d9
>>>> < platform:44043000.system-controller:clock-controller--platform:20010000.i2c
>>>> 27d24
>>>> < platform:44043000.system-controller:clock-controller--platform:601000.gpio
>>>> 42d38
>>>> < platform:44043000.system-controller:power-controller--platform:20010000.i2c
>>>> 58d53
>>>> < platform:44043000.system-controller:power-controller--platform:601000.gpio
>>>> 74d68
>>>> < platform:4d000000.mailbox--platform:44043000.system-controller
>>>
>>> I took a quick look at this one in
>>> arch/arm64/boot/dts/ti/k3-am62-main.dtsi which I assume is part of the
>>> device you are testing on and I couldn't find a cycle. But with dtsi
>>> and dts files, it's hard to find these manually. Let me know if
>>> fw_devlink is thinking there's a cycle where there is none.
>>>
>>>> 76d69
>>>> < platform:601000.gpio--i2c:1-0022
>>>> 80d72
>>>> < platform:bus@...00:interrupt-controller@...000--platform:601000.gpio
>>>> 82d73
>>>> < platform:f4000.pinctrl--i2c:1-0022
>>>> 84d74
>>>> < platform:f4000.pinctrl--platform:20010000.i2c
>>>>
>>>> "i2c:1-003b" is the hdmi bridge, "i2c:1-0022" is a gpio expander. So,
>>>> for example, we lose the devlink between the gpio expander and the hdmi
>>>> bridge. The expander is used for interrupts. There's an interrupt line
>>>> from the HDMI bridge to the expander, and from there there's an
>>>> interrupt line going to the SoC.
>>>>
>>>> Also, I noticed the devlinks change if I load the display drivers. The
>>>> above is before loading. Comparing the loaded/not-loaded:
>>>
>>> Yeah, DL_FLAG_SYNC_STATE_ONLY device links vanishing as more devices
>>> probe is not a problem. That's working as intended.
>>>
>>> Thanks,
>>> Saravana
>>>
>>>>
>>>> $ diff devlink-dual-fixed.txt devlink-dual-fixed-loaded.txt
>>>> 3d2
>>>> < i2c:1-003b--platform:30200000.dss
>>>> 23d21
>>>> < platform:44043000.system-controller:clock-controller--platform:30200000.dss
>>>> 52d49
>>>> < platform:44043000.system-controller:power-controller--platform:30200000.dss
>>>> 73d69
>>>> < platform:display--platform:30200000.dss
>>>> 78d73
>>>> < platform:f4000.pinctrl--platform:30200000.dss
>>>> 97a93
>>>> > regulator:regulator.0--platform:display
>>>>
>>>> Tomi
>>>>
>>>>
>>>>> Thanks,
>>>>> Saravana
>>>>>
>>>>> v1 -> v2:
>>>>> - Removed the RFC tag
>>>>> - Renamed the subject. v1 is https://lore.kernel.org/all/20241025223721.184998-1-saravanak@google.com/T/#u
>>>>> - Added a NULL check to avoid NULL pointer deref
>>>>>
>>>>> drivers/base/core.c | 46 ++++++++++++++++++++-------------------------
>>>>> 1 file changed, 20 insertions(+), 26 deletions(-)
>>>>>
>>>>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>>>>> index 3b13fed1c3e3..f96f2e4c76b4 100644
>>>>> --- a/drivers/base/core.c
>>>>> +++ b/drivers/base/core.c
>>>>> @@ -1990,10 +1990,10 @@ static struct device *fwnode_get_next_parent_dev(const struct fwnode_handle *fwn
>>>>> *
>>>>> * Return true if one or more cycles were found. Otherwise, return false.
>>>>> */
>>>>> -static bool __fw_devlink_relax_cycles(struct device *con,
>>>>> +static bool __fw_devlink_relax_cycles(struct fwnode_handle *con_handle,
>>>>> struct fwnode_handle *sup_handle)
>>>>> {
>>>>> - struct device *sup_dev = NULL, *par_dev = NULL;
>>>>> + struct device *sup_dev = NULL, *par_dev = NULL, *con_dev = NULL;
>>>>> struct fwnode_link *link;
>>>>> struct device_link *dev_link;
>>>>> bool ret = false;
>>>>> @@ -2010,22 +2010,22 @@ static bool __fw_devlink_relax_cycles(struct device *con,
>>>>>
>>>>> sup_handle->flags |= FWNODE_FLAG_VISITED;
>>>>>
>>>>> - sup_dev = get_dev_from_fwnode(sup_handle);
>>>>> -
>>>>> /* Termination condition. */
>>>>> - if (sup_dev == con) {
>>>>> + if (sup_handle == con_handle) {
>>>>> pr_debug("----- cycle: start -----\n");
>>>>> ret = true;
>>>>> goto out;
>>>>> }
>>>>>
>>>>> + sup_dev = get_dev_from_fwnode(sup_handle);
>>>>> + con_dev = get_dev_from_fwnode(con_handle);
>>>>> /*
>>>>> * If sup_dev is bound to a driver and @con hasn't started binding to a
>>>>> * driver, sup_dev can't be a consumer of @con. So, no need to check
>>>>> * further.
>>>>> */
>>>>> if (sup_dev && sup_dev->links.status == DL_DEV_DRIVER_BOUND &&
>>>>> - con->links.status == DL_DEV_NO_DRIVER) {
>>>>> + con_dev && con_dev->links.status == DL_DEV_NO_DRIVER) {
>>>>> ret = false;
>>>>> goto out;
>>>>> }
>>>>> @@ -2034,7 +2034,7 @@ static bool __fw_devlink_relax_cycles(struct device *con,
>>>>> if (link->flags & FWLINK_FLAG_IGNORE)
>>>>> continue;
>>>>>
>>>>> - if (__fw_devlink_relax_cycles(con, link->supplier)) {
>>>>> + if (__fw_devlink_relax_cycles(con_handle, link->supplier)) {
>>>>> __fwnode_link_cycle(link);
>>>>> ret = true;
>>>>> }
>>>>> @@ -2049,7 +2049,7 @@ static bool __fw_devlink_relax_cycles(struct device *con,
>>>>> else
>>>>> par_dev = fwnode_get_next_parent_dev(sup_handle);
>>>>>
>>>>> - if (par_dev && __fw_devlink_relax_cycles(con, par_dev->fwnode)) {
>>>>> + if (par_dev && __fw_devlink_relax_cycles(con_handle, par_dev->fwnode)) {
>>>>> pr_debug("%pfwf: cycle: child of %pfwf\n", sup_handle,
>>>>> par_dev->fwnode);
>>>>> ret = true;
>>>>> @@ -2067,7 +2067,7 @@ static bool __fw_devlink_relax_cycles(struct device *con,
>>>>> !(dev_link->flags & DL_FLAG_CYCLE))
>>>>> continue;
>>>>>
>>>>> - if (__fw_devlink_relax_cycles(con,
>>>>> + if (__fw_devlink_relax_cycles(con_handle,
>>>>> dev_link->supplier->fwnode)) {
>>>>> pr_debug("%pfwf: cycle: depends on %pfwf\n", sup_handle,
>>>>> dev_link->supplier->fwnode);
>>>>> @@ -2140,25 +2140,19 @@ static int fw_devlink_create_devlink(struct device *con,
>>>>> return -EINVAL;
>>>>>
>>>>> /*
>>>>> - * SYNC_STATE_ONLY device links don't block probing and supports cycles.
>>>>> - * So, one might expect that cycle detection isn't necessary for them.
>>>>> - * However, if the device link was marked as SYNC_STATE_ONLY because
>>>>> - * it's part of a cycle, then we still need to do cycle detection. This
>>>>> - * is because the consumer and supplier might be part of multiple cycles
>>>>> - * and we need to detect all those cycles.
>>>>> + * Don't try to optimize by not calling the cycle detection logic under
>>>>> + * certain conditions. There's always some corner case that won't get
>>>>> + * detected.
>>>>> */
>>>>> - if (!device_link_flag_is_sync_state_only(flags) ||
>>>>> - flags & DL_FLAG_CYCLE) {
>>>>> - device_links_write_lock();
>>>>> - if (__fw_devlink_relax_cycles(con, sup_handle)) {
>>>>> - __fwnode_link_cycle(link);
>>>>> - flags = fw_devlink_get_flags(link->flags);
>>>>> - pr_debug("----- cycle: end -----\n");
>>>>> - dev_info(con, "Fixed dependency cycle(s) with %pfwf\n",
>>>>> - sup_handle);
>>>>> - }
>>>>> - device_links_write_unlock();
>>>>> + device_links_write_lock();
>>>>> + if (__fw_devlink_relax_cycles(link->consumer, sup_handle)) {
>>>>> + __fwnode_link_cycle(link);
>>>>> + flags = fw_devlink_get_flags(link->flags);
>>>>> + pr_debug("----- cycle: end -----\n");
>>>>> + pr_info("%pfwf: Fixed dependency cycle(s) with %pfwf\n",
>>>>> + link->consumer, sup_handle);
>>>>> }
>>>>> + device_links_write_unlock();
>>>>>
>>>>> if (sup_handle->flags & FWNODE_FLAG_NOT_DEVICE)
>>>>> sup_dev = fwnode_get_next_parent_dev(sup_handle);
>>>>