lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAGETcx88eFjAh7FEQr_S_WdkCCxaN82j5mnGYG-mvULBF_KYhg@mail.gmail.com>
Date: Fri, 25 Oct 2024 15:40:34 -0700
From: Saravana Kannan <saravanak@...gle.com>
To: Tomi Valkeinen <tomi.valkeinen@...asonboard.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, "Rafael J. Wysocki" <rafael@...nel.org>, 
	Linux Kernel List <linux-kernel@...r.kernel.org>, Aradhya Bhatia <aradhya.bhatia@...ux.dev>, 
	"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>, Devarsh Thakkar <devarsht@...com>
Subject: Re: fw_devlinks preventing a panel driver from probing

On Thu, Oct 24, 2024 at 3:55 PM Saravana Kannan <saravanak@...gle.com> wrote:
>
> On Wed, Oct 23, 2024 at 1:52 AM Tomi Valkeinen
> <tomi.valkeinen@...asonboard.com> wrote:
> >
> > Hi,
> >
> > On 22/10/2024 19:07, Saravana Kannan wrote:
> > > On Tue, Oct 22, 2024 at 12:51 AM Tomi Valkeinen
> > > <tomi.valkeinen@...asonboard.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 22/10/2024 02:29, Saravana Kannan wrote:
> > >>> Hi Tomi,
> > >>>
> > >>> Sorry it took a while to get back.
> > >>>
> > >>> On Mon, Sep 16, 2024 at 4:52 AM Tomi Valkeinen
> > >>> <tomi.valkeinen@...asonboard.com> wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> We have an issue where two devices have dependencies to each other,
> > >>>> according to drivers/base/core.c's fw_devlinks, and this prevents them
> > >>>> from probing. I've been adding debugging to the core.c, but so far I
> > >>>> don't quite grasp the issue, so I thought to ask. Maybe someone can
> > >>>> instantly say that this just won't work...
> > >>>>
> > >>>> So, we have two devices, DSS (display subsystem) and an LVDS panel. The
> > >>>> DSS normally outputs parallel video from its video ports (VP), but it
> > >>>> has an integrated LVDS block (OLDI, Open LVDS Display Interface). The
> > >>>> OLDI block takes input from DSS's parallel outputs. The OLDI is not
> > >>>> modeled as a separate device (neither in the DT nor in the Linux device
> > >>>> model) as it has no register space, and is controlled fully by the DSS.
> > >>>>
> > >>>> To support dual-link LVDS, the DSS has two OLDI instances. They both
> > >>>> take their input from the same parallel video port, but each OLDI sends
> > >>>> alternate lines forward. So for a dual-link setup the connections would
> > >>>> be like this:
> > >>>>
> > >>>> +-----+-----+         +-------+         +----------+
> > >>>> |     |     |         |       |         |          |
> > >>>> |     | VP1 +----+--->| OLDI0 +-------->|          |
> > >>>> |     |     |    |    |       |         |          |
> > >>>> | DSS +-----+    |    +-------+         |  Panel   |
> > >>>> |     |     |    |    |       |         |          |
> > >>>> |     | VP2 |    +--->| OLDI1 +-------->|          |
> > >>>> |     |     |         |       |         |          |
> > >>>> +-----+-----+         +-------+         +----------+
> > >>>>
> > >>>> As the OLDI is not a separate device, it also does not have an
> > >>>> independent device tree node, but rather it's inside DSS's node. The DSS
> > >>>> parallel outputs are under a normal "ports" node, but OLDI ports are
> > >>>> under "oldi-txes/ports" (see below for dts to clarify this).
> > >>>>
> > >>>> And I think (guess...) this is the root of the issue we're seeing, as it
> > >>>> means the following, one or both of which might be the reason for this
> > >>>> issue:
> > >>>>
> > >>>> - OLDI fwnodes don't have an associated struct device *. I think the
> > >>>> reason is that the OLDI media graph ports are one level too deep in the
> > >>>> hierarchy. So while the DSS ports are associated with the DSS device,
> > >>>> OLDI ports are not.
> > >>>
> > >>> This is the root cause of the issue in some sense. fw_devlink doesn't
> > >>> know that DSS depends on the VP. In the current DT, it only appears as
> > >>> if the OLDI depends on VP. See further below for the fix.
> > >>>
> > >>>>
> > >>>> - The VP ports inside the DSS point to OLDI ports, which are also inside
> > >>>> DSS. So ports from a device point to ports in the same device (and back).
> > >>>>
> > >>>> If I understand the fw_devlink code correctly, in a normal case the
> > >>>> links formed with media graphs are marked as a cycle
> > >>>> (FWLINK_FLAG_CYCLE), and then ignored as far as probing goes.
> > >>>>
> > >>>> What we see here is that when using a single-link OLDI panel, the panel
> > >>>> driver's probe never gets called, as it depends on the OLDI, and the
> > >>>> link between the panel and the OLDI is not a cycle.
> > >>>>
> > >>>> The DSS driver probes, but the probe fails as it requires all the panel
> > >>>> devices to have been probed (and thus registered to the DRM framework)
> > >>>> before it can finish its setup.
> > >>>>
> > >>>> With dual-link, probing does happen and the drivers work. But I believe
> > >>>> this is essentially an accident, in the sense that the first link
> > >>>> between the panel and the OLDI still blocks the probing, but the second
> > >>>> links allows the driver core to traverse the devlinks further, causing
> > >>>> it to mark the links to the panel as FWLINK_FLAG_CYCLE (or maybe it only
> > >>>> marks one of those links, and that's enough).
> > >>>>
> > >>>> If I set fw_devlink=off as a kernel parameter, the probing proceeds
> > >>>> successfully in both single- and dual-link cases.
> > >>>>
> > >>>> Now, my questions is, is this a bug in the driver core, a bug in the DT
> > >>>> bindings, or something in between (DT is fine-ish, but the structure is
> > >>>> something that won't be supported by the driver core).
> > >>>>
> > >>>> And a follow-up question, regardless of the answer to the first one:
> > >>>> which direction should I go from here =).
> > >>>>
> > >>>> The device tree data (simplified) for this is as follows, first the
> > >>>> dual-link case, then the single-link case:
> > >>>>
> > >>>> /* Dual-link */
> > >>>>
> > >>>> dss: dss@...00000 {
> > >>>>           compatible = "ti,am625-dss";
> > >>>>
> > >>>>           oldi-txes {
> > >>>>                   oldi0: oldi@0 {
> > >>>>                           oldi0_ports: ports {
> > >>>>                                   port@0 {
> > >>>>                                           oldi_0_in: endpoint {
> > >>>>                                                   remote-endpoint = <&dpi0_out0>;
> > >>>>                                           };
> > >>>>                                   };
> > >>>>
> > >>>>                                   port@1 {
> > >>>>                                           oldi_0_out: endpoint {
> > >>>>                                                   remote-endpoint = <&lcd_in0>;
> > >>>>                                           };
> > >>>>                                   };
> > >>>>                           };
> > >>>>                   };
> > >>>>
> > >>>>                   oldi1: oldi@1 {
> > >>>>                           oldi1_ports: ports {
> > >>>>                                   port@0 {
> > >>>>                                           oldi_1_in: endpoint {
> > >>>>                                                   remote-endpoint = <&dpi0_out1>;
> > >>>>                                           };
> > >>>>                                   };
> > >>>>
> > >>>>                                   port@1 {
> > >>>>                                           oldi_1_out: endpoint {
> > >>>>                                                   remote-endpoint = <&lcd_in1>;
> > >>>>                                           };
> > >>>>                                   };
> > >>>>                           };
> > >>>>                   };
> > >>>>           };
> > >>>>
> > >>>>           dss_ports: ports {
> > >>>>                   port@0 {
> > >>>>                           dpi0_out0: endpoint@0 {
> > >>>>                                   remote-endpoint = <&oldi_0_in>;
> > >>>>                           };
> > >>>>                           dpi0_out1: endpoint@1 {
> > >>>>                                   remote-endpoint = <&oldi_1_in>;
> > >>>>                           };
> > >>>>                   };
> > >>>>           };
> > >>>> };
> > >>>>
> > >>>> display {
> > >>>>           compatible = "microtips,mf-101hiebcaf0", "panel-simple";
> > >>>
> > >>> In here, add this new property that I added some time back.
> > >>>
> > >>> post-init-providers = <&oldi-txes>;
> > >>
> > >> Thanks! This helps:
> > >>
> > >> post-init-providers = <&oldi0>;
> > >>
> > >> or for dual-link:
> > >>
> > >> post-init-providers = <&oldi0>, <&oldi1>;
> > >>
> > >>> This tells fw_devlink that VP doesn't depend on this node for
> > >>> initialization/probing. This property is basically available to break
> > >>> cycles in DT and mark one of the edges of the cycles as "not a real
> > >>> init dependency".
> > >>>
> > >>> You should do the same for the single link case too.
> > >>
> > >> While this helps, it's not very nice... Every new DT overlay that uses
> > >> OLDI display needs to have these.
> > >
> > > Actually, taking a closer look at the DT and assuming I am visualizing
> > > it correctly in my head, fw_devlink should notice the cycle between
> > > oldi-txes and display and shouldn't block display from probing. Can
> > > you check the cycle detection code and see where it's bailing out
> > > early and not marking the fwnode link with the CYCLE flag?
> > >
> > > __fw_devlink_relax_cycles() is where you want to look. There are a
> > > bunch of debug log messages inside it and around where it's called
> > > from.
> >
> > I'm not quite sure how to read the debug messages. I have attached three
> > kernel logs, with the debug prints enabled in drivers/base/core.c. The
> > "fixed" is the one with the post-init-providers.
> >
> > I load the display drivers as modules after the main boot has happened,
> > and at the end of the logs I have the kernel prints when I load the
> > modules. The single-link.txt also shows the debugfs/devices_deferred file.
> >
> > The relevant strings to search are "dss", "oldi" and "display" (display
> > is the panel).
> >
> > So... All devlinks are supplier-consumer links. How are those created
> > with an OF media graph, as there's no clear supplier nor consumer. In
> > this particular case I see that display is marked as a consumer of oldi,
> > but also dss is marked as a consumer of oldi. Is this just, essentially,
> > random?
>
> No, the cyclic links you see are "sync state only" links. They don't
> enforce any ordering other than "sync_state()" callbacks. So, it's not
> random. In this example, it just ensures that the sync_state()
> callbacks of display and dss will only get called after both those
> devices probe. If it's confusing, try to understand what
> fw_devlink=permissive does. When we see a cycle, we just put all the
> devices in the cycle in "permissive" mode wrt each other.
>
> > Also, as there's no separate device for OLDI, I don't see oldi at all in
> > /sys/class/devlink/. But what I see there is a bit odd...
> >
> > For dual link I get:
> >
> > platform:display--platform:30200000.dss
> >
> > which, I guess, makes sense. But for single link fixed case, I don't
> > have anything there...
> >
> > > Also, can you check debugfs/devices_deferred, or the
> > > "wait_for_supplier" file under /sys/devices/..../display/ to make sure
> > > it's actually stuck on oldi-txes? Just to make cure it's not some
> > > other corner case that's triggered by oldi-txes?
> > >
> > >> I'm still confused about why this is needed. OF graphs are _always_
> > >> two-way links. Doesn't that mean that OF graphs never can be used for
> > >> dependencies, as they go both ways?
> > >
> > > Good question :) Yes, they'll always be ignored as cycles. But with
> > > post-init-providers, it's actually better to add it so that cycles are
> > > broken and better ordering is enforced. See my talk at LPC referee
> > > talk about fw_devlink to see all the benefits you get from it. :)
> >
> > Thanks for the pointer! It was interesting and I now understand the
> > whole devlink thing better, although I have to say the details still
> > escape me... =)
> >
> > Also, isn't post-init-providers describing software behavior, not
> > hardware? It doesn't sound like something we should have in the DT.
>
> Not really. In real hardware, there can't be a cycle for
> initialization. And the current DT properties don't tell us which link
> is not needed for initialization. And post-init-providers is for
> describing which dependency is not needed for hardware initialization.
>
> Take a look at the docs in the dt-schema for post-init-providers and
> that might help.
>
> > >> If so, shouldn't we just always
> > >> ignore all OF graphs for dependency checking?
> > >
> > > There are cases when two devices A and B have remote-endpoints between
> > > them and ALSO have for example a gpio dependency between them. Where
> > > the gpio is the "post-init-supplier". If we don't parse
> > > remote-endpoint and mark the cycles, cases like these don't get to
> > > probe.
> >
> > I'm sorry, I don't understand this. If we have 1) A and B with a (one
> > way) gpio dependency, and 2) A and B with a (one way) gpio dependency
> > _and_ two way media graph dependency, shouldn't the cases behave
> > identically, as the graph dependency should just be ignored?
>
> Let's assume in both cases the A and B point to each other using
> remote-endpoints. Let's also assume in both cases A says it needs a
> GPIO from B. But the real dependency for probing is that B needs A to
> probe first (due to the remote end point).
>
> The only difference between case 1 and 2 is whether fw_devlink
> supports remote-endpoint parsing.
>
> In case 1, fw_devlink has no way of knowing there is a cycle between A
> and B because it doesn't know that B depends on A (due to
> remote-endpoint). So it can't break any cycles and permanently
> prevents A and B from probing because all it sees is that it needs B
> to provide a GPIO to A.
>
> In case 2, fw_devlink sees that B also depends on A. So there's a
> cycle and marks them both as part of a cycle. And now A and B can
> probe.
>
> Also, forgot to say this last time: we need to support remote-endpoint
> to ensure sync_state() callbacks work correctly when X and Y have
> remote-endpoints between them.
>
> > Or maybe I don't understand the example case at all... Why would the
> > gpio be a post-init-supplier? Isn't gpio something you want at init time?
>
> Apparently not. There are examples of GPIOs that are used only
> acquired (get()) at run time in reaction to some event.
>
> Anyway, with all that said, I think I have a solution in mind for you
> that should allow devices to probe without post-init-providers. But
> I'll need to make the changes and then test it. Let me send it to you
> in a few days. Btw, you should always use post-init-providers to break
> cycles and do better enforcing of probe/suspend/resume/shutdown order.

Hey Tomi,

Can you give this a shot and let me know if it fixes your issue? I
have more instruction in the patch too.

https://lore.kernel.org/all/20241025223721.184998-1-saravanak@google.com/T/#u

-Saravana

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ