linux-kernel - Re: drm/msm: Second DisplayPort regression in 6.8-rc1

Message-ID: <2df31f2d-8271-d966-158a-27c6e0581d72@quicinc.com>
Date: Tue, 20 Feb 2024 13:19:54 -0800
From: Abhinav Kumar <quic_abhinavk@...cinc.com>
To: Johan Hovold <johan@...nel.org>, Rob Clark <robdclark@...il.com>,
        "Dmitry Baryshkov" <dmitry.baryshkov@...aro.org>,
        Kuogee Hsieh <quic_khsieh@...cinc.com>
CC: Sean Paul <sean@...rly.run>,
        Marijn Suijten <marijn.suijten@...ainline.org>,
        David Airlie <airlied@...il.com>, "Daniel Vetter" <daniel@...ll.ch>,
        Bjorn Andersson <quic_bjorande@...cinc.com>,
        <quic_jesszhan@...cinc.com>, <quic_sbillaka@...cinc.com>,
        <dri-devel@...ts.freedesktop.org>, <freedreno@...ts.freedesktop.org>,
        <linux-arm-msm@...r.kernel.org>, <regressions@...ts.linux.dev>,
        <linux-kernel@...r.kernel.org>
Subject: Re: drm/msm: Second DisplayPort regression in 6.8-rc1

Hi Johan

On 2/19/2024 2:41 AM, Johan Hovold wrote:
> On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
>> On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
>>> On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
> 
>> Since Dmitry had trouble reproducing this issue I took a closer look at
>> the DRM aux bridge series that Abhinav pointed and was able to track
>> down the bridge regressions and come up with a reproducer. I just posted
>> a series fixing this here:
>>
>> 	https://lore.kernel.org/lkml/20240217150228.5788-1-johan+linaro@kernel.org/
>>
>> As I mentioned in the cover letter, I am still seeing intermittent hard
>> resets around the time that the DRM subsystem is initialising, which
>> suggests that we may be dealing with two separate DRM regressions here
>> however.
>>
>> If the hard resets are triggered by something like unclocked hardware,
>> perhaps that bit could be related to the runtime PM rework?
> 
> It seems my initial suspicion that at least some of these regressions
> were related to the runtime PM work was correct. The hard resets happen
> when the DP controller is runtime suspended after being probed:
> 
> [   16.748475] bus: 'platform': __driver_probe_device: matched device ae00000.display-subsystem with driver msm-mdss
> [   16.759444] msm-mdss ae00000.display-subsystem: Adding to iommu group 21
> [   16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
> [   16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
> [   16.821552] bus: 'platform': __driver_probe_device: matched device ae90000.displayport-controller with driver msm-dp-display
> [   16.837749] probe of ae90000.displayport-controller returned -517 after 1 usecs
> [  OK  ] Listening on Load/Save RF Kill Swit[   16.854659] bus: 'platform': __dch Status /dev/rfkill Watch.
> [   16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs
> [   16.880012] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [   16.891856] probe of aea0000.displayport-controller returned -517 after 2 usecs
> [   16.903825] probe of ae00000.display-subsystem returned 0 after 144497 usecs
> [   16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
> [   16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
>           Starting Load/Save Screen Backligh…rightness[   16.959146] bus: 'platform': _ of backlight:backlight...
> [   16.995355] msm-dp-display ae90000.displayport-controller: dp_display_probe - probe tail
> [   17.004032] probe of ae90000.displayport-controller returned 0 after 30225 usecs
> [   17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display
> [   17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail
>           Starting Network Name Resolution...
> [   17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs
> [   17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [        Starting Network Time Synchronization...
> [   17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> [   17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
>           Starting Record System Boot/Shutdown in UTMP...
>           Starting Virtual Console Setup...
> [  OK  ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
> [   17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> [   17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> Log Type: B - Since Boot(Power On Reset),  D - Delta,  S - Statistic
> S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> S - IMAGE_VARIANT_STRING=SocMakenaWP
> S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> 
>    < machine is reset by hypervisor >
> 
> Presumably the reset happens when the controller is being shut down while
> still being used by the EFI framebuffer.
> 

I am not sure we can conclude that. Even if we shut off the controller
while the framebuffer was still being fetched, that should only cause a
blank screen and not a reset, because we do not trigger any new register
read or write while it is fetching, so there is no new hardware access.

One thing I must accept is that there are two differences between 
sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks 
where we tested it more thoroughly without any such issues:

1) with the chromebooks we have depthcharge and not the QC UEFI.

If we are suspecting a hand-off issue here, will it help if we try to 
disable the display in EFI by using "fastboot oem select-display-panel 
none" (assuming this is a fastboot enabled device) and see if you still 
hit the reset issue?

2) The chromebooks used "internal_hpd", whereas the sc8280xp uses the
pmic_glink method.

I am still checking whether this difference leaves any code paths in the
eDP/DP driver exposed with pm_runtime that could cause the reset. I am
also wondering whether some sort of drm tracing would help narrow down
the reset point.

> In the cases where the machine survives boot, the controller is never
> suspended.
> 
> When investigating this I've also seen intermittent:
> 
> 	[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
> 

I think this error occurs because dp_parser_parse() --->
dp_parser_ctrl_res() also calls devm_phy_get().

This can return -EPROBE_DEFER if the phy driver has not yet probed.

I checked the rest of dp_parser_parse(); the other calls seem to be
purely DT parsing except this one. To avoid the confusion, I think we
should move devm_phy_get() out of DT parsing into a separate call, or at
least add an error log to the devm_phy_get() failure path below to
indicate that it deferred:

         io->phy = devm_phy_get(&pdev->dev, "dp");
         if (IS_ERR(io->phy))
                 return PTR_ERR(io->phy);

If my hypothesis is correct on this, then this error log (even though 
misleading) should be harmless for this issue because if we hit 
DRM_ERROR("device tree parsing failed\n"); we will skip the 
devm_pm_runtime_enable().
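
To illustrate the two points above, here is a rough userspace sketch (not
kernel code, and not a proposed patch). The mock_* functions, struct
mock_dp and phy_get_err_msg() are made-up stand-ins for devm_phy_get(),
dp_parser_ctrl_res() and dp_display_probe(); only the value 517 for
EPROBE_DEFER is taken from the kernel, matching the -517 in the probe log
above. It shows a failure path that names the deferral explicitly, and a
probe that returns before runtime PM is enabled when parsing fails. In the
driver itself, dev_err_probe() would probably be the natural way to get
this kind of message.

```c
#include <stdio.h>

/* Kernel-internal errno for deferred probe; this is the -517 in the log. */
#define EPROBE_DEFER 517

/* Hypothetical driver state, for illustration only. */
struct mock_dp {
	int phy_ready;          /* 1 once the (mock) phy driver has probed */
	int runtime_pm_enabled; /* set only if probe reaches the enable step */
	char log[96];           /* last error message, empty on success */
};

/* Hypothetical stand-in for devm_phy_get(): 0 or a negative errno. */
static int mock_phy_get(const struct mock_dp *dp)
{
	return dp->phy_ready ? 0 : -EPROBE_DEFER;
}

/* Pick a message that tells a deferral apart from a real failure. */
static const char *phy_get_err_msg(int err)
{
	if (err == -EPROBE_DEFER)
		return "phy not ready, deferring probe";
	return "failed to get phy";
}

/* Hypothetical stand-in for dp_parser_ctrl_res(): log at the phy step. */
static int mock_parse_ctrl_res(struct mock_dp *dp)
{
	int ret = mock_phy_get(dp);

	if (ret) {
		snprintf(dp->log, sizeof(dp->log), "%s (%d)",
			 phy_get_err_msg(ret), ret);
		return ret;
	}
	return 0;
}

/* Hypothetical stand-in for dp_display_probe(). */
static int mock_probe(struct mock_dp *dp)
{
	int ret;

	dp->runtime_pm_enabled = 0;
	dp->log[0] = '\0';

	ret = mock_parse_ctrl_res(dp);
	if (ret)
		return ret; /* bail out before runtime PM is enabled */

	dp->runtime_pm_enabled = 1; /* devm_pm_runtime_enable() would go here */
	return 0;
}
```

Calling mock_probe() with phy_ready set to 0 returns -517 and leaves
runtime_pm_enabled at 0, so a deferred probe in this sketch never reaches
the runtime PM step; calling it again with phy_ready set to 1 succeeds.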

> which also appears to be related to the runtime PM rework:
> 
> 	https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khsieh@quicinc.com/
> 
> I believe this is enough evidence to conclude that this second
> regression is introduced by commit 5814b8bf086a ("drm/msm/dp:
> incorporate pm_runtime framework into DP driver"):
> 
> #regzbot introduced: 5814b8bf086a
> 
> Has anyone given some thought to how the framebuffer handover is
> supposed to work? It seems we're currently just relying on luck with
> timing.
> 


> Johan