Message-ID: <b1ae6e39-10c3-0ee1-11f4-3436c3e4ec1a@quicinc.com>
Date: Mon, 11 Mar 2024 09:51:29 -0700
From: Abhinav Kumar <quic_abhinavk@...cinc.com>
To: Johan Hovold <johan@...nel.org>
CC: Rob Clark <robdclark@...il.com>,
    Dmitry Baryshkov <dmitry.baryshkov@...aro.org>,
    Kuogee Hsieh <quic_khsieh@...cinc.com>,
    Daniel Thompson <daniel.thompson@...aro.org>,
    Sean Paul <sean@...rly.run>,
    Marijn Suijten <marijn.suijten@...ainline.org>,
    David Airlie <airlied@...il.com>,
    Daniel Vetter <daniel@...ll.ch>,
    Bjorn Andersson <quic_bjorande@...cinc.com>,
    <quic_jesszhan@...cinc.com>, <quic_sbillaka@...cinc.com>,
    <dri-devel@...ts.freedesktop.org>, <freedreno@...ts.freedesktop.org>,
    <linux-arm-msm@...r.kernel.org>, <regressions@...ts.linux.dev>,
    <linux-kernel@...r.kernel.org>
Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
On 3/11/2024 6:43 AM, Johan Hovold wrote:
> On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
>> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
>>> On 3/8/2024 4:43 AM, Johan Hovold wrote:
>>
>>> For this last remaining reset with the stack trace you have mentioned
>>> below, I do not think this was introduced by the PM runtime series. We
>>> have had this report earlier, with the same stack trace, on earlier
>>> kernels which didn't have PM runtime support:
>>
>>> But it seems there is another race condition in this code resulting in
>>> this issue again.
>>>
>>> I posted my analysis along with the patch as an RFC yesterday:
>>>
>>> https://patchwork.freedesktop.org/patch/581758/
>>>
>>> I missed CCing you and Bjorn on the RFC, but when I post it as a patch
>>> today or tomorrow I will CC you both.
>>
>> Ok, thanks. I'll take a closer look at that next week. It's not clear to
>> me what that race looks like after reading the commit message. It would
>> be good to have some more details in there (e.g. the sequence of events
>> that breaks the state machine) and some way to reproduce the issue
>> reliably.
>
> I was able to reproduce the reset with some of my debug printks in place
> after reapplying the reverted hpd notify change, so I now have an
> explanation for (one of) the ways we can end up in this state.
>
> This does not match the description of the problem in the fix linked to
> above, and I don't think that patch solves the underlying issue, even if
> it may make the race window somewhat smaller. More details below.
>
It's the same condition you describe below: the enable does not go
through, and we bail out because we are in ST_DISCONNECTED.
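
To make that concrete, the enable path has roughly the following shape
(a paraphrased sketch of dp_bridge_atomic_enable() in dp_display.c, not
the exact driver code):

static void dp_bridge_atomic_enable(struct drm_bridge *drm_bridge,
                                    struct drm_bridge_state *old_bridge_state)
{
        ...
        /* An rpm reference is taken before the hpd state is checked. */
        pm_runtime_get_sync(&dp_display->pdev->dev);

        mutex_lock(&dp_display->event_mutex);
        state = dp_display->hpd_state;
        if (state != ST_DISPLAY_OFF && state != ST_MAINLINK_READY) {
                mutex_unlock(&dp_display->event_mutex);
                /* Bailing out here in ST_DISCONNECTED returns without a
                 * matching pm_runtime_put(), hence the rpm leak. */
                return;
        }
        ...
}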
>>>> We now also have Bjorn's call trace from a different Qualcomm platform
>>>> triggering an unclocked access on disconnect, something which would
>>>> trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>>
>>>> [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
>>>> [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
>>>> [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
>>>> [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
>>>> [ 2030.379642] Call trace:
>>
>>>> [ 2030.379722] wait_for_completion_timeout+0x14/0x20
>>>> [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
>>>> [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
>>>> [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
>>>> [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
>>>> [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
>>>> [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
>>>> [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
>>>> [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
>>>> [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
>>>> [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
>>>> [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
>>>> [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
>>>> [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
>>>> [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
>>>> [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
>>>> [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
>>>> [ 2030.381241] process_scheduled_works+0x19c/0x2cc
>>>> [ 2030.381249] worker_thread+0x290/0x3cc
>>>> [ 2030.381255] kthread+0xfc/0x184
>>>> [ 2030.381260] ret_from_fork+0x10/0x20
>>>>
>>>> The above could happen if the convoluted hotplug state machine breaks
>>>> down so that the device is runtime suspended before
>>>> dp_bridge_atomic_disable() is called.
>>
>>> Yes, the state machine got broken, and I have explained how in the
>>> commit text of the RFC. But it's not necessarily due to PM runtime;
>>> rather, a sequence of events from userspace exposes this breakage.
>>
>> After looking at this some more today, I agree with you. The
>> observations I've reported are consistent with what would happen if the
>> link clock is disabled when dp_bridge_atomic_disable() is called.
>>
>> That clock is disabled in dp_bridge_atomic_post_disable() before runtime
>> suspending, but perhaps there are some further paths that can end up
>> disabling it.
>
> It turns out there are paths like that, and we hit those more often
> before reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()").
>
> Specifically, when a previous connect attempt did not enable the bridge
> fully, so that it is still in the ST_MAINLINK_READY state when we receive
> a disconnect event, dp_hpd_unplug_handle() will turn off the link clock:
>
> [ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
> [ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
> [ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link
>
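Right, and for reference the relevant branch looks roughly like this (a
paraphrased sketch of dp_hpd_unplug_handle() in dp_display.c, not the
exact driver code):

static int dp_hpd_unplug_handle(struct dp_display_private *dp, u32 data)
{
        ...
        state = dp->hpd_state;

        /* The link was brought up but the bridge was never fully
         * enabled: tear the link down here, which also disables the
         * link clock.
         */
        if (state == ST_MAINLINK_READY) {
                dp_ctrl_off_link(dp->ctrl);
                dp_display_host_phy_exit(dp);
                dp->hpd_state = ST_DISCONNECTED;
        }
        ...
}
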
> A racing connect event, such as the one I described earlier, can then
> try to enable the bridge again, but dp_bridge_atomic_enable() just bails
> out early (and leaks an rpm reference) because we're now in
> ST_DISCONNECTED:
>
> [ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
> [ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
> [ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
> [ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
> [ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
> [ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
> [ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)
>
> Clearing link_ready already in dp_display_notify_disconnect() would make
> the race window slightly smaller, but it would essentially just paper
> over the bug as the events are still not serialised. Notably, there is
> no user space interaction involved here and it's the spurious connect
> event that triggers the bridge enable.
>
Yes, it only narrows the race condition window. The issue can still
happen if the commit/modeset was issued before we marked link_ready as
false.
And yes, I was only targeting a short-term fix until we rework the HPD
handling. That will happen only incrementally, as it's a delicate piece
of code.
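
For reference, the short-term fix essentially clears link_ready at
notify-disconnect time, along these lines (an illustrative sketch of the
approach, not the actual patch; treat the signature and helper names as
approximate):

static void dp_display_notify_disconnect(struct device *dev)
{
        struct dp_display_private *dp = dev_get_dp_display_private(dev);

        /* Clear link_ready before queueing the notification so that a
         * racing commit/modeset observes the link as down instead of
         * trying to enable the bridge against a powered-down link.
         */
        dp->dp_display.link_ready = false;

        dp_add_event(dp, EV_USER_NOTIFICATION, false, 0);
}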
> When the fbdev hotplug code later disables the never-enabled bridge,
> things go boom:
>
> [ 204.649072] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
> [ 204.650378] [CONNECTOR:35:DP-2] status updated from connected to disconnected
> [ 204.651111] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable
>
> as the link clock has already been disabled when accessing the link
> registers.
>
> The stack trace for the bridge enable above is:
>
> [ 204.553922] dp_bridge_atomic_enable+0x40/0x2f0 [msm]
> [ 204.555241] drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
> [ 204.556557] drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
> [ 204.557853] msm_atomic_commit_tail+0x204/0x804 [msm]
> [ 204.559173] commit_tail+0xa4/0x18c [drm_kms_helper]
> [ 204.560450] drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
> [ 204.561743] drm_atomic_commit+0xa4/0xdc [drm]
> [ 204.563065] drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
> [ 204.564402] drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
> [ 204.565733] drm_client_modeset_commit+0x30/0x58 [drm]
> [ 204.567055] __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
> [ 204.568381] drm_fb_helper_hotplug_event.part.0+0xd4/0x110 [drm_kms_helper]
> [ 204.569708] drm_fb_helper_hotplug_event+0x38/0x44 [drm_kms_helper]
> [ 204.571032] msm_fbdev_client_hotplug+0x28/0xd4 [msm]
> [ 204.572395] drm_client_dev_hotplug+0xcc/0x130 [drm]
> [ 204.573755] drm_kms_helper_connector_hotplug_event+0x34/0x44 [drm_kms_helper]
> [ 204.575114] drm_bridge_connector_hpd_cb+0x90/0xa4 [drm_kms_helper]
> [ 204.576465] drm_bridge_hpd_notify+0x40/0x5c [drm]
> [ 204.577842] drm_aux_hpd_bridge_notify+0x18/0x28 [aux_hpd_bridge]
> [ 204.579184] pmic_glink_altmode_worker+0xc0/0x23c [pmic_glink_altmode]
>
>>>> For some reason, possibly due to unrelated changes in timing, possibly
>>>> after the hotplug revert, I am no longer able to reproduce the reset
>>>> with 6.8-rc7 on the X13s.
>>
>>> I do not know how the hotplug revert fixed this stack trace, because I
>>> think this can still happen.
>
> So, while it may still be theoretically possible to hit the resets after
> the revert, the HPD notify revert effectively "fixed" the regression in
> 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
> half-initialised bridge).
>
Not entirely. The bug that was reported before commit e467e0bde881
("drm/msm/dp: use drm_bridge_hpd_notify()") landed is a classic example
of how this issue can happen with userspace involvement, and not just
with the fbdev client as in your case.
> It seems the hotplug state machine needs to be reworked completely, but
> at least we're roughly back where we were with 6.7 (including that the
> bus clocks will never be turned off because of the rpm leaks on
> disconnect).
>
Yes, we already landed that revert, but I am also planning to land the
patch I posted until we rework HPD.
> Johan