[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1f31afce-6807-5d33-15e2-78bc42894764@linaro.org>
Date: Fri, 26 Aug 2022 11:19:49 +0300
From: Dmitry Baryshkov <dmitry.baryshkov@...aro.org>
To: Abhinav Kumar <quic_abhinavk@...cinc.com>
Cc: Kuogee Hsieh <quic_khsieh@...cinc.com>, robdclark@...il.com,
sean@...rly.run, swboyd@...omium.org, dianders@...omium.org,
vkoul@...nel.org, daniel@...ll.ch, airlied@...ux.ie,
agross@...nel.org, bjorn.andersson@...aro.org,
quic_sbillaka@...cinc.com, linux-arm-msm@...r.kernel.org,
linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
quic_aravindh@...cinc.com, freedreno@...ts.freedesktop.org
Subject: Re: [PATCH] drm/msm/dp: add atomic_check to bridge ops
On 24/08/2022 22:16, Abhinav Kumar wrote:
>
>
> On 8/24/2022 1:25 AM, Dmitry Baryshkov wrote:
>> On Wed, 24 Aug 2022 at 01:59, Abhinav Kumar
>> <quic_abhinavk@...cinc.com> wrote:
>>>
>>>
>>>
>>> On 8/23/2022 3:41 PM, Dmitry Baryshkov wrote:
>>>> On Wed, 24 Aug 2022 at 01:07, Abhinav Kumar
>>>> <quic_abhinavk@...cinc.com> wrote:
>>>>> On 8/22/2022 11:33 AM, Dmitry Baryshkov wrote:
>>>>>> On 22/08/2022 20:32, Abhinav Kumar wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/22/2022 9:49 AM, Dmitry Baryshkov wrote:
>>>>>>>> On 22/08/2022 19:38, Abhinav Kumar wrote:
>>>>>>>>> Hi Dmitry
>>>>>>>>>
>>>>>>>>> On 8/22/2022 9:18 AM, Dmitry Baryshkov wrote:
>>>>>>>>>> On 17/08/2022 21:01, Kuogee Hsieh wrote:
>>>>>>>>>>> DRM commit_tails() will disable downstream
>>>>>>>>>>> crtc/encoder/bridge if
>>>>>>>>>>> both disable crtc is required and crtc->active is set before
>>>>>>>>>>> pushing
>>>>>>>>>>> a new frame downstream.
>>>>>>>>>>>
>>>>>>>>>>> There is a rare case that user space display manager issue an
>>>>>>>>>>> extra
>>>>>>>>>>> screen update immediately followed by close DRM device while
>>>>>>>>>>> down
>>>>>>>>>>> stream display interface is disabled. This extra screen
>>>>>>>>>>> update will
>>>>>>>>>>> timeout due to the downstream interface is disabled but will
>>>>>>>>>>> cause
>>>>>>>>>>> crtc->active be set. Hence the followed commit_tails() called by
>>>>>>>>>>> drm_release() will pass the disable downstream
>>>>>>>>>>> crtc/encoder/bridge
>>>>>>>>>>> conditions checking even downstream interface is disabled.
>>>>>>>>>>> This cause the crash to happen at dp_bridge_disable() due to it
>>>>>>>>>>> trying
>>>>>>>>>>> to access the main link register to push the idle pattern out
>>>>>>>>>>> while main
>>>>>>>>>>> link clocks is disabled.
>>>>>>>>>>>
>>>>>>>>>>> This patch adds atomic_check to prevent the extra frame will not
>>>>>>>>>>> be pushed down if display interface is down so that crtc->active
>>>>>>>>>>> will not be set neither. This will fail the conditions checking
>>>>>>>>>>> of disabling down stream crtc/encoder/bridge which prevent
>>>>>>>>>>> drm_release() from calling dp_bridge_disable() so that crash
>>>>>>>>>>> at dp_bridge_disable() prevented.
>>>>>>>>>>
>>>>>>>>>> I must admit I had troubles parsing this description. However
>>>>>>>>>> if I
>>>>>>>>>> got you right, I think the check that the main link clock is
>>>>>>>>>> running in the dp_bridge_disable() or dp_ctrl_push_idle()
>>>>>>>>>> would be
>>>>>>>>>> a better fix.
>>>>>>>>>
>>>>>>>>> Originally, thats what was posted
>>>>>>>>> https://patchwork.freedesktop.org/patch/496984/.
>>>>>>>>
>>>>>>>> This patch is also not so correct from my POV. It checks for the
>>>>>>>> hpd
>>>>>>>> status, while in reality it should check for main link clocks being
>>>>>>>> enabled.
>>>>>>>>
>>>>>>>
>>>>>>> We can push another fix to check for the clk state instead of the
>>>>>>> hpd
>>>>>>> status. But I must say we are again just masking something which the
>>>>>>> fwk should have avoided isnt it?
>>>>>>>
>>>>>>> As per the doc in the include/drm/drm_bridge.h it says,
>>>>>>>
>>>>>>> "*
>>>>>>> * The bridge can assume that the display pipe (i.e. clocks
>>>>>>> and timing
>>>>>>> * signals) feeding it is still running when this callback is
>>>>>>> called.
>>>>>>> *"
>>>>>>
>>>>>> Yes, that's what I meant about this chunk begging to go to the
>>>>>> core. In
>>>>>> my opinion, if we are talking about the disconnected sinks, it is the
>>>>>> framework who should disallow submitting the frames to the
>>>>>> disconnected
>>>>>> sinks.
>>>>>>
>>>>>>>
>>>>>>> By adding an extra layers of protection in the driver, we are just
>>>>>>> avoiding another issue but the commit should not have been issued in
>>>>>>> the first place.
>>>>>>>
>>>>>>> So shouldnt we do both then? That is add protection to check if
>>>>>>> clock
>>>>>>> is ON and also, reject commits when display is disconnected.
>>>>>>>
>>>>>>>>>
>>>>>>>>> Then it seemed like we were just protecting against an issue in
>>>>>>>>> the
>>>>>>>>> framework which was allowing the frames to be pushed even after
>>>>>>>>> the
>>>>>>>>> display was disconnected. The DP driver did send out the
>>>>>>>>> disconnect
>>>>>>>>> event correctly and as per the logs, this frame came down after
>>>>>>>>> that
>>>>>>>>> and the DRM fwk did allow it.
>>>>>>>>>
>>>>>>>>> So after discussing on IRC with Rob, we came up with this
>>>>>>>>> approach that
>>>>>>>>> if the display is not connected, then atomic_check should fail.
>>>>>>>>> That
>>>>>>>>> way the commit will not happen.
>>>>>>>>>
>>>>>>>>> Just seemed a bit cleaner instead of adding all our protections.
>>>>>>>>
>>>>>>>> The check to fail atomic_check if display is not connected seems
>>>>>>>> out
>>>>>>>> of place. In its current way it begs go to the upper layer,
>>>>>>>> forbidding using disconnected sinks for all the drivers. There is
>>>>>>>> nothing special in the MSM DP driver with respect to the HPD events
>>>>>>>> processing and failing atomic_check() based on that.
>>>>>>>>
>>>>>>>
>>>>>>> Why all the drivers? This is only for MSM DP bridge.
>>>>>>
>>>>>> Yes, we change the MSM DRM driver. But the check is generic
>>>>>> enough. I'm
>>>>>> not actually insisting on pushing the check to the core, just
>>>>>> trying to
>>>>>> understand the real cause here.
>>>>>>
>>>>>>>
>>>>>
>>>>> I actually wanted to push this to the core and thats what I had
>>>>> originally asked on IRC because it does seem to be generic enough that
>>>>> it should belong to the core but after discussion with Rob on
>>>>> freedreno,
>>>>> he felt this was a better approach because for some of the legacy
>>>>> connectors like VGA, this need not belong to the DRM core, hence we
>>>>> went
>>>>> with this approach.
>>>>
>>>> It might be better to whitelist such connectors (S-VIDEO/composite
>>>> comes to my mind rather than VGA).
>>>
>>> I am fine with that approach, if Rob is onboard with that.
>>>
>>>>
>>>>>>>>>>> SError Interrupt on CPU7, code 0x00000000be000411 -- SError
>>>>>>>>>>> CPU: 7 PID: 3878 Comm: Xorg Not tainted 5.19.0-stb-cbq #19
>>>>>>>>>>> Hardware name: Google Lazor (rev3 - 8) (DT)
>>>>>>>>>>> pstate: a04000c9 (NzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>>>> pc : __cmpxchg_case_acq_32+0x14/0x2c
>>>>>>>>>>> lr : do_raw_spin_lock+0xa4/0xdc
>>>>>>>>>>> sp : ffffffc01092b6a0
>>>>>>>>>>> x29: ffffffc01092b6a0 x28: 0000000000000028 x27:
>>>>>>>>>>> 0000000000000038
>>>>>>>>>>> x26: 0000000000000004 x25: ffffffd2973dce48 x24:
>>>>>>>>>>> 0000000000000000
>>>>>>>>>>> x23: 00000000ffffffff x22: 00000000ffffffff x21:
>>>>>>>>>>> ffffffd2978d0008
>>>>>>>>>>> x20: ffffffd2978d0008 x19: ffffff80ff759fc0 x18:
>>>>>>>>>>> 0000000000000000
>>>>>>>>>>> x17: 004800a501260460 x16: 0441043b04600438 x15:
>>>>>>>>>>> 04380000089807d0
>>>>>>>>>>> x14: 07b0089807800780 x13: 0000000000000000 x12:
>>>>>>>>>>> 0000000000000000
>>>>>>>>>>> x11: 0000000000000438 x10: 00000000000007d0 x9 :
>>>>>>>>>>> ffffffd2973e09e4
>>>>>>>>>>> x8 : ffffff8092d53300 x7 : ffffff808902e8b8 x6 :
>>>>>>>>>>> 0000000000000001
>>>>>>>>>>> x5 : ffffff808902e880 x4 : 0000000000000000 x3 :
>>>>>>>>>>> ffffff80ff759fc0
>>>>>>>>>>> x2 : 0000000000000001 x1 : 0000000000000000 x0 :
>>>>>>>>>>> ffffff80ff759fc0
>>>>>>>>>>> Kernel panic - not syncing: Asynchronous SError Interrupt
>>>>>>>>>>> CPU: 7 PID: 3878 Comm: Xorg Not tainted 5.19.0-stb-cbq #19
>>>>>>>>>>> Hardware name: Google Lazor (rev3 - 8) (DT)
>>>>>>>>>>> Call trace:
>>>>>>>>>>> dump_backtrace.part.0+0xbc/0xe4
>>>>>>>>>>> show_stack+0x24/0x70
>>>>>>>>>>> dump_stack_lvl+0x68/0x84
>>>>>>>>>>> dump_stack+0x18/0x34
>>>>>>>>>>> panic+0x14c/0x32c
>>>>>>>>>>> nmi_panic+0x58/0x7c
>>>>>>>>>>> arm64_serror_panic+0x78/0x84
>>>>>>>>>>> do_serror+0x40/0x64
>>>>>>>>>>> el1h_64_error_handler+0x30/0x48
>>>>>>>>>>> el1h_64_error+0x68/0x6c
>>>>>>>>>>> __cmpxchg_case_acq_32+0x14/0x2c
>>>>>>>>>>> _raw_spin_lock_irqsave+0x38/0x4c
>>>>>>
>>>>>> You know, after re-reading the trace, I could not help but notice
>>>>>> that
>>>>>> the issue seems to be related to completion/timer/spinlock memory
>>>>>> becoming unavailable rather than disabling the main link clock.
>>>>>> See, the SError comes in the spin_lock path, not during register
>>>>>> read.
>>>>>>
>>>>>> Thus I think the commit message is a bit misleading.
>>>>>>
>>>>>
>>>>> No, this issue is due to unclocked access. Please check this part
>>>>> of the
>>>>> stack:
>>>>
>>>> Well, if it were for the unlocked access, we would see SError on the
>>>> register access, wouldn't we? However in this case the SError comes
>>>> from the raw spinlock code.
>>>
>>> This is not uncommon. With unclocked access, we have seen in the past
>>> that sometimes the stack is off by one line. The fact that this issue
>>> got resolved even with the older version of the patch
>>> https://patchwork.freedesktop.org/patch/496984/ is pointing towards an
>>> unclocked access and not the dp/dp->ctrl memory pointers.
>>
>> As far as I understood, the bug is reproducible. Just to make me feel
>> safe, can we please:
>> - either have a trace which shows when the clocks are disabled (or
>> not enabled)
>> - or make sure that keeping the mainlink clock on would also mitigate
>> the issue?
>
> Yes, this trace is already available with all the drm_dbg_dp messages
> enabled. Please refer to the attachment named
> 2022-08-15-dmesg-drm-4K-crash.txt in the bug
> https://gitlab.freedesktop.org/drm/msm/-/issues/17.
>
> You can jump to this section of the log.
>
> [ 99.191216] msm_dpu ae01000.mdp: [drm:dp_display_host_phy_exit]
> type=10 core_init=1 phy_init=1
> [ 99.192354] [drm:dp_ctrl_phy_exit] phy=00000000b9b91350 init=0
> power_on=0
> [ 99.192369] msm_dpu ae01000.mdp:
> [drm:dp_display_disable.constprop.0.isra.0] sink count: 1
>
> Here is the dp_display_disable() you were looking for.
>
> [ 99.192378] msm_dpu ae01000.mdp: [drm:dp_bridge_post_disable] type=10
> Done
> [ 99.192389] msm_dpu ae01000.mdp:
> [drm:drm_atomic_helper_commit_modeset_disables] disabling [CRTC:60:crtc-1]
> [ 99.192561] [drm:dpu_crtc_disable] no frames pending
> [ 99.192571] [drm:dpu_core_perf_crtc_update] crtc:60 stop_req:1
> core_clk:200000000
> [ 99.192581] [drm:dpu_core_perf_crtc_update] crtc=60 disable
> [ 99.192588] [drm:_dpu_core_perf_crtc_update_bus] crtc=59 bw=0 paths:1
> [ 99.192595] [drm:_dpu_core_perf_crtc_update_bus] crtc=60 bw=0 paths:1
> [ 99.192700] [drm:dpu_core_perf_crtc_update] clk:200000000
> [ 99.192714] [drm:dpu_core_perf_crtc_update] update clk rate =
> 200000000 HZ
> [ 99.192729] msm_dpu ae01000.mdp:
> [drm:drm_atomic_helper_commit_modeset_disables] modeset on
> [ENCODER:33:TMDS-33]
> [ 99.192738] [drm:dpu_encoder_virt_atomic_mode_set] enc33
> [ 99.192749] [drm:dpu_crtc_atomic_begin] crtc59
>>
>>>
>>>>
>>>>> >>>>>> wait_for_completion_timeout+0x2c/0x54
>>>>> >>>>>> dp_ctrl_push_idle+0x40/0x88
>>>>> >>>>>> dp_bridge_disable+0x24/0x30
>>>>> >>>>>> drm_atomic_bridge_chain_disable+0x90/0xbc
>>>>> >>>>>> drm_atomic_helper_commit_modeset_disables+0x198/0x444
>>>>> >>>>>> msm_atomic_commit_tail+0x1d0/0x374
>>>>> >>>>>> commit_tail+0x80/0x108
>>>>> >>>>>> drm_atomic_helper_commit+0x118/0x11c
>>>>> >>>>>> drm_atomic_commit+0xb4/0xe0
>>>>> >>>>>> drm_client_modeset_commit_atomic+0x184/0x224
>>>>> >>>>>> drm_client_modeset_commit_locked+0x58/0x160
>>>>> >>>>>> drm_client_modeset_commit+0x3c/0x64
>>>>>
>>>>>> Can we please get a trace checking which calls were actually made for
>>>>>> the dp bridge and if the dp/dp->ctrl memory pointers are correct?
>>>>>>
>>>>>> I do not see the dp_display_disable() being called. Maybe I just
>>>>>> missed
>>>>>> the call.
>>>>>>
>>>>>
>>>>> Yes it is called, please refer to the above part of the stack that I
>>>>> have pasted.
>>>>
>>>> The stacktrace mentions dp_bridge_disable(), not dp_display_disable()
>>>> (which I asked for).
>>>>
>>>
>>> So whats happening here is the crash is happening in
>>> dp_bridge_disable().
>>>
>>> dp_display_disable() is called from post_disable() thats why it doesnt
>>> show up in the stack.
>>>
>>
>> Yes. But the mainlink clocks are disabled in dp_display_disable()
>> that's why I'm asking if the function was called at all.
>
> Now, I see why you were asking about dp_display_disable(). So basically
> your question is that when did dp_display_disable() happen that disabled
> the clocks causing this issue.
>
> dp_display_disable() happened when the cable was disconnected as shown
> in the above section of the logs.
>
> We also sent the disconnected uevent to the usermode. But this commit is
> happening from the drm_lastclose() context which doesnt check the
> connection status.
>
> This leads to a commit after the cable has been disconnected causing the
> unclocked access.
>
> You can refer this log and comment if something is still not clear to you.
I have spent some time comparing the log and the programming logic.
I found what I was looking for: a safeguard for not doing the disable
twice. The disable_outputs() function, the one which calls
drm_atomic_bridge_chain_disable() has a logical check which should have
acted as a safety net here: it checks whether crtc_needs_disable().
Can you please doublecheck why doesn't it reflect the fact that CRTC
doesn't need disabling as it has been already disabled. If I understand
correctly this boils down to CRTC's old_state->active being set, while
the CRTC has been effectively disabled.
--
With best wishes
Dmitry
Powered by blists - more mailing lists