[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <f8b51be2-9121-4313-a575-d7aea330b687@linaro.org>
Date: Fri, 29 Nov 2024 10:44:09 +0200
From: Vladimir Zapolskiy <vladimir.zapolskiy@...aro.org>
To: barnabas.czeman@...nlining.org
Cc: Yassine Oudjana <y.oudjana@...tonmail.com>, Robert Foss
<rfoss@...nel.org>, Todor Tomov <todor.too@...il.com>,
Bryan O'Donoghue <bryan.odonoghue@...aro.org>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Konrad Dybcio <konradybcio@...nel.org>, Hans Verkuil <hverkuil@...all.nl>,
linux-media@...r.kernel.org, linux-arm-msm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] media: qcom: camss: fix VFE pm domain off
On 11/28/24 16:24, barnabas.czeman@...nlining.org wrote:
> On 2024-11-28 14:22, Vladimir Zapolskiy wrote:
>> On 11/28/24 12:27, barnabas.czeman@...nlining.org wrote:
>>> On 2024-11-28 10:10, Vladimir Zapolskiy wrote:
>>>> On 11/27/24 12:01, Yassine Oudjana wrote:
>>>>> On 22/11/2024 5:06 am, Barnabás Czémán wrote:
>>>>>> Fix NULL pointer check before device_link_del
>>>>>> is called.
>>>>
>>>> The intention is clear, but the context of the change is completely
>>>> lost.
>>>>
>>>>>> Fixes: eb73facec2c2 ("media: qcom: camss: Use common VFE
>>>>>> pm_domain_on/pm_domain_off where applicable")
>>>>
>>>> It's invalid, the change is not a fix.
>>> I don't agree this patch is fixing NULL pointer dereference.
>>>
>>> [ 92.989120] Unable to handle kernel NULL pointer dereference at
>>> virtual address 000000000000032c
>>> [ 92.989170] Mem abort info:
>>> [ 92.989186] ESR = 0x0000000096000004
>>> [ 92.989203] EC = 0x25: DABT (current EL), IL = 32 bits
>>> [ 92.989221] SET = 0, FnV = 0
>>> [ 92.989237] EA = 0, S1PTW = 0
>>> [ 92.989253] FSC = 0x04: level 0 translation fault
>>> [ 92.989270] Data abort info:
>>> [ 92.989284] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>>> [ 92.989300] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>>> [ 92.989317] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>>> [ 92.989335] user pgtable: 4k pages, 48-bit VAs,
>>> pgdp=00000001218a8000
>>> [ 92.989354] [000000000000032c] pgd=0000000000000000,
>>> p4d=0000000000000000
>>> [ 92.989389] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>> [ 92.989408] Modules linked in: q6afe_dai q6asm_dai q6routing q6adm
>>> q6asm q6afe snd_q6dsp_common panel_lgphilips_sw43101 q6core venus_enc
>>> venus_dec videobuf2_dma_contig imx318 ak7375 snd_soc_wcd9335
>>> regmap_slimbus snd_soc_wcd_classh apr snd_soc_apq8096
>>> snd_soc_qcom_common snd_soc_core qcom_camss msm v4l2_fwnode
>>> snd_compress
>>> ath10k_pci v4l2_async ath10k_core snd_pcm nxp_nci_i2c drm_exec nxp_nci
>>> venus_core videobuf2_dma_sg snd_timer ath v4l2_mem2mem
>>> videobuf2_memops
>>> mac80211 drm_dp_aux_bus snd gpu_sched nci videobuf2_v4l2 libarc4
>>> soundcore videodev nfc slim_qcom_ngd_ctrl drm_display_helper hci_uart
>>> pdr_interface videobuf2_common btqca drm_kms_helper slimbus
>>> i2c_qcom_cci
>>> bluetooth mc qcom_q6v5_pas qcom_q6v5_mss qcom_pil_info qcom_q6v5
>>> qcom_sysmon qcom_common qmi_helpers mdt_loader socinfo rmtfs_mem
>>> pwm_ir_tx cfg80211 rfkill zram zsmalloc atmel_mxt_ts drm
>>> drm_panel_orientation_quirks dm_mod ip_tables
>>> [ 92.989981] CPU: 2 PID: 1365 Comm: pool-megapixels Not tainted
>>> 6.9.0-rc3+ #10
>>> [ 92.990003] Hardware name: Xiaomi Mi Note 2 (DT)
>>> [ 92.990020] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [ 92.990042] pc : device_link_put_kref+0xc/0xb8
>>> [ 92.990071] lr : device_link_del+0x30/0x48
>>> [ 92.990089] sp : ffff80008a5db9d0
>>> [ 92.990105] x29: ffff80008a5db9d0 x28: 0000000000000001 x27:
>>> 0000000000000000
>>> [ 92.990143] x26: 0000000000000000 x25: ffff0000e79d9100 x24:
>>> ffff0000e79d9500
>>> [ 92.990180] x23: ffff0000943f8568 x22: 00000000ffffffff x21:
>>> 0000000000000000
>>> [ 92.990217] x20: 0000000000000000 x19: ffff800081352498 x18:
>>> 0000000000000000
>>> [ 92.990253] x17: 0000000000000000 x16: 0000000000000000 x15:
>>> 0000000000000168
>>> [ 92.990288] x14: 0000000000000000 x13: 0000000000000191 x12:
>>> ffff800081259d58
>>> [ 92.990324] x11: 0000000000000001 x10: 0000000000000a60 x9 :
>>> ffff80008a5db7e0
>>> [ 92.990359] x8 : ffff0000e79d9bc0 x7 : 0000000000000004 x6 :
>>> 0000000000000190
>>> [ 92.990396] x5 : 0000000000000057 x4 : 0000000000000000 x3 :
>>> 0000000000000000
>>> [ 92.990430] x2 : ffff0000e79d9100 x1 : 0000000000000000 x0 :
>>> 0000000000000000
>>> [ 92.990466] Call trace:
>>> [ 92.990482] device_link_put_kref+0xc/0xb8
>>> [ 92.990503] device_link_del+0x30/0x48
>>> [ 92.990522] vfe_pm_domain_off+0x24/0x38 [qcom_camss]
>>
>> vfe_pm_domain_off() shall not be called before vfe_pm_domain_on() call.
>>
>> If vfe_pm_domain_on() is called and returns failure, then a media
>> pipeline
>> shall not be started, and vfe_pm_domain_off() shall not be called.
>>
>> If vfe_pm_domain_on() is called and returns success, then
>> vfe->genpd_link
>> is not NULL.
> It can be null if the pm_domain_off is called twice somehow or it is
> called
> before pm_domain_on.
Then this is the problem and this not yet identified problem shall be fixed.
The currently proposed change does not fix it, but hides more efficiently.
> This is the original function, it sets genpd_link to NULL:
>
> void vfe_pm_domain_off(struct vfe_device *vfe)
> {
> if (!vfe->genpd)
> return;
>
> device_link_del(vfe->genpd_link);
> vfe->genpd_link = NULL;
> }
> Other possible case:
> genpd_link can be NULL when pm_domain_on is failing or
> when genpd is NULL
> By the way pm_domain_on checks if genpd_link is NULL:
> vfe->genpd_link = device_link_add(camss->dev, vfe->genpd,
> DL_FLAG_STATELESS |
> DL_FLAG_PM_RUNTIME |
> DL_FLAG_RPM_ACTIVE);
> if (!vfe->genpd_link)
> return -EINVAL;
> It is not calling pm_domain_off on fail:
>
> ret = vfe->res->hw_ops->pm_domain_on(vfe);
> if (ret < 0)
> goto error_pm_domain;
>
> [...]
>
> error_domain_off:
> vfe->res->hw_ops->pm_domain_off(vfe);
>
> error_pm_domain:
> mutex_unlock(&vfe->power_lock);
>
>
> camss_pm_domain_off also calls vfe_pm_domain_off what is used in
> camss-ispif
> void camss_pm_domain_off(struct camss *camss, int id)
> {
> if (id < camss->res->vfe_num) {
> struct vfe_device *vfe = &camss->vfe[id];
>
> vfe->res->hw_ops->pm_domain_off(vfe);
> }
> }
>>
>> Are there any perceptable flaws within the given above reasoning?
> The camss will be broken and system also can recover correctly (only
> force reboot works.).
Well, my question was about the reasoning, why there is another
problem.
I'm worried that the real bug will not be fixed by this change, and
I believe additional analysis is needed here.
>>
>> Since you've encountered a bug and taking the reasoning from above as
>> correct, I believe the bug is present somewhere else, and if so, it
>> will
>> remain unfixed by this change.
>>
>>> [ 92.990566] vfe_put+0x9c/0xd0 [qcom_camss]
>>> [ 92.990601] vfe_set_power+0x48/0x58 [qcom_camss]
>>> [ 92.990636] pipeline_pm_power_one+0x154/0x158 [videodev]
>>> [ 92.990683] pipeline_pm_power+0x74/0xfc [videodev]
>>> [ 92.990720] v4l2_pipeline_pm_use+0x54/0x90 [videodev]
>>> [ 92.990757] v4l2_pipeline_pm_put+0x14/0x34 [videodev]
>>> [ 92.990793] video_release+0x2c/0x44 [qcom_camss]
>>> [ 92.990828] v4l2_release+0xe4/0xec [videodev]
>>
>> Please include the backtrace up to this point into the commit message.
>>
--
Best wishes,
Vladimir
Powered by blists - more mailing lists