[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87il227yqv.fsf@baylibre.com>
Date: Mon, 04 Mar 2024 17:22:48 +0100
From: Mattijs Korpershoek <mkorpershoek@...libre.com>
To: Nicolas Dufresne <nicolas.dufresne@...labora.com>, Nas Chung
<nas.chung@...psnmedia.com>, Jackson Lee <jackson.lee@...psnmedia.com>,
Mauro Carvalho Chehab <mchehab@...nel.org>
Cc: Guillaume La Roque <glaroque@...libre.com>, Brandon Brnich
<b-brnich@...com>, Robert Beckett <bob.beckett@...labora.com>, Sebastian
Fricke <sebastian.fricke@...labora.com>, linux-media@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] media: chips-media: wave5: Call v4l2_m2m_job_finish()
in job_abort()
Hi Nicolas,
Thank you for your prompt reply.
On ven., mars 01, 2024 at 09:01, Nicolas Dufresne <nicolas.dufresne@...labora.com> wrote:
> Hi,
>
> Le mercredi 28 février 2024 à 17:12 +0100, Mattijs Korpershoek a écrit :
>> When aborting a stream, it's possible that v4l2_m2m_cancel_job()
>> remains stuck:
>>
>> [ 3498.490038][ T1] sysrq: Show Blocked State
>> [ 3498.511754][ T1] task:V4L2DecodeCompo state:D stack:12480 pid:2387 ppid:1 flags:0x04000809
>> [ 3498.521153][ T1] Call trace:
>> [ 3498.524333][ T1] __switch_to+0x174/0x338
>> [ 3498.528611][ T1] __schedule+0x5ec/0x9cc
>> [ 3498.532795][ T1] schedule+0x84/0xf0
>> [ 3498.536630][ T1] v4l2_m2m_cancel_job+0x118/0x194
>> [ 3498.541595][ T1] v4l2_m2m_streamoff+0x34/0x13c
>> [ 3498.546384][ T1] v4l2_m2m_ioctl_streamoff+0x20/0x30
>> [ 3498.551607][ T1] v4l_streamoff+0x44/0x54
>> [ 3498.555877][ T1] __video_do_ioctl+0x388/0x4dc
>> [ 3498.560580][ T1] video_usercopy+0x618/0xa0c
>> [ 3498.565109][ T1] video_ioctl2+0x20/0x30
>> [ 3498.569292][ T1] v4l2_ioctl+0x74/0x8c
>> [ 3498.573300][ T1] __arm64_sys_ioctl+0xb0/0xec
>> [ 3498.577918][ T1] invoke_syscall+0x60/0x124
>> [ 3498.582368][ T1] el0_svc_common+0x90/0xfc
>> [ 3498.586724][ T1] do_el0_svc+0x34/0xb8
>> [ 3498.590733][ T1] el0_svc+0x2c/0xa4
>> [ 3498.594480][ T1] el0t_64_sync_handler+0x68/0xb4
>> [ 3498.599354][ T1] el0t_64_sync+0x1a4/0x1a8
>> [ 3498.603832][ T1] sysrq: Kill All Tasks
>>
>> According to job_abort() documentation from v4l2_m2m_ops:
>>
>> After the driver performs the necessary steps, it has to call
>> v4l2_m2m_job_finish() or v4l2_m2m_buf_done_and_job_finish() as
>> if the transaction ended normally.
>>
>> This is not done in wave5_vpu_dec_job_abort(). Neither switch_state() nor
>> wave5_vpu_dec_set_eos_on_firmware() does this.
>
> The doc said "the driver", not job_abort() specifically ...
Indeed. Seems I wanted to convince myself that this was job_abort()
specific. Sorry about that.
>
>>
>> Add the missing call to fix the v4l2_m2m_cancel_job() hangs.
>>
>> Fixes: 9707a6254a8a ("media: chips-media: wave5: Add the v4l2 layer")
>> Signed-off-by: Mattijs Korpershoek <mkorpershoek@...libre.com>
>> ---
>> This has been tested on the AM62Px SK EVM using Android 14.
>> See:
>> https://www.ti.com/tool/PROCESSOR-SDK-AM62P
>>
>> Note: while this is has not been tested on an upstream linux tree, I
>> believe the fix is still valid for mainline and I would like to get
>> some feedback on it.
>>
>> Thank you in advance for your time.
>> ---
>> drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> index ef227af72348..b03e3633a1bc 100644
>> --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
>> @@ -1715,6 +1715,7 @@ static void wave5_vpu_dec_device_run(void *priv)
>> static void wave5_vpu_dec_job_abort(void *priv)
>> {
>> struct vpu_instance *inst = priv;
>> + struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
>> int ret;
>>
>> ret = switch_state(inst, VPU_INST_STATE_STOP);
>> @@ -1725,6 +1726,8 @@ static void wave5_vpu_dec_job_abort(void *priv)
>> if (ret)
>> dev_warn(inst->dev->dev,
>> "Setting EOS for the bitstream, fail: %d\n", ret);
>> +
>> + v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);
>
> Wave5 firmware have no function to cancel pending jobs. By finishing the job
> without caring about the firmware state, wave5_vpu_dec_finish_decode() will be
> called concurrently to another job. This change will effectively breaks seeking,
> and will cause warning and possibly memory corruption.
Ah. That's not what I intended. This patch would completely break the
driver.
Thank you for explaining that to me.
>
> In principle, setting the EOS flag should be enough to ensure that the drain
> sequence is initiated, and that should allow finish_decoder() to be called,
> which is the only clean way to get finish_job() to be called.
>
> This driver implementation uses PIC_END operating mode of the IP, that ensure
> that each submitted job is assumed to be complete, which means each RUN will
> lead to a matching IRQ. We had a similar stall during development which was
> fixed with a firmware update, are you sure you have the very latest firmware ?
Interesting.
I double checked the firmware from linux-firmware:
$ ~/work/upstream/linux-firmware/ main md5sum cnm/wave521c_k3_codec_fw.bin
02c5faa5405559bd59a7361a32c2695a cnm/wave521c_k3_codec_fw.bin
$ ~/ adb shell md5sum /vendor/firmware/cnm/wave521c_k3_codec_fw.bin
02c5faa5405559bd59a7361a32c2695a /vendor/firmware/cnm/wave521c_k3_codec_fwbin
Which should be: "FW version : 1.0.3"
> Any chance you can use v4l2-tracer to share what your software have been doing ?
I did not know v4l2-tracer.
I tried to run it on android but it seems it segfaults when overriding
mmap().
I will continue to try to get some traces.
>
> What you can though though, to fortify this a little, is to introduce a watchdog
> here. You can trigger it from abort, and on timeout, you will have to fully
> reset and reboot the chip (which is not very fast, you don't want to do this at
> all time). When the reset have completed, you will have to carefully reset the
> driver state before you can safely continue. You'll need to add thread safe
> protection against spurious finish_decode() call, in case the watchdog timeout
> raced with the firmware finally coming back to life.
Ok, I can look into that as well.
Given that i'm not super familiar with multimedia, this has helped me a
lot.
Thanks!
>
> regards,
> Nicolas
>
> p.s. you should perhaps trace the firmware job counters to try and understand
> more about your specific hang.
I will look into it.
>
>> }
>>
>> static int wave5_vpu_dec_job_ready(void *priv)
>>
>> ---
>> base-commit: 8c64f4cdf4e6cc5682c52523713af8c39c94e6d5
>> change-id: 20240228-wave5-fix-abort-f72d25881cbd
>>
>> Best regards,
Powered by blists - more mailing lists