linux-kernel - RE: [PATCH v2 2/7] media: chips-media: wave5: Improve performance of decoder

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <SE1P216MB13033207BDFE2A6BCC48999EED6CA@SE1P216MB1303.KORP216.PROD.OUTLOOK.COM>
Date: Wed, 4 Jun 2025 04:09:53 +0000
From: jackson.lee <jackson.lee@...psnmedia.com>
To: Nicolas Dufresne <nicolas.dufresne@...labora.com>, "mchehab@...nel.org"
	<mchehab@...nel.org>, "hverkuil-cisco@...all.nl" <hverkuil-cisco@...all.nl>,
	"bob.beckett@...labora.com" <bob.beckett@...labora.com>
CC: "linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, lafley.kim
	<lafley.kim@...psnmedia.com>, "b-brnich@...com" <b-brnich@...com>,
	"hverkuil@...all.nl" <hverkuil@...all.nl>, Nas Chung
	<nas.chung@...psnmedia.com>
Subject: RE: [PATCH v2 2/7] media: chips-media: wave5: Improve performance of
 decoder

Hi

> -----Original Message-----
> From: Nicolas Dufresne <nicolas.dufresne@...labora.com>
> Sent: Wednesday, May 28, 2025 10:47 PM
> To: jackson.lee <jackson.lee@...psnmedia.com>; mchehab@...nel.org;
> hverkuil-cisco@...all.nl; bob.beckett@...labora.com
> Cc: linux-media@...r.kernel.org; linux-kernel@...r.kernel.org; lafley.kim
> <lafley.kim@...psnmedia.com>; b-brnich@...com; hverkuil@...all.nl; Nas
> Chung <nas.chung@...psnmedia.com>
> Subject: Re: [PATCH v2 2/7] media: chips-media: wave5: Improve performance
> of decoder
> 
> Hi,
> 
> Le mardi 27 mai 2025 à 04:58 +0000, jackson.lee a écrit :
> >
> >
> > > -----Original Message-----
> > > From: Nicolas Dufresne <nicolas.dufresne@...labora.com>
> > > Sent: Saturday, May 24, 2025 2:39 AM
> > > To: jackson.lee <jackson.lee@...psnmedia.com>; mchehab@...nel.org;
> > > hverkuil-cisco@...all.nl; sebastian.fricke@...labora.com;
> > > bob.beckett@...labora.com; dafna.hirschfeld@...labora.com
> > > Cc: linux-media@...r.kernel.org; linux-kernel@...r.kernel.org;
> > > lafley.kim <lafley.kim@...psnmedia.com>; b-brnich@...com;
> > > hverkuil@...all.nl; Nas Chung <nas.chung@...psnmedia.com>
> > > Subject: Re: [PATCH v2 2/7] media: chips-media: wave5: Improve
> > > performance of decoder
> > >
> > > Hi,
> > >
> > > Le jeudi 22 mai 2025 à 16:26 +0900, Jackson.lee a écrit :
> > > > From: Jackson Lee <jackson.lee@...psnmedia.com>
> > > >
> > > > The current decoding method  was to wait until each frame was
> > > > decoded after feeding a bitstream. As a result, performance was
> > > > low and Wave5 could not achieve max pixel processing rate.
> > > >
> > > > Update driver to use an asynchronous approach for decoding and
> > > > feeding a bitstream in order to achieve full capabilities of the
> device.
> > > >
> > > > WAVE5 supports command-queueing to maximize performance by
> > > > pipelining internal commands and by hiding wait cycle taken to
> > > > receive a command from Host processor.
> > > >
> > > > Instead of waiting for each command to be executed before sending
> > > > the next command, Host processor just places all the commands in
> > > > the command-queue and goes on doing other things while the
> > > > commands in the queue are processed by VPU.
> > > >
> > > > While Host processor handles its own tasks, it can receive VPU
> > > > interrupt request (IRQ).
> > > > In this case, host processor can simply exit interrupt service
> > > > routine
> > > > (ISR) without accessing to host interface to read the result of
> > > > the command reported by VPU.
> > > > After host processor completed its tasks, host processor can read
> > > > the command result when host processor needs the reports and does
> > > > response processing.
> > > >
> > > > To archive this goal, the device_run() calls v4l2_m2m_job_finish
> > > > so that next command can be sent to VPU continuously, if there is
> > > > any result, then irq is triggered and gets decoded frames and
> > > > returns them to upper layer.
> > > > Theses processes work independently each other without waiting a
> > > > decoded frame.
> > > >
> > > > Signed-off-by: Jackson Lee <jackson.lee@...psnmedia.com>
> > > > Signed-off-by: Nas Chung <nas.chung@...psnmedia.com>
> > > > ---
> > > >  .../platform/chips-media/wave5/wave5-hw.c     |  2 +-
> > > >  .../chips-media/wave5/wave5-vpu-dec.c         | 84
> > > > +++++++++++--------
> > > >  .../platform/chips-media/wave5/wave5-vpuapi.c |  2 +
> > > >  .../platform/chips-media/wave5/wave5-vpuapi.h |  3 +
> > > >  4 files changed, 57 insertions(+), 34 deletions(-)
> > > >
> > > > diff --git a/drivers/media/platform/chips-media/wave5/wave5-hw.c
> > > > b/drivers/media/platform/chips-media/wave5/wave5-hw.c
> > > > index d94cf84c3ee5..687ce6ccf3ae 100644
> > > > --- a/drivers/media/platform/chips-media/wave5/wave5-hw.c
> > > > +++ b/drivers/media/platform/chips-media/wave5/wave5-hw.c
> > > > @@ -102,7 +102,7 @@ static void _wave5_print_reg_err(struct
> > > > vpu_device
> > > *vpu_dev, u32 reg_fail_reason
> > > >  		dev_dbg(dev, "%s: queueing failure: 0x%x\n", func,
> reg_val);
> > > >  		break;
> > > >  	case WAVE5_SYSERR_RESULT_NOT_READY:
> > > > -		dev_err(dev, "%s: result not ready: 0x%x\n", func,
> > > reg_fail_reason);
> > > > +		dev_dbg(dev, "%s: result not ready: 0x%x\n", func,
> > > > +reg_fail_reason);
> > > >  		break;
> > > >  	case WAVE5_SYSERR_ACCESS_VIOLATION_HW:
> > > >  		dev_err(dev, "%s: access violation: 0x%x\n", func,
> > > > reg_fail_reason); diff --git
> > > > a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> > > > b/drivers/media/platform/chips- media/wave5/wave5-vpu-dec.c index
> > > > 32de43de1870..995234a3a6d6 100644
> > > > --- a/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> > > > +++ b/drivers/media/platform/chips-media/wave5/wave5-vpu-dec.c
> > > > @@ -347,13 +347,12 @@ static void
> > > > wave5_vpu_dec_finish_decode(struct
> > > vpu_instance *inst)
> > > >  	struct vb2_v4l2_buffer *dec_buf = NULL;
> > > >  	struct vb2_v4l2_buffer *disp_buf = NULL;
> > > >  	struct vb2_queue *dst_vq = v4l2_m2m_get_dst_vq(m2m_ctx);
> > > > -	struct queue_status_info q_status;
> > > >
> > > >  	dev_dbg(inst->dev->dev, "%s: Fetch output info from
> firmware.",
> > > > __func__);
> > > >
> > > >  	ret = wave5_vpu_dec_get_output_info(inst, &dec_info);
> > > >  	if (ret) {
> > > > -		dev_warn(inst->dev->dev, "%s: could not get output
> info.",
> > > __func__);
> > > > +		dev_dbg(inst->dev->dev, "%s: could not get output
> info.",
> > > > +__func__);
> > >
> > > Wouldn't it be better to check the return value to possibly
> > > differentiate some errors from something similar to EGAIN?
> > >
> >
> > In case of this, when get_result command is requested to VPU, there
> could be no output.
> > So it is not error case and EAGIAN is not proper because the
> > wave5_vpu_dec_finish_decode is triggered by PIC_Done interrupt.
> > So I think it is proper code.
> 
> My worry is that you are silencing possible problems. I checked further,
> it may return EINVAL, EIO and EAGAIN and whatever read_poll_timeout()
> returns on timeout. For most, this trace was noise, so now I agree with
> the change, but can you go find the EINVAL and add a warn_on around it,
> since this is a programming error, it would happen if dec_info is a null
> pointer. Can be patch in itself, before this one.
> 
> >
> >
> > > >  		v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);
> > > >  		return;
> > > >  	}
> > > > @@ -441,20 +440,6 @@ static void
> > > > wave5_vpu_dec_finish_decode(struct
> > > vpu_instance *inst)
> > > >  		}
> > > >  		spin_unlock_irqrestore(&inst->state_spinlock, flags);
> > > >  	}
> > > > -
> > > > -	/*
> > > > -	 * During a resolution change and while draining, the
> firmware may
> > > flush
> > > > -	 * the reorder queue regardless of having a matching decoding
> > > operation
> > > > -	 * pending. Only terminate the job if there are no more IRQ
> coming.
> > > > -	 */
> > > > -	wave5_vpu_dec_give_command(inst, DEC_GET_QUEUE_STATUS,
> &q_status);
> > > > -	if (q_status.report_queue_count == 0 &&
> > > > -	    (q_status.instance_queue_count == 0 ||
> > > dec_info.sequence_changed)) {
> > > > -		dev_dbg(inst->dev->dev, "%s: finishing job.\n",
> __func__);
> > > > -		pm_runtime_mark_last_busy(inst->dev->dev);
> > > > -		pm_runtime_put_autosuspend(inst->dev->dev);
> > > > -		v4l2_m2m_job_finish(inst->v4l2_m2m_dev, m2m_ctx);
> > > > -	}
> > > >  }
> > > >
> > > >  static int wave5_vpu_dec_querycap(struct file *file, void *fh,
> > > > struct v4l2_capability *cap) @@ -1146,8 +1131,8 @@ static int
> > > > write_to_ringbuffer(struct vpu_instance *inst, void *buffer,
> > > > size_t b
> > > >  static int fill_ringbuffer(struct vpu_instance *inst)
> > > >  {
> > > >  	struct v4l2_m2m_ctx *m2m_ctx = inst->v4l2_fh.m2m_ctx;
> > > > -	struct v4l2_m2m_buffer *buf, *n;
> > > > -	int ret;
> > > > +	struct vpu_src_buffer *vpu_buf;
> > > > +	int ret = 0;
> > > >
> > > >  	if (m2m_ctx->last_src_buf)  {
> > > >  		struct vpu_src_buffer *vpu_buf =
> > > > wave5_to_vpu_src_buf(m2m_ctx->last_src_buf);
> > > > @@ -1158,9 +1143,8 @@ static int fill_ringbuffer(struct
> > > > vpu_instance
> > > *inst)
> > > >  		}
> > > >  	}
> > > >
> > > > -	v4l2_m2m_for_each_src_buf_safe(m2m_ctx, buf, n) {
> > > > -		struct vb2_v4l2_buffer *vbuf = &buf->vb;
> > > > -		struct vpu_src_buffer *vpu_buf =
> wave5_to_vpu_src_buf(vbuf);
> > > > +	list_for_each_entry(vpu_buf, &inst->avail_src_bufs, list) {
> > > > +		struct vb2_v4l2_buffer *vbuf = &vpu_buf-
> >v4l2_m2m_buf.vb;
> > > >  		struct vpu_buf *ring_buffer = &inst->bitstream_vbuf;
> > > >  		size_t src_size = vb2_get_plane_payload(&vbuf->vb2_buf,
> 0);
> > > >  		void *src_buf = vb2_plane_vaddr(&vbuf->vb2_buf, 0); @@
> -
> > > 1220,9
> > > > +1204,13 @@ static int fill_ringbuffer(struct vpu_instance *inst)
> > > >  			dev_dbg(inst->dev->dev, "last src buffer
> written to
> > > the ring buffer\n");
> > > >  			break;
> > > >  		}
> > > > +
> > > > +		inst->queuing_num++;
> > > > +		list_del_init(&vpu_buf->list);
> > > > +		break;
> > > >  	}
> > > >
> > > > -	return 0;
> > > > +	return ret;
> > > >  }
> > > >
> > > >  static void wave5_vpu_dec_buf_queue_src(struct vb2_buffer *vb) @@
> > > > -1236,6 +1224,11 @@ static void wave5_vpu_dec_buf_queue_src(struct
> > > vb2_buffer *vb)
> > > >  	vbuf->sequence = inst->queued_src_buf_num++;
> > > >
> > > >  	v4l2_m2m_buf_queue(m2m_ctx, vbuf);
> > > > +
> > > > +	INIT_LIST_HEAD(&vpu_buf->list);
> > > > +	mutex_lock(&inst->feed_lock);
> > > > +	list_add_tail(&vpu_buf->list, &inst->avail_src_bufs);
> > > > +	mutex_unlock(&inst->feed_lock);
> > > >  }
> > > >
> > > >  static void wave5_vpu_dec_buf_queue_dst(struct vb2_buffer *vb) @@
> > > > -1385,6 +1378,13 @@ static int streamoff_output(struct vb2_queue *q)
> > > >  	dma_addr_t new_rd_ptr;
> > > >  	struct dec_output_info dec_info;
> > > >  	unsigned int i;
> > > > +	struct vpu_src_buffer *vpu_buf, *tmp;
> > > > +
> > > > +	inst->retry = false;
> > > > +	inst->queuing_num = 0;
> > > > +
> > > > +	list_for_each_entry_safe(vpu_buf, tmp, &inst->avail_src_bufs,
> list)
> > > > +		list_del_init(&vpu_buf->list);
> > > >
> > > >  	for (i = 0; i < v4l2_m2m_num_dst_bufs_ready(m2m_ctx); i++) {
> > > >  		ret = wave5_vpu_dec_set_disp_flag(inst, i); @@ -
> 1580,10
> > > +1580,19 @@
> > > > static void wave5_vpu_dec_device_run(void *priv)
> > > >
> > > >  	dev_dbg(inst->dev->dev, "%s: Fill the ring buffer with new
> > > bitstream data", __func__);
> > > >  	pm_runtime_resume_and_get(inst->dev->dev);
> > > > -	ret = fill_ringbuffer(inst);
> > > > -	if (ret) {
> > > > -		dev_warn(inst->dev->dev, "Filling ring buffer
> failed\n");
> > > > -		goto finish_job_and_return;
> > > > +	if (!inst->retry) {
> > > > +		mutex_lock(&inst->feed_lock);
> > > > +		ret = fill_ringbuffer(inst);
> > > > +		mutex_unlock(&inst->feed_lock);
> > > > +		if (ret < 0) {
> > > > +			dev_warn(inst->dev->dev, "Filling ring buffer
> > > failed\n");
> > > > +			goto finish_job_and_return;
> > > > +		} else if (!inst->eos &&
> > > > +				inst->queuing_num == 0 &&
> > > > +				inst->state == VPU_INST_STATE_PIC_RUN) {
> > > > +			dev_dbg(inst->dev->dev, "%s: no bitstream for
> feeding,
> > > so skip ", __func__);
> > > > +			goto finish_job_and_return;
> > > > +		}
> > > >  	}
> > > >
> > > >  	switch (inst->state) {
> > > > @@ -1639,7 +1648,7 @@ static void wave5_vpu_dec_device_run(void
> *priv)
> > > >  		}
> > > >
> > > >  		if (q_status.instance_queue_count) {
> > > > -			dev_dbg(inst->dev->dev, "%s: leave with active
> job",
> > > __func__);
> > > > +			v4l2_m2m_job_finish(inst->v4l2_m2m_dev,
> m2m_ctx);
> > > >  			return;
> > > >  		}
> > > >
> > > > @@ -1650,14 +1659,21 @@ static void wave5_vpu_dec_device_run(void
> *priv)
> > > >  			dev_err(inst->dev->dev,
> > > >  				"Frame decoding on m2m context (%p),
> fail: %d
> > > (result: %d)\n",
> > > >  				m2m_ctx, ret, fail_res);
> > > > -			break;
> > > > +			goto finish_job_and_return;
> > > > +		}
> > > > +
> > > > +		if (fail_res == WAVE5_SYSERR_QUEUEING_FAIL) {
> > > > +			inst->retry = true;
> > > > +		} else {
> > > > +			inst->retry = false;
> > > > +			if (!inst->eos)
> > > > +				inst->queuing_num--;
> > >
> > > I looked into the original state machine violation you had in
> > > previous version, and I got the impression that the reason you did
> > > hit that was that you actually call device_run passed inst->eos. Its
> > > probably not that simple in practice, but I think you forgot to
> > > adapt the job_ready() ops to prevent more device_run() called passed
> > > CMD_STOP and having all pending buffer written in the ring buffer.
> > >
> > > As a side effect, you endup calling device_run() in a race with the
> > > finish() setting the state to STOP. I really think there is a way to
> > > use
> > > inst->eos boolean to prevent that race in the first place. Might
> > > inst->need to
> > > be combined with checking if you have buffers prior to command stop
> > > that did not yet fit into the ring buffer.
> > >
> >
> >
> > The queuing_num is used to check if there is input data or not, so it
> was declared in the int type.
> > If there is no input data, then the device_run will be not called until
> queuing input data.
> > In case of eos sent, device_run should be ran continuously until getting
> eos from VPU, the code was needed.
> > If my answer is not correct, please let me know.
> 
> Running in loop anything is never the right approach. The device_run()
> should be run when a useful event occur and filtered by the job_ready()
> ops. I believe I'm proposing some hint how to solve this design issue. The
> issue is quite clear with the follow up patch trying to reduce the CPU
> usage due to spinning.



Thanks for your feedback.
But there is one thing to say to you.
After receiving EOS from application, we have to periodically run the device_run to send the DEC_PIC command so that VPU can trigger interrupt  until getting all decoded frames and EOS from Get_Result command.
So even if we sent EOS to VPU once, we should run the device_run function continuously, the above code was added. 
If the job_ready returns false to prevent running the device_run after sending EOS to VPU, then GStreamer pipeline will not be terminated normally because of not receiving all decoded frames.


Thanks


> 
> >
> >
> > > >  		}
> > > > -		/* Return so that we leave this job active */
> > > > -		dev_dbg(inst->dev->dev, "%s: leave with active job",
> > > __func__);
> > > > -		return;
> > > > -	default:
> > > > -		WARN(1, "Execution of a job in state %s illegal.\n",
> > > state_to_str(inst->state));
> > > >  		break;
> > > > +	default:
> > > > +		dev_dbg(inst->dev->dev, "Execution of a job in
> state %s
> > > illegal.\n",
> > > > +			state_to_str(inst->state));
> > > > +
> > > >  	}
> > > >
> > > >  finish_job_and_return:
> > > > @@ -1755,6 +1771,8 @@ static int wave5_vpu_open_dec(struct file
> *filp)
> > > >  	inst->ops = &wave5_vpu_dec_inst_ops;
> > > >
> > > >  	spin_lock_init(&inst->state_spinlock);
> > > > +	mutex_init(&inst->feed_lock);
> > > > +	INIT_LIST_HEAD(&inst->avail_src_bufs);
> > > >
> > > >  	inst->codec_info = kzalloc(sizeof(*inst->codec_info),
> GFP_KERNEL);
> > > >  	if (!inst->codec_info)
> > > > diff --git
> > > > a/drivers/media/platform/chips-media/wave5/wave5-vpuapi.c
> > > > b/drivers/media/platform/chips-media/wave5/wave5-
> > > > vpuapi.c
> > > > index e5e879a13e8b..68d86625538f 100644
> > > > --- a/drivers/media/platform/chips-media/wave5/wave5-vpuapi.c
> > > > +++ b/drivers/media/platform/chips-media/wave5/wave5-vpuapi.c
> > > > @@ -255,6 +255,8 @@ int wave5_vpu_dec_close(struct vpu_instance
> > > > *inst,
> > > u32 *fail_res)
> > > >  	if (inst_count == 1)
> > > >  		pm_runtime_dont_use_autosuspend(vpu_dev->dev);
> > > >
> > > > +	mutex_destroy(&inst->feed_lock);
> > > > +
> > > >  unlock_and_return:
> > > >  	mutex_unlock(&vpu_dev->hw_lock);
> > > >  	pm_runtime_put_sync(inst->dev->dev);
> > > > diff --git
> > > > a/drivers/media/platform/chips-media/wave5/wave5-vpuapi.h
> > > > b/drivers/media/platform/chips-media/wave5/wave5-
> > > > vpuapi.h
> > > > index f3c1ad6fb3be..fd0aef0bac4e 100644
> > > > --- a/drivers/media/platform/chips-media/wave5/wave5-vpuapi.h
> > > > +++ b/drivers/media/platform/chips-media/wave5/wave5-vpuapi.h
> > > > @@ -818,6 +818,9 @@ struct vpu_instance {
> > > >  	bool cbcr_interleave;
> > > >  	bool nv21;
> > > >  	bool eos;
> > > > +	bool retry; /* retry to feed bitstream if failure reason is
> > > WAVE5_SYSERR_QUEUEING_FAIL*/
> > > > +	int queuing_num; /* check if there is input buffer or not */
> > >
> > > This is described as a boolean, but is implemented as a counter.
> > > What does it count exactly ?
> > > I think it needs a better name too.
> > >
> >
> > Please refer to the above comment.
> 
> That won't work for me, I'm requesting you to correct the comment to
> properly say this is a counter.
> 
> Nicolas
> 
> >
> > Thanks
> > Jackson
> >
> >
> > > Nicolas
> > >
> > > > +	struct mutex feed_lock; /* lock for feeding bitstream
> buffers */
> > > >  	struct vpu_buf bitstream_vbuf;
> > > >  	dma_addr_t last_rd_ptr;
> > > >  	size_t remaining_consumed_bytes;