linux-kernel - Re: [PATCH v5 3/4] uapi: linux: add MT8188 AIE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <be84455d1ea9573c20da5e586e57f0fa18135f0c.camel@mediatek.com>
Date: Mon, 7 Apr 2025 03:57:56 +0000
From: CK Hu (胡俊光) <ck.hu@...iatek.com>
To: "linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
	AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>,
	"robh@...nel.org" <robh@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, "mchehab@...nel.org"
	<mchehab@...nel.org>, Bo Kong (孔波)
	<Bo.Kong@...iatek.com>, "linux-mediatek@...ts.infradead.org"
	<linux-mediatek@...ts.infradead.org>
CC: Zhaoyuan Chen (陈兆远)
	<zhaoyuan.chen@...iatek.com>, Teddy Chen (陳乾元)
	<Teddy.Chen@...iatek.com>, Project_Global_Chrome_Upstream_Group
	<Project_Global_Chrome_Upstream_Group@...iatek.com>
Subject: Re: [PATCH v5 3/4] uapi: linux: add MT8188 AIE

On Thu, 2025-04-03 at 15:38 +0800, bo.kong wrote:
> From: Bo Kong <Bo.Kong@...iatek.com>
> 
> Add AIE control related definitions.

Laurent has asked you to provide document of the UAPI.
If you have no idea what to write to this document,
I ask you question in this mail and you write the answer to the document.

> 
> Signed-off-by: Bo Kong <Bo.Kong@...iatek.com>
> ---

[snip]

> diff --git a/include/uapi/linux/mtk_aie_v4l2_controls.h b/include/uapi/linux/mtk_aie_v4l2_controls.h
> new file mode 100644
> index 000000000000..952dcdb23283
> --- /dev/null
> +++ b/include/uapi/linux/mtk_aie_v4l2_controls.h
> @@ -0,0 +1,308 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * AIE Controls Header
> + *
> + * Copyright (c) 2020 MediaTek Inc.
> + * Author: Bo Kong <bo.kong@...iatek.com>
> + */
> +
> +#ifndef __MTK_AIE_V4L2_CONTROLS_H__
> +#define __MTK_AIE_V4L2_CONTROLS_H__
> +
> +#include <linux/types.h>
> +
> +/*
> + * The base for the mediatek Face Detection driver controls.
> + * We reserve 16 controls for this driver.
> + * Each CID represents different stages of AIE, with different structures and functions
> + * and cannot be reused
> + */
> +#define V4L2_CID_USER_MTK_FD_BASE (V4L2_CID_USER_BASE + 0x1fd0)
> +#define V4L2_CID_MTK_AIE_INIT (V4L2_CID_USER_MTK_FD_BASE + 1)
> +#define V4L2_CID_MTK_AIE_PARAM (V4L2_CID_USER_MTK_FD_BASE + 2)
> +
> +#define MAX_FACE_NUM			1024
> +#define FLD_CUR_LANDMARK		11
> +#define FLD_MAX_FRAME			15
> +
> +/**
> + * struct v4l2_ctrl_aie_init - aie init parameters.
> + *
> + * @max_img_width: maximum width of the source image.
> + * @max_img_height: maximum height of the source image.
> + * @pyramid_width: maximum width of the base pyramid.
> + * @pyramid_height: maximum height of the base pyramid.
> + * @feature_threshold: The threshold for the face confidence.Range: 100 ~ 400.
> + *                     The larger the value,the lower the face recognition rate
> + */
> +struct v4l2_ctrl_aie_init {
> +	__u32 max_img_width;
> +	__u32 max_img_height;

max image width/height is hardware limitation. This should be stored in driver.
User space program should use some V4L2 query interface to get this information.

> +	__u32 pyramid_width;
> +	__u32 pyramid_height;

Ditto.

> +	__s32 feature_threshold;

How to decide this value?
I guess there is a tuning process to find a reasonable threshold.
Is there only one reasonable threshold for all scenario?
If so, write done this one threshold in driver and user space is not necessary to set this value to driver.
If the threshold would vary for different scenario, for example, 168 for indoor video and 258 for outdoor video,
add comment to describe this so user space program have to detect the scenario of input video.

> +};
> +
> +/**
> + * struct aie_roi_coordinate - aie roi parameters.
> + *
> + * @x1: x1 of the roi coordinate.
> + * @y1: y1 of the roi coordinate.
> + * @x2: x2 of the roi coordinate.
> + * @y2: y2 of the roi coordinate.
> + */
> +struct aie_roi_coordinate {
> +	__u32 x1;
> +	__u32 y1;
> +	__u32 x2;
> +	__u32 y2;
> +};
> +
> +/**
> + * struct aie_padding_size - aie padding parameters.
> + *
> + * @left: the size of padding left.
> + * @right: the size of padding right.
> + * @down: the size of padding below.
> + * @up: the size of padding above.
> + */
> +struct aie_padding_size {
> +	__u32 left;
> +	__u32 right;
> +	__u32 down;
> +	__u32 up;
> +};
> +
> +/**
> + * struct v4l2_fld_crop_rip_rop - aie fld parameters.
> + *
> + * @fld_in_crop_x1: x1 of the crop coordinate.
> + * @fld_in_crop_y1: y1 of the crop coordinate.
> + * @fld_in_crop_x2: x2 of the crop coordinate.
> + * @fld_in_crop_y2: y2 of the crop coordinate.
> + * @fld_in_rip: fld in rip.
> + * @fld_in_rop: fld in rop.
> + */
> +struct v4l2_fld_crop_rip_rop {
> +	__u32 fld_in_crop_x1;
> +	__u32 fld_in_crop_y1;
> +	__u32 fld_in_crop_x2;
> +	__u32 fld_in_crop_y2;
> +	__u32 fld_in_rip;
> +	__u32 fld_in_rop;
> +};
> +
> +/**
> + * struct fd_ret - aie fd result parameters.
> + *
> + * @anchor_x0: X coordinate of the top-left corner of each detected face.
> + * @anchor_x1: X coordinate of the bottom-right corner of each detected face.
> + * @anchor_y0: Y coordinate of the top-left corner of each detected face.
> + * @anchor_y1: Y coordinate of the bottom-right corner of each detected face.
> + * @rop_landmark_score0: Score for the first rotation pose landmark.
> + * @rop_landmark_score1: Score for the second rotation pose landmark.
> + * @rop_landmark_score2: Score for the third rotation pose landmark.
> + * @anchor_score: Score for the anchor points.
> + * @rip_landmark_score0: Score for the first rotation-invariant pose landmark.
> + * @rip_landmark_score1: Score for the second rotation-invariant pose landmark.
> + * @rip_landmark_score2: Score for the third rotation-invariant pose landmark.
> + * @rip_landmark_score3: Score for the fourth rotation-invariant pose landmark.
> + * @rip_landmark_score4: Score for the fifth rotation-invariant pose landmark.
> + * @rip_landmark_score5: Score for the sixth rotation-invariant pose landmark.
> + * @rip_landmark_score6: Score for the seventh rotation-invariant pose landmark.
> + * @face_result_index: Index of each detected face.
> + * @anchor_index: Index of the anchor points.
> + * @fd_partial_result: Partial face detection result.
> + */
> +struct fd_ret {
> +	__u16 anchor_x0[MAX_FACE_NUM];
> +	__u16 anchor_x1[MAX_FACE_NUM];
> +	__u16 anchor_y0[MAX_FACE_NUM];
> +	__u16 anchor_y1[MAX_FACE_NUM];

I don't know why use 'anchor' for face coordinate.
Maybe this is trivial for AI domain, but I think there should be some brief introduction for who don't understand AI.

> +	__s16 rop_landmark_score0[MAX_FACE_NUM];
> +	__s16 rop_landmark_score1[MAX_FACE_NUM];
> +	__s16 rop_landmark_score2[MAX_FACE_NUM];

I don't know what is landmark.
Why there are 'first', 'second', and 'third' landmark?
Maybe this is trivial for AI domain, but I think there should be some brief introduction for who don't understand AI.

> +	__s16 anchor_score[MAX_FACE_NUM];
> +	__s16 rip_landmark_score0[MAX_FACE_NUM];
> +	__s16 rip_landmark_score1[MAX_FACE_NUM];
> +	__s16 rip_landmark_score2[MAX_FACE_NUM];
> +	__s16 rip_landmark_score3[MAX_FACE_NUM];
> +	__s16 rip_landmark_score4[MAX_FACE_NUM];
> +	__s16 rip_landmark_score5[MAX_FACE_NUM];
> +	__s16 rip_landmark_score6[MAX_FACE_NUM];
> +	__u16 face_result_index[MAX_FACE_NUM];

What is this index?
If fd_pyramid0_num is 100, so anchor_x0[0] ~ anchor_x0[99] is valid.
I don't know why need this index? 

> +	__u16 anchor_index[MAX_FACE_NUM];

What is this index?

> +	__u32 fd_partial_result;

What is this result?
Describe more so that everyone could understand this.

> +};

It's better that one face information is placed together in DRAM because it's easy to cache hit.
Then this structure would reform as:

struct face {
	__u16 anchor_x0;
	__u16 anchor_x1;
	__u16 anchor_y0;
	__u16 anchor_y1;
...
	__u16 anchor_index;
;

struct fd_ret {
	struct face face_result[MAX_FACE_NUM];
	__u32 fd_partial_result;
};

> +
> +/**
> + * struct fd_result - Face detection results for different pyramid levels.
> + *
> + * @fd_pyramid0_num: Number of faces detected at pyramid level 0.
> + * @fd_pyramid1_num: Number of faces detected at pyramid level 1.
> + * @fd_pyramid2_num: Number of faces detected at pyramid level 2.
> + * @fd_total_num: Total number of faces detected across all pyramid levels.
> + * @pyramid0_result: Detection results for pyramid level 0.
> + * @pyramid1_result: Detection results for pyramid level 1.
> + * @pyramid2_result: Detection results for pyramid level 2.
> + */
> +struct fd_result {
> +	__u16 fd_pyramid0_num;
> +	__u16 fd_pyramid1_num;
> +	__u16 fd_pyramid2_num;
> +	__u16 fd_total_num;

fd_total_num is redundant.
User space program could use (fd_pyramid0_num + fd_pyramid1_num + fd_pyramid2_num) to get this.
So drop this.

> +	struct fd_ret pyramid0_result;
> +	struct fd_ret pyramid1_result;
> +	struct fd_ret pyramid2_result;
> +};
> +
> +/**
> + * struct attr_result - Attribute detection results parameters.
> + *
> + * @gender_ret: Gender detection results.
> + * @ethnicity_ret: Ethnicity detection results.
> + * @merged_age_ret: Merged age detection results.
> + * @merged_gender_ret: Merged gender detection results.
> + * @merged_is_indian_ret: Merged results indicating if the subject is Indian.
> + * @merged_ethnicity_ret: Merged ethnicity detection results.
> + */
> +struct attr_result {
> +	__s16 gender_ret[2][64];

Is the value 0 for woman and 1 for man?
Why the array is [2][64]?
Explain what the first index mean and what the second index mean.
attribute mode has no coordinate information, so we just know how many man and how many woman,
but we don't know which one (the position in video) is man or woman?

> +	__s16 ethnicity_ret[4][64];

0 for Chinese, 1 for Japanese, 2 for Korean?

> +	__s16 merged_age_ret[2];

Why the term 'merged'?

> +	__s16 merged_gender_ret[2];

Ditto.

> +	__s16 merged_is_indian_ret[2];

indian is India people or indigenous people of America or both?

> +	__s16 merged_ethnicity_ret[4];
> +};
> +
> +/**
> + * struct fld_landmark - FLD coordinates parameters.
> + *
> + * @x: X coordinate of the facial landmark.
> + * @y: Y coordinate of the facial landmark.
> + */
> +struct fld_landmark {
> +	__u16 x;
> +	__u16 y;
> +};
> +
> +/**
> + * struct fld_result - FLD detection results parameters.
> + *
> + * @fld_landmark: Array of facial landmarks, each with X and Y coordinates.
> + * @fld_out_rip: Output rotation-invariant pose value.
> + * @fld_out_rop: Output rotation pose value.
> + * @confidence: Confidence score of the facial landmark detection.
> + * @blinkscore: Blink score indicating the likelihood of eye blink.
> + */
> +struct fld_result {
> +	struct fld_landmark fld_landmark[FLD_CUR_LANDMARK];
> +	__u16 fld_out_rip;
> +	__u16 fld_out_rop;
> +	__u16 confidence;
> +	__s16 blinkscore;
> +};
> +
> +/**
> + * struct aie_enq_info - V4L2 Kernelspace parameters.
> + *
> + * @sel_mode: select a mode(FDMODE, ATTRIBUTEMODE, FLDMODE) for current fd.
> + *           FDMODE: Face Detection.
> + *           ATTRIBUTEMODE: Gender and ethnicity detection
> + *           FLDMODE: Locations of eyebrows, eyes, ears, nose,and mouth
> + * @src_img_fmt: source image format.
> + * @src_img_width: the width of the source image.
> + * @src_img_height: the height of the source image.
> + * @src_img_stride: the stride of the source image.
> + * @pyramid_base_width: pyramid is the size of resizer, the width of the base pyramid.
> + * @pyramid_base_height: pyramid is the size of resizer, the width of the base pyramid.
> + * @number_of_pyramid: number of pyramid, min: 1, max: 3.
> + * @rotate_degree: the rotate degree of the image.
> + * @en_roi: enable roi(roi is a box diagram that selects a rectangle in a picture).
> + *          when en_roi is enable, AIE will return a rectangle face detection result
> + * @src_roi: roi params.
> + * @en_padding: enable padding, this is only used on the hardware of yuv to rgb.
> + *              and has noting to do with fd_mode
> + * @src_padding: padding params.
> + * @freq_level: frequency level, Get value from user space enque.
> + * @fld_face_num: the number of faces in fld.
> + *                user space tells driver the number of detections.
> + * @fld_input: fld input params.
> + * @src_img_addr: Source image address.
> + * @src_img_addr_uv: Source image address for UV plane.
> + * @fd_out: Face detection results.
> + * @attr_out: Attribute detection results.
> + * @fld_out: Array of facial landmark detection results for multiple frames.
> + * @irq_status: Interrupt request status.
> + */
> +struct aie_enq_info {
> +	__u32 sel_mode;
> +	__u32 src_img_fmt;
> +	__u32 src_img_width;
> +	__u32 src_img_height;
> +	__u32 src_img_stride;
> +	__u32 pyramid_base_width;
> +	__u32 pyramid_base_height;
> +	__u32 number_of_pyramid;
> +	__u32 rotate_degree;
> +	int en_roi;
> +	struct aie_roi_coordinate src_roi;
> +	int en_padding;
> +	struct aie_padding_size src_padding;
> +	unsigned int freq_level;
> +	unsigned int fld_face_num;
> +	struct v4l2_fld_crop_rip_rop fld_input[FLD_MAX_FRAME];

Above information is set by user space, so driver is not necessary to return these information back to user space.
Drop these.

> +	__u32 src_img_addr;
> +	__u32 src_img_addr_uv;

If user space program need to access source image buffer, it should map it itself.
Do not pass this address information from driver.
Drop these.

> +	struct fd_result fd_out;
> +	struct attr_result attr_out;
> +	struct fld_result fld_out[FLD_MAX_FRAME];

Union these three structure because hardware would work in one mode in one time.

> +	__u32 irq_status;

User space program should not process irq_status.
So drop this.

> +};
> +
> +/**
> + * struct v4l2_ctrl_aie_param - V4L2 Userspace parameters.
> + *
> + * @fd_mode: select a mode(FDMODE, ATTRIBUTEMODE, FLDMODE) for current fd.
> + *           FDMODE: Face Detection.
> + *           ATTRIBUTEMODE: Gender and ethnicity detection
> + *           FLDMODE: Locations of eyebrows, eyes, ears, nose,and mouth
> + * @src_img_fmt: source image format.
> + * @src_img_width: the width of the source image.
> + * @src_img_height: the height of the source image.
> + * @src_img_stride: the stride of the source image.
> + * @pyramid_base_width: pyramid is the size of resizer, the width of the base pyramid.
> + * @pyramid_base_height: pyramid is the size of resizer, the width of the base pyramid.
> + * @number_of_pyramid: number of pyramid, min: 1, max: 3.
> + * @rotate_degree: the rotate degree of the image.
> + * @en_roi: enable roi(roi is a box diagram that selects a rectangle in a picture).
> + *          when en_roi is enable, AIE will return a rectangle face detection result
> + * @src_roi: roi params.
> + * @en_padding: enable padding, this is only used on the hardware of yuv to rgb.
> + *              and has noting to do with fd_mode
> + * @src_padding: padding params.
> + * @freq_level: frequency level, Get value from user space enque.
> + * @fld_face_num: the number of faces in fld.
> + *                user space tells driver the number of detections.
> + * @fld_input: fld input params.
> + */
> +struct v4l2_ctrl_aie_param {
> +	__u32 fd_mode;
> +	__u32 src_img_fmt;
> +	__u32 src_img_width;
> +	__u32 src_img_height;
> +	__u32 src_img_stride;

Use V4L2 standard interface to set source image format/width/height/stride.

> +	__u32 pyramid_base_width;
> +	__u32 pyramid_base_height;

Does pyramid width/height has limitation?
If source image resolution is 3840x2160, does pyramid resolution could be 3840x2160?
If there are hardware limitation, query this limitation from driver.

Some AI model just accept some certain resolution, for example 360x360.
In this case, any source resolution should resize to this resolution.
Does AIE need to resize to some fixed resolution? Or any resolution is acceptable?

> +	__u32 number_of_pyramid;

Why need multiple pyramid?
Does this AI model need the face resolution around 100x100?
If a face it 400x400, so it would not be detect so need to resize this face to multiple smaller resolution.
In first pyramid face resolution is 400x400.
In second pyramid face resolution is 200x200.
In third pyramid face resolution is 100x100.
So this face would be detected in third pyramid.

If this is right, add this information to document.
If not, add right information to document.

> +	__u32 rotate_degree;
> +	__s32 en_roi;
> +	struct aie_roi_coordinate src_roi;
> +	__s32 en_padding;

If padding has nothing to do with fd_mode, drop it.

> +	struct aie_padding_size src_padding;
> +	__u32 freq_level;

freq_level is FPS?
What is freq_level and how hardware work with this?
The comment says "Get value from user space enque", what's this?
Describe more detail about this.

> +	__u32 fld_face_num;
> +	struct v4l2_fld_crop_rip_rop fld_input[FLD_MAX_FRAME];

Does the fld mode need fd mode result?
The face_num is get from fd mode, and the fld_input coordinate is get from fd mode also.
If so, describe more about this.

Regards,
CK

> +};

> +
> +#endif /* __MTK_AIE_V4L2_CONTROLS_H__ */
> diff --git a/include/uapi/linux/videodev2.h b/include/uapi/linux/videodev2.h
> index c8cb2796130f..329bbac9239e 100644
> --- a/include/uapi/linux/videodev2.h
> +++ b/include/uapi/linux/videodev2.h
> @@ -858,6 +858,9 @@ struct v4l2_pix_format {
>  #define V4L2_META_FMT_RK_ISP1_STAT_3A	v4l2_fourcc('R', 'K', '1', 'S') /* Rockchip ISP1 3A Statistics */
>  #define V4L2_META_FMT_RK_ISP1_EXT_PARAMS	v4l2_fourcc('R', 'K', '1', 'E') /* Rockchip ISP1 3a Extensible Parameters */
>  
> +/* Vendor-specific definition: used for the MediaTek camera subsystem's face detection results */
> +#define V4L2_META_FMT_MTFD_RESULT	v4l2_fourcc('M', 'T', 'f', 'd')
> +
>  /* Vendor specific - used for RaspberryPi PiSP */
>  #define V4L2_META_FMT_RPI_BE_CFG	v4l2_fourcc('R', 'P', 'B', 'C') /* PiSP BE configuration */
>  #define V4L2_META_FMT_RPI_FE_CFG	v4l2_fourcc('R', 'P', 'F', 'C') /* PiSP FE configuration */
> @@ -1965,6 +1968,9 @@ enum v4l2_ctrl_type {
>  	V4L2_CTRL_TYPE_AV1_TILE_GROUP_ENTRY = 0x281,
>  	V4L2_CTRL_TYPE_AV1_FRAME	    = 0x282,
>  	V4L2_CTRL_TYPE_AV1_FILM_GRAIN	    = 0x283,
> +
> +	V4L2_CTRL_TYPE_AIE_INIT		= 0x0290,
> +	V4L2_CTRL_TYPE_AIE_PARAM	= 0x0291,
>  };
>  
>  /*  Used in the VIDIOC_QUERYCTRL ioctl for querying controls */