linux-kernel - Re: [PATCH] drm/amdgpu: fix OLAND card ip_init failed during kdump caputrue kernel boot

Message-ID: <5dcd603a-7d62-439d-9a07-9d7d9324e0b6@amd.com>
Date: Thu, 22 Aug 2024 09:05:53 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Lu Yao <yaolu@...inos.cn>, alexander.deucher@....com,
 christian.koenig@....com, Xinhui.Pan@....com, kenneth.feng@....com
Cc: lijo.lazar@....com, Hawking.Zhang@....com, andrealmeid@...lia.com,
 hamza.mahfooz@....com, candice.li@....com, victorchengchi.lu@....com,
 sunil.khatri@....com, Jun.Ma2@....com, kevinyang.wang@....com,
 Tim.Huang@....com, jesse.zhang@....com, amd-gfx@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/amdgpu: fix OLAND card ip_init failed during kdump
 caputrue kernel boot

On 7/23/2024 04:42, Lu Yao wrote:
> [Why]
> When running kdump test on a machine with R7340 card, a hang is caused due
> to the failure of 'amdgpu_device_ip_init()', error message as follows:
> 
>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <si_dpm> failed -22'
>    '[drm:uvd_v3_1_hw_init [amdgpu]] *ERROR* amdgpu: UVD Firmware validate fail (-22).'
>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v3_1> failed -22'
>    'amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed'
>    'amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init'
> 
> This is because the caputrue kernel does not power off when it starts,

Presumably you mean:
s/caputrue/capture/

> cause hardware status does not reset.
> 
> [How]
> Add 'is_kdump_kernel()' judgment.
> For 'si_dpm' block, use disable and then enable.
> For 'uvd_v3_1' block, skip loading during the initialization phase.
> 
> Signed-off-by: Lu Yao <yaolu@...inos.cn>
> ---
> During test, I first modified the 'amdgpu_device_ip_hw_init_phase*', make
> it does not end directly when a block hw_init failed.
> 
> After analysis, 'si_dpm' block failed at 'si_dpm_enable()->
> amdgpu_si_is_smc_running()', calling 'si_dpm_disable()' before can resolve.
> 'uvd_v3_1' block failed at 'uvd_v3_1_hw_init()->uvd_v3_1_fw_validate()',
> read mmUVD_FW_STATUS value is 0x27220102, I didn't find out why. But for
> caputrue kernel, UVD is not required. Therefore, don't added this block.

Hmm, a few thoughts.

1) Although you used this for the R7340, the concepts you're
identifying probably make sense on most AMD GPUs.  Such checks might be
better upleveled to earlier in the IP discovery code.

2) I'd actually argue we don't want to have the kdump capture kernel do 
ANY hardware init.  You're going to lose hardware state which "could" be 
valuable information for debugging a problem that caused a panic.

That being said, I'm not really sure what framebuffer can drive the 
display across a kexec if you don't load amdgpu.  What actually happens 
if you blacklist amdgpu in the capture kernel?

What happens with your patch in place?

At least for me I'd like to see a kernel log from both cases.
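
To illustrate point 1, here is a minimal userspace sketch of what a single, central kdump check could look like, instead of one 'is_kdump_kernel()' test per ASIC case. Everything in it is a hypothetical stand-in: 'struct device_model', 'ip_block_needed_in_kdump()' and 'ip_block_add()' only model the idea and are not the amdgpu API, and the 'in_kdump' flag stands in for 'is_kdump_kernel()'. Which blocks a capture kernel can really do without is an assumption here, not something established in this thread.

```c
/* Userspace model only: these types and helpers are hypothetical
 * stand-ins for illustration, not the amdgpu driver API. */
#include <stdbool.h>
#include <stddef.h>

enum ip_type { IP_COMMON, IP_GMC, IP_IH, IP_SMC, IP_DCE, IP_GFX, IP_UVD, IP_VCE };

struct ip_block {
	enum ip_type type;
	const char *name;
};

#define MAX_IP_BLOCKS 16

struct device_model {
	const struct ip_block *blocks[MAX_IP_BLOCKS];
	size_t num_blocks;
	bool in_kdump;	/* stands in for is_kdump_kernel() */
};

/* Assumed policy: multimedia blocks are not needed to capture a dump. */
static bool ip_block_needed_in_kdump(enum ip_type type)
{
	switch (type) {
	case IP_UVD:
	case IP_VCE:
		return false;
	default:
		return true;
	}
}

/* One common entry point applies the policy for every ASIC.
 * Returns 0 if the block was added or deliberately skipped,
 * -1 if the block table is full. */
static int ip_block_add(struct device_model *dev, const struct ip_block *blk)
{
	if (dev->in_kdump && !ip_block_needed_in_kdump(blk->type))
		return 0;
	if (dev->num_blocks >= MAX_IP_BLOCKS)
		return -1;
	dev->blocks[dev->num_blocks++] = blk;
	return 0;
}
```

With a model like this, the per-chip setup code would keep calling the add helper unconditionally, and the capture-kernel policy would live in one place.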

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>   drivers/gpu/drm/amd/amdgpu/si.c            | 6 ++++--
>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 6 ++++++
>   3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 137a88b8de45..52ebc24561c4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -50,6 +50,7 @@
>   #include <linux/hashtable.h>
>   #include <linux/dma-fence.h>
>   #include <linux/pci.h>
> +#include <linux/crash_dump.h>
>   
>   #include <drm/ttm/ttm_bo.h>
>   #include <drm/ttm/ttm_placement.h>
> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
> index 85235470e872..fc0daed1b829 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
> @@ -2739,7 +2739,8 @@ int si_set_ip_blocks(struct amdgpu_device *adev)
>   #endif
>   		else
>   			amdgpu_device_ip_block_add(adev, &dce_v6_0_ip_block);
> -		amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
> +		if (!is_kdump_kernel())
> +			amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
>   		/* amdgpu_device_ip_block_add(adev, &vce_v1_0_ip_block); */
>   		break;
>   	case CHIP_OLAND:
> @@ -2757,7 +2758,8 @@ int si_set_ip_blocks(struct amdgpu_device *adev)
>   #endif
>   		else
>   			amdgpu_device_ip_block_add(adev, &dce_v6_4_ip_block);
> -		amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
> +		if (!is_kdump_kernel())
> +			amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
>   		/* amdgpu_device_ip_block_add(adev, &vce_v1_0_ip_block); */
>   		break;
>   	case CHIP_HAINAN:
> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> index a1baa13ab2c2..8700a22ba809 100644
> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> @@ -1848,6 +1848,7 @@ static int si_calculate_sclk_params(struct amdgpu_device *adev,
>   static void si_thermal_start_smc_fan_control(struct amdgpu_device *adev);
>   static void si_fan_ctrl_set_default_mode(struct amdgpu_device *adev);
>   static void si_dpm_set_irq_funcs(struct amdgpu_device *adev);
> +static void si_dpm_disable(struct amdgpu_device *adev);
>   
>   static struct si_power_info *si_get_pi(struct amdgpu_device *adev)
>   {
> @@ -6811,6 +6812,11 @@ static int si_dpm_enable(struct amdgpu_device *adev)
>   	struct amdgpu_ps *boot_ps = adev->pm.dpm.boot_ps;
>   	int ret;
>   
> +	if (is_kdump_kernel()) {
> +		si_dpm_disable(adev);
> +		udelay(50);
> +	}
> +
>   	if (amdgpu_si_is_smc_running(adev))
>   		return -EINVAL;
>   	if (pi->voltage_control || si_pi->voltage_control_svi2)
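
The ordering in the 'si_dpm_enable()' hunk above can be modelled in userspace as follows. This is only a sketch under stated assumptions: 'struct smc_model', 'dpm_disable()' and 'dpm_enable()' are hypothetical names, the 'running' flag stands in for whatever 'amdgpu_si_is_smc_running()' reads from the hardware, and the 'udelay(50)' from the patch is left out. It shows why stale state left by the crashed kernel yields -EINVAL (the -22 in the log) and why disabling first avoids that.

```c
/* Userspace model only: hypothetical stand-ins, not driver code. */
#include <errno.h>
#include <stdbool.h>

struct smc_model {
	bool running;	/* models the state amdgpu_si_is_smc_running() would see */
};

/* Models si_dpm_disable(): leave the SMC stopped. */
static void dpm_disable(struct smc_model *smc)
{
	smc->running = false;
}

/* Models the start of si_dpm_enable() with the patch applied. */
static int dpm_enable(struct smc_model *smc, bool in_kdump)
{
	/* The capture kernel inherits hardware state that was never
	 * reset, so clear it before the "already running" check. */
	if (in_kdump)
		dpm_disable(smc);

	if (smc->running)
		return -EINVAL;

	smc->running = true;
	return 0;
}
```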