Message-ID: <5dcd603a-7d62-439d-9a07-9d7d9324e0b6@amd.com>
Date: Thu, 22 Aug 2024 09:05:53 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Lu Yao <yaolu@...inos.cn>, alexander.deucher@....com,
 christian.koenig@....com, Xinhui.Pan@....com, kenneth.feng@....com
Cc: lijo.lazar@....com, Hawking.Zhang@....com, andrealmeid@...lia.com,
 hamza.mahfooz@....com, candice.li@....com, victorchengchi.lu@....com,
 sunil.khatri@....com, Jun.Ma2@....com, kevinyang.wang@....com,
 Tim.Huang@....com, jesse.zhang@....com, amd-gfx@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/amdgpu: fix OLAND card ip_init failed during kdump
 caputrue kernel boot

On 7/23/2024 04:42, Lu Yao wrote:
> [Why]
> When running kdump test on a machine with R7340 card, a hang is caused due
> to the failure of 'amdgpu_device_ip_init()', error message as follows:
> 
>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <si_dpm> failed -22'
>    '[drm:uvd_v3_1_hw_init [amdgpu]] *ERROR* amdgpu: UVD Firmware validate fail (-22).'
>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v3_1> failed -22'
>    'amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed'
>    'amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init'
> 
> This is because the caputrue kernel does not power off when it starts,

Presumably you mean:
s/caputrue/capture/

> cause hardware status does not reset.
> 
> [How]
> Add 'is_kdump_kernel()' judgment.
> For 'si_dpm' block, use disable and then enable.
> For 'uvd_v3_1' block, skip loading during the initialization phase.
> 
> Signed-off-by: Lu Yao <yaolu@...inos.cn>
> ---
> During test, I first modified the 'amdgpu_device_ip_hw_init_phase*', make
> it does not end directly when a block hw_init failed.
> 
> After analysis, 'si_dpm' block failed at 'si_dpm_enable()->
> amdgpu_si_is_smc_running()', calling 'si_dpm_disable()' before can resolve.
> 'uvd_v3_1' block failed at 'uvd_v3_1_hw_init()->uvd_v3_1_fw_validate()',
> read mmUVD_FW_STATUS value is 0x27220102, I didn't find out why. But for
> caputrue kernel, UVD is not required. Therefore, don't added this block.

Hmm, a few thoughts.

1) Although you used this for the R7340, the concepts you're
identifying probably make sense on most AMD GPUs.  Such checks might be
better moved up to an earlier point in the IP discovery code.

2) I'd actually argue we don't want to have the kdump capture kernel do 
ANY hardware init.  You're going to lose hardware state which "could" be 
valuable information for debugging a problem that caused a panic.

That being said, I'm not really sure what framebuffer can drive the 
display across a kexec if you don't load amdgpu.  What actually happens 
if you blacklist amdgpu in the capture kernel?

What happens with your patch in place?

At least for me I'd like to see a kernel log from both cases.
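
To make point 1 concrete, here is a rough userspace sketch of what a single, common check at IP block registration time could look like, instead of one `is_kdump_kernel()` test per ASIC file. Everything prefixed `sim_` is hypothetical and only models the idea: `sim_in_kdump_kernel` stands in for the kernel's `is_kdump_kernel()`, `sim_ip_block_add()` stands in for `amdgpu_device_ip_block_add()`, and the policy of which block types a capture kernel can skip is an assumption, not something the driver defines today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for is_kdump_kernel(); a plain flag in this simulation. */
static bool sim_in_kdump_kernel;

/* Hypothetical, reduced set of IP block types. */
enum sim_ip_type { SIM_IP_COMMON, SIM_IP_SMC, SIM_IP_DCE, SIM_IP_UVD, SIM_IP_VCE };

struct sim_ip_block {
	enum sim_ip_type type;
	const char *name;
};

#define SIM_MAX_IP_BLOCKS 16

struct sim_device {
	const struct sim_ip_block *blocks[SIM_MAX_IP_BLOCKS];
	size_t num_blocks;
};

/* Assumed policy: multimedia blocks are not needed to capture a dump. */
static bool sim_block_needed_in_kdump(enum sim_ip_type type)
{
	switch (type) {
	case SIM_IP_UVD:
	case SIM_IP_VCE:
		return false;
	default:
		return true;
	}
}

/*
 * One central filter: returns 0 on success, including a deliberate skip
 * in the capture kernel, and -1 if the block table is full.
 */
static int sim_ip_block_add(struct sim_device *dev,
			    const struct sim_ip_block *blk)
{
	assert(dev != NULL && blk != NULL);

	if (sim_in_kdump_kernel && !sim_block_needed_in_kdump(blk->type))
		return 0;
	if (dev->num_blocks >= SIM_MAX_IP_BLOCKS)
		return -1;

	dev->blocks[dev->num_blocks++] = blk;
	return 0;
}
```

With a filter like this, the per-chip lists in `si_set_ip_blocks()` would stay unchanged and every ASIC family would get the same behaviour in the capture kernel.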

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>   drivers/gpu/drm/amd/amdgpu/si.c            | 6 ++++--
>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 6 ++++++
>   3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 137a88b8de45..52ebc24561c4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -50,6 +50,7 @@
>   #include <linux/hashtable.h>
>   #include <linux/dma-fence.h>
>   #include <linux/pci.h>
> +#include <linux/crash_dump.h>
>   
>   #include <drm/ttm/ttm_bo.h>
>   #include <drm/ttm/ttm_placement.h>
> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
> index 85235470e872..fc0daed1b829 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
> @@ -2739,7 +2739,8 @@ int si_set_ip_blocks(struct amdgpu_device *adev)
>   #endif
>   		else
>   			amdgpu_device_ip_block_add(adev, &dce_v6_0_ip_block);
> -		amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
> +		if (!is_kdump_kernel())
> +			amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
>   		/* amdgpu_device_ip_block_add(adev, &vce_v1_0_ip_block); */
>   		break;
>   	case CHIP_OLAND:
> @@ -2757,7 +2758,8 @@ int si_set_ip_blocks(struct amdgpu_device *adev)
>   #endif
>   		else
>   			amdgpu_device_ip_block_add(adev, &dce_v6_4_ip_block);
> -		amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
> +		if (!is_kdump_kernel())
> +			amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
>   		/* amdgpu_device_ip_block_add(adev, &vce_v1_0_ip_block); */
>   		break;
>   	case CHIP_HAINAN:
> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> index a1baa13ab2c2..8700a22ba809 100644
> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> @@ -1848,6 +1848,7 @@ static int si_calculate_sclk_params(struct amdgpu_device *adev,
>   static void si_thermal_start_smc_fan_control(struct amdgpu_device *adev);
>   static void si_fan_ctrl_set_default_mode(struct amdgpu_device *adev);
>   static void si_dpm_set_irq_funcs(struct amdgpu_device *adev);
> +static void si_dpm_disable(struct amdgpu_device *adev);
>   
>   static struct si_power_info *si_get_pi(struct amdgpu_device *adev)
>   {
> @@ -6811,6 +6812,11 @@ static int si_dpm_enable(struct amdgpu_device *adev)
>   	struct amdgpu_ps *boot_ps = adev->pm.dpm.boot_ps;
>   	int ret;
>   
> +	if (is_kdump_kernel()) {
> +		si_dpm_disable(adev);
> +		udelay(50);
> +	}
> +
>   	if (amdgpu_si_is_smc_running(adev))
>   		return -EINVAL;
>   	if (pi->voltage_control || si_pi->voltage_control_svi2)
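
For reference, the disable-then-enable behaviour in the last hunk can be modelled in userspace as below. This is only a sketch under stated assumptions: `struct sim_smc` and the `sim_dpm_*` functions are hypothetical stand-ins for the SMC state and for `si_dpm_disable()` / `si_dpm_enable()`, the "running" flag models what `amdgpu_si_is_smc_running()` reports, and the `udelay(50)` from the patch is left out. It also assumes that disabling really does stop the SMC, which is what the patch description reports. The -22 in the quoted log is `-EINVAL` on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the power-management controller state. */
struct sim_smc {
	bool running;      /* still true if the crashed kernel never powered off */
	bool dpm_enabled;
};

/* Models si_dpm_disable(): stop the controller and drop DPM state. */
static void sim_dpm_disable(struct sim_smc *smc)
{
	assert(smc != NULL);
	smc->dpm_enabled = false;
	smc->running = false;
}

/*
 * Models si_dpm_enable() with the patch applied: in the capture kernel,
 * clear any state inherited from the crashed kernel before the
 * "is the SMC already running?" check, which otherwise fails with -EINVAL.
 */
static int sim_dpm_enable(struct sim_smc *smc, bool in_kdump_kernel)
{
	assert(smc != NULL);

	if (in_kdump_kernel)
		sim_dpm_disable(smc);

	if (smc->running)
		return -EINVAL;

	smc->running = true;
	smc->dpm_enabled = true;
	return 0;
}
```

Calling `sim_dpm_enable()` on a controller left running by the previous kernel fails without the kdump path and succeeds with it, which matches the behaviour described in the patch notes.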

