linux-kernel - Re: [PATCH] drm/amdgpu: add mb for si


Message-ID: <71571bdc-8310-502f-77b5-954f5efbff91@amd.com>
Date:   Thu, 24 Nov 2022 16:19:14 +0530
From:   "Lazar, Lijo" <lijo.lazar@....com>
To:     "Quan, Evan" <Evan.Quan@....com>,
        李真能 <lizhenneng@...inos.cn>,
        Michel Dänzer <michel.daenzer@...lbox.org>,
        "Koenig, Christian" <Christian.Koenig@....com>,
        "Deucher, Alexander" <Alexander.Deucher@....com>
Cc:     "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>,
        "Pan, Xinhui" <Xinhui.Pan@....com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: add mb for si



On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
> 
> 
> On 11/24/2022 3:34 PM, Quan, Evan wrote:
>> [AMD Official Use Only - General]
>>
>> Could the attached patch help?
>>
>> Evan
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces@...ts.freedesktop.org> On Behalf Of 李真能
>>> Sent: Friday, November 18, 2022 5:25 PM
>>> To: Michel Dänzer <michel.daenzer@...lbox.org>; Koenig, Christian
>>> <Christian.Koenig@....com>; Deucher, Alexander
>>> <Alexander.Deucher@....com>
>>> Cc: amd-gfx@...ts.freedesktop.org; Pan, Xinhui <Xinhui.Pan@....com>;
>>> linux-kernel@...r.kernel.org; dri-devel@...ts.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>>
>>>
>>> On 2022/11/18 17:18, Michel Dänzer wrote:
>>>> On 11/18/22 09:01, Christian König wrote:
>>>>> On 18.11.22 at 08:48, Zhenneng Li wrote:
>>>>>> During reboot tests on an arm64 platform, boot may fail, so add
>>>>>> this mb in smc.
>>>>>>
>>>>>> The error messages are as follows:
>>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
>>>>>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> 
> The issue is happening in late_init() which eventually does
> 
>      ret = si_thermal_enable_alert(adev, false);
> 
> Just before this, si_thermal_start_thermal_controller is called in 
> hw_init and that enables thermal alert.
> 
> Maybe the issue is with enable/disable of thermal alerts in quick 
> succession. Adding a delay inside si_thermal_start_thermal_controller 
> might help.
> 

On a second look, the temperature range is already set as part of
si_thermal_start_thermal_controller in hw_init:
https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780

There is no need to set it again here -

https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635

I think it is safe to remove the call from late_init altogether. Alex/Evan?

Thanks,
Lijo
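
The redundancy described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct fake_dev,
fake_hw_init(), fake_late_init() and simulate_boot() are hypothetical
stand-ins and do not exist in si_dpm.c. The model also assumes that the
late_init path disables the alert, programs the range again and then
re-enables the alert, which is a simplification of what this thread
describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the amdgpu driver. */
struct fake_dev {
	bool alert_enabled;  /* models the thermal alert state */
	int range_writes;    /* times the temperature range was programmed */
	int alert_toggles;   /* times the alert was switched on or off */
};

static int fake_enable_alert(struct fake_dev *d, bool enable)
{
	d->alert_enabled = enable;
	d->alert_toggles++;
	return 0;
}

static int fake_program_range(struct fake_dev *d)
{
	d->range_writes++;
	return 0;
}

/* Models hw_init: starting the thermal controller sets the range
 * and enables the alert. */
static int fake_hw_init(struct fake_dev *d)
{
	int ret = fake_program_range(d);

	if (ret)
		return ret;
	return fake_enable_alert(d, true);
}

/* Models late_init. With reprogram == true it repeats the range setup
 * (assumed order: alert off, program range, alert on). With
 * reprogram == false it models the proposal to drop that call. */
static int fake_late_init(struct fake_dev *d, bool reprogram)
{
	int ret;

	if (!reprogram)
		return 0;
	ret = fake_enable_alert(d, false);
	if (ret)
		return ret;
	ret = fake_program_range(d);
	if (ret)
		return ret;
	return fake_enable_alert(d, true);
}

/* Runs hw_init followed by late_init and returns the final state. */
struct fake_dev simulate_boot(bool reprogram_in_late_init)
{
	struct fake_dev d = { false, 0, 0 };
	int ret = fake_hw_init(&d);

	assert(ret == 0); /* the toy model never fails */
	ret = fake_late_init(&d, reprogram_in_late_init);
	assert(ret == 0);
	return d;
}
```

In this model, simulate_boot(true) programs the range twice and toggles
the alert three times, while simulate_boot(false) reaches the same final
state with one range write and one toggle. Whether the extra off/on
toggle is what fails on the real hardware is not established here.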

> Thanks,
> Lijo
> 
>>>>>> amdgpu_device_ip_late_init failed
>>>>>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>>>>> Memory barriers are not supposed to be sprinkled around like this; you
>>>>> need to give a detailed explanation of why this is necessary.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Zhenneng Li <lizhenneng@...inos.cn>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>> +    mb();
>>>>>> +
>>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>>             return true;
>>>> In particular, it makes no sense in this specific place, since it
>>>> cannot directly affect the values of rst & clk.
>>>
>>> I think so too.
>>>
>>> But when I run a reboot test on nine desktop machines, one or two
>>> machines may report this error after hundreds or thousands of reboots.
>>> At first I used msleep() instead of mb(). Both methods work, but I
>>> don't know what the root cause is.
>>>
>>> I used this method on another vendor's Oland card, and this error
>>> message was reported again.
>>>
>>> What could be the root cause?
>>>
>>> Test environment:
>>>
>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>>
>>> driver: amdgpu
>>>
>>> os: Ubuntu 20.04
>>>
>>> platform: arm64
>>>
>>> kernel: 5.4.18
>>>
>>>>