linux-kernel - Re: [PATCH 1/2] drm/amdgpu: Reset GPU on S0ix when device supports BOCO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <38340824-9060-c380-d595-c49ef6b8e4ac@amd.com>
Date:   Wed, 29 Mar 2023 23:20:03 -0500
From:   Mario Limonciello <mario.limonciello@....com>
To:     Kai-Heng Feng <kai.heng.feng@...onical.com>,
        Evan Quan <evan.quan@....com>
Cc:     alexander.deucher@....com, christian.koenig@....com,
        Xinhui.Pan@....com, David Airlie <airlied@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Bokun Zhang <Bokun.Zhang@....com>,
        Maxime Ripard <maxime@...no.tech>,
        Tim Huang <tim.huang@....com>,
        Jingyu Wang <jingyuwang_vip@....com>,
        "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        Hans de Goede <hdegoede@...hat.com>,
        Hawking Zhang <Hawking.Zhang@....com>,
        Lijo Lazar <lijo.lazar@....com>,
        Andrey Grodzovsky <andrey.grodzovsky@....com>,
        YiPeng Chai <YiPeng.Chai@....com>,
        Somalapuram Amaranath <Amaranath.Somalapuram@....com>,
        Guchun Chen <guchun.chen@....com>,
        Michel Dänzer <mdaenzer@...hat.com>,
        Kenneth Feng <kenneth.feng@....com>,
        Jiansong Chen <Jiansong.Chen@....com>,
        amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] drm/amdgpu: Reset GPU on S0ix when device supports
 BOCO


On 3/29/23 22:36, Kai-Heng Feng wrote:
> On Wed, Mar 29, 2023 at 9:23 PM Mario Limonciello
> <mario.limonciello@....com> wrote:
>>
>> On 3/29/23 04:59, Kai-Heng Feng wrote:
>>> When the power is lost due to ACPI power resources being turned off, the
>>> driver should reset the GPU so it can work anew.
>>>
>>> First, _PR3 support of the hierarchy needs to be found correctly. Since
>>> the GPU on some discrete GFX cards is behind a PCIe switch, checking the
>>> _PR3 on downstream port alone is not enough, as the _PR3 can associate
>>> to the root port above the PCIe switch.
>> I think this should be split into two commits:
>>
>> * One of them to look at _PR3 further up in hierarchy to fix indication
>> for BOCO support.
> Yes, this part can be split up.
>
>> * One to adjust policy for whether to reset
> IIUC, the GPU only needs to be reset when the power status isn't certain?
Yeah; think of reset as a particular case that all hardware isn't 
initialized.
> Assuming power resources in _PR3 are really disabled, GPU is already
> reset by itself. That means reset shouldn't be necessary for D3cold,
> am I understanding it correctly?
Can we see the acpidump for this system?
> However, this is a desktop plugged with GFX card that has external
> power, does that assumption still stand? Perform resetting on D3cold
> can cover this scenario.

Are you sure it's going to D3cold?  Or is it D3hot?  What does _S0W have 
in this design?

>>
>>> Once the _PR3 is found and BOCO support is correctly marked, use that
>>> information to inform the GPU should be reset. This solves an issue that
>>> system freeze on a Intel ADL desktop that uses S0ix for sleep and D3cold
>>> is supported for the GFX slot.
>> I'm worried this is still papering over an underlying issue with L0s
>> handling on ALD + Navi1x/Navi2x.
> Is it possible to get the ASIC's ASPM parameter under Windows? Knowing
> the difference can be useful.
Evan is in discussion with Windows guys about this issue too.
>
>> Also, what about runtime suspend?  If you unplug the monitor from this
>> dGPU and interact with it over SSH it should go into runtime suspend.
>>
>> Is it working properly for that case now?
> Thanks for the tip. Runtime resume doesn't work at all:
> [ 1087.601631] pcieport 0000:00:01.0: power state changed by ACPI to D0
> [ 1087.613820] pcieport 0000:00:01.0: restoring config space at offset
> 0x2c (was 0x43, writing 0x43)
> [ 1087.613835] pcieport 0000:00:01.0: restoring config space at offset
> 0x28 (was 0x41, writing 0x41)
> [ 1087.613841] pcieport 0000:00:01.0: restoring config space at offset
> 0x24 (was 0xfff10001, writing 0xfff10001)
> [ 1087.613978] pcieport 0000:00:01.0: PME# disabled
> [ 1087.613984] pcieport 0000:00:01.0: waiting 100 ms for downstream
> link, after activation
> [ 1089.330956] pcieport 0000:01:00.0: not ready 1023ms after resume; giving up
> [ 1089.373036] pcieport 0000:01:00.0: Unable to change power state
> from D3cold to D0, device inaccessible
>
> After a short while the whole system froze.
>
> So the upstream port of GFX's PCIe switch cannot be powered on again.

What is the state of the kernel patches while doing this test?
Specifically does this happen without amdgpu.aspm set?  Or this happens
no matter what?

>
> Kai-Heng
>
>>> Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
>>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
>>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458
>>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@...onical.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c   |  3 +++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  7 ++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 12 +++++-------
>>>    3 files changed, 14 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
>>> index 60b1857f469e..407456ac0e84 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
>>> @@ -987,6 +987,9 @@ bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
>>>        if (amdgpu_sriov_vf(adev))
>>>                return false;
>>>
>>> +     if (amdgpu_device_supports_boco(adev_to_drm(adev)))
>>> +             return true;
>>> +
>>>    #if IS_ENABLED(CONFIG_SUSPEND)
>>>        return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
>>>    #else
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f5658359ff5c..d56b7a2bafa6 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2181,7 +2181,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev)
>>>
>>>        if (!(adev->flags & AMD_IS_APU)) {
>>>                parent = pci_upstream_bridge(adev->pdev);
>>> -             adev->has_pr3 = parent ? pci_pr3_present(parent) : false;
>>> +             do {
>>> +                     if (pci_pr3_present(parent)) {
>>> +                             adev->has_pr3 = true;
>>> +                             break;
>>> +                     }
>>> +             } while ((parent = pci_upstream_bridge(parent)));
>>>        }
>>>
>>>        amdgpu_amdkfd_device_probe(adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index ba5def374368..5d81fcac4b0a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -2415,10 +2415,11 @@ static int amdgpu_pmops_suspend(struct device *dev)
>>>        struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>        struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>
>>> -     if (amdgpu_acpi_is_s0ix_active(adev))
>>> -             adev->in_s0ix = true;
>>> -     else if (amdgpu_acpi_is_s3_active(adev))
>>> +     if (amdgpu_acpi_is_s3_active(adev) ||
>>> +         amdgpu_device_supports_boco(drm_dev))
>>>                adev->in_s3 = true;
>>> +     else if (amdgpu_acpi_is_s0ix_active(adev))
>>> +             adev->in_s0ix = true;
>>>        if (!adev->in_s0ix && !adev->in_s3)
>>>                return 0;
>>>        return amdgpu_device_suspend(drm_dev, true);
>>> @@ -2449,10 +2450,7 @@ static int amdgpu_pmops_resume(struct device *dev)
>>>                adev->no_hw_access = true;
>>>
>>>        r = amdgpu_device_resume(drm_dev, true);
>>> -     if (amdgpu_acpi_is_s0ix_active(adev))
>>> -             adev->in_s0ix = false;
>>> -     else
>>> -             adev->in_s3 = false;
>>> +     adev->in_s0ix = adev->in_s3 = false;
>>>        return r;
>>>    }
>>>