[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAd53p7sgowhaFS1b7MM0F0kf14sWf6jbF9T__=4BAMM8bnz3A@mail.gmail.com>
Date: Thu, 30 Mar 2023 11:36:15 +0800
From: Kai-Heng Feng <kai.heng.feng@...onical.com>
To: Mario Limonciello <mario.limonciello@....com>
Cc: alexander.deucher@....com, christian.koenig@....com,
Xinhui.Pan@....com, David Airlie <airlied@...il.com>,
Daniel Vetter <daniel@...ll.ch>,
Bokun Zhang <Bokun.Zhang@....com>,
Maxime Ripard <maxime@...no.tech>,
Tim Huang <tim.huang@....com>,
Jingyu Wang <jingyuwang_vip@....com>,
"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
Hans de Goede <hdegoede@...hat.com>,
Hawking Zhang <Hawking.Zhang@....com>,
Lijo Lazar <lijo.lazar@....com>,
Andrey Grodzovsky <andrey.grodzovsky@....com>,
YiPeng Chai <YiPeng.Chai@....com>,
Somalapuram Amaranath <Amaranath.Somalapuram@....com>,
Evan Quan <evan.quan@....com>,
Guchun Chen <guchun.chen@....com>,
Michel Dänzer <mdaenzer@...hat.com>,
Kenneth Feng <kenneth.feng@....com>,
Jiansong Chen <Jiansong.Chen@....com>,
amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] drm/amdgpu: Reset GPU on S0ix when device supports BOCO
On Wed, Mar 29, 2023 at 9:23 PM Mario Limonciello
<mario.limonciello@....com> wrote:
>
>
> On 3/29/23 04:59, Kai-Heng Feng wrote:
> > When the power is lost due to ACPI power resources being turned off, the
> > driver should reset the GPU so it can work anew.
> >
> > First, _PR3 support of the hierarchy needs to be found correctly. Since
> > the GPU on some discrete GFX cards is behind a PCIe switch, checking the
> > _PR3 on downstream port alone is not enough, as the _PR3 can associate
> > to the root port above the PCIe switch.
>
> I think this should be split into two commits:
>
> * One of them to look at _PR3 further up in hierarchy to fix indication
> for BOCO support.
Yes, this part can be split up.
>
> * One to adjust policy for whether to reset
IIUC, the GPU only needs to be reset when the power status isn't certain?
Assuming power resources in _PR3 are really disabled, GPU is already
reset by itself. That means reset shouldn't be necessary for D3cold,
am I understanding it correctly?
However, this is a desktop plugged with GFX card that has external
power, does that assumption still stand? Perform resetting on D3cold
can cover this scenario.
>
>
> > Once the _PR3 is found and BOCO support is correctly marked, use that
> > information to inform the GPU should be reset. This solves an issue that
> > system freeze on a Intel ADL desktop that uses S0ix for sleep and D3cold
> > is supported for the GFX slot.
>
> I'm worried this is still papering over an underlying issue with L0s
> handling on ALD + Navi1x/Navi2x.
Is it possible to get the ASIC's ASPM parameter under Windows? Knowing
the difference can be useful.
>
> Also, what about runtime suspend? If you unplug the monitor from this
> dGPU and interact with it over SSH it should go into runtime suspend.
>
> Is it working properly for that case now?
Thanks for the tip. Runtime resume doesn't work at all:
[ 1087.601631] pcieport 0000:00:01.0: power state changed by ACPI to D0
[ 1087.613820] pcieport 0000:00:01.0: restoring config space at offset
0x2c (was 0x43, writing 0x43)
[ 1087.613835] pcieport 0000:00:01.0: restoring config space at offset
0x28 (was 0x41, writing 0x41)
[ 1087.613841] pcieport 0000:00:01.0: restoring config space at offset
0x24 (was 0xfff10001, writing 0xfff10001)
[ 1087.613978] pcieport 0000:00:01.0: PME# disabled
[ 1087.613984] pcieport 0000:00:01.0: waiting 100 ms for downstream
link, after activation
[ 1089.330956] pcieport 0000:01:00.0: not ready 1023ms after resume; giving up
[ 1089.373036] pcieport 0000:01:00.0: Unable to change power state
from D3cold to D0, device inaccessible
After a short while the whole system froze.
So the upstream port of GFX's PCIe switch cannot be powered on again.
Kai-Heng
>
> >
> > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@...onical.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 3 +++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++-------
> > 3 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > index 60b1857f469e..407456ac0e84 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > @@ -987,6 +987,9 @@ bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
> > if (amdgpu_sriov_vf(adev))
> > return false;
> >
> > + if (amdgpu_device_supports_boco(adev_to_drm(adev)))
> > + return true;
> > +
> > #if IS_ENABLED(CONFIG_SUSPEND)
> > return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
> > #else
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index f5658359ff5c..d56b7a2bafa6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2181,7 +2181,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev)
> >
> > if (!(adev->flags & AMD_IS_APU)) {
> > parent = pci_upstream_bridge(adev->pdev);
> > - adev->has_pr3 = parent ? pci_pr3_present(parent) : false;
> > + do {
> > + if (pci_pr3_present(parent)) {
> > + adev->has_pr3 = true;
> > + break;
> > + }
> > + } while ((parent = pci_upstream_bridge(parent)));
> > }
> >
> > amdgpu_amdkfd_device_probe(adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index ba5def374368..5d81fcac4b0a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2415,10 +2415,11 @@ static int amdgpu_pmops_suspend(struct device *dev)
> > struct drm_device *drm_dev = dev_get_drvdata(dev);
> > struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >
> > - if (amdgpu_acpi_is_s0ix_active(adev))
> > - adev->in_s0ix = true;
> > - else if (amdgpu_acpi_is_s3_active(adev))
> > + if (amdgpu_acpi_is_s3_active(adev) ||
> > + amdgpu_device_supports_boco(drm_dev))
> > adev->in_s3 = true;
> > + else if (amdgpu_acpi_is_s0ix_active(adev))
> > + adev->in_s0ix = true;
> > if (!adev->in_s0ix && !adev->in_s3)
> > return 0;
> > return amdgpu_device_suspend(drm_dev, true);
> > @@ -2449,10 +2450,7 @@ static int amdgpu_pmops_resume(struct device *dev)
> > adev->no_hw_access = true;
> >
> > r = amdgpu_device_resume(drm_dev, true);
> > - if (amdgpu_acpi_is_s0ix_active(adev))
> > - adev->in_s0ix = false;
> > - else
> > - adev->in_s3 = false;
> > + adev->in_s0ix = adev->in_s3 = false;
> > return r;
> > }
> >
Powered by blists - more mailing lists