linux-kernel - Re: [PATCH] drm/amdgpu: Raven: don't allow mixing GTT and VRAM

Message-ID: <46bdb101-11c6-46d4-8224-b17d1d356504@amd.com>
Date: Fri, 18 Jul 2025 17:02:31 -0400
From: Leo Li <sunpeng.li@....com>
To: Alex Deucher <alexdeucher@...il.com>, Brian Geffon <bgeffon@...gle.com>
CC: "Wentland, Harry" <Harry.Wentland@....com>, Alex Deucher
	<alexander.deucher@....com>, <christian.koenig@....com>, David Airlie
	<airlied@...il.com>, Simona Vetter <simona@...ll.ch>, Tvrtko Ursulin
	<tvrtko.ursulin@...lia.com>, Yunxiang Li <Yunxiang.Li@....com>, Lijo Lazar
	<lijo.lazar@....com>, Prike Liang <Prike.Liang@....com>, Pratap Nirujogi
	<pratap.nirujogi@....com>, Luben Tuikov <luben.tuikov@....com>,
	<amd-gfx@...ts.freedesktop.org>, <dri-devel@...ts.freedesktop.org>,
	<linux-kernel@...r.kernel.org>, Garrick Evans <garrick@...gle.com>, "Thadeu
 Lima de Souza Cascardo" <cascardo@...lia.com>, <stable@...r.kernel.org>
Subject: Re: [PATCH] drm/amdgpu: Raven: don't allow mixing GTT and VRAM



On 2025-07-18 16:07, Alex Deucher wrote:
> On Fri, Jul 18, 2025 at 1:57 PM Brian Geffon <bgeffon@...gle.com> wrote:
>>
>> On Thu, Jul 17, 2025 at 10:59 AM Alex Deucher <alexdeucher@...il.com> wrote:
>>>
>>> On Wed, Jul 16, 2025 at 8:13 PM Brian Geffon <bgeffon@...gle.com> wrote:
>>>>
>>>> On Wed, Jul 16, 2025 at 5:03 PM Alex Deucher <alexdeucher@...il.com> wrote:
>>>>>
>>>>> On Wed, Jul 16, 2025 at 12:40 PM Brian Geffon <bgeffon@...gle.com> wrote:
>>>>>>
>>>>>> On Wed, Jul 16, 2025 at 12:33 PM Alex Deucher <alexdeucher@...il.com> wrote:
>>>>>>>
>>>>>>> On Wed, Jul 16, 2025 at 12:18 PM Brian Geffon <bgeffon@...gle.com> wrote:
>>>>>>>>
>>>>>>>> Commit 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)")
>>>>>>>> allowed newer ASICs to mix GTT and VRAM; that change also noted that
>>>>>>>> some older boards, such as Stoney and Carrizo do not support this.
>>>>>>>> It appears that at least one additional ASIC does not support this which
>>>>>>>> is Raven.
>>>>>>>>
>>>>>>>> We observed this issue when migrating a device from a 5.4 to 6.6 kernel
>>>>>>>> and have confirmed that Raven also needs to be excluded from mixing GTT
>>>>>>>> and VRAM.
>>>>>>>
>>>>>>> Can you elaborate a bit on what the problem is?  For carrizo and
>>>>>>> stoney this is a hardware limitation (all display buffers need to be
>>>>>>> in GTT or VRAM, but not both).  Raven and newer don't have this
>>>>>>> limitation and we tested raven pretty extensively at the time.
>>>>>>
>>>>>> Thanks for taking the time to look. We have automated testing and a
>>>>>> few igt gpu tools tests failed and after debugging we found that
>>>>>> commit 81d0bcf99009 is what introduced the failures on this hardware
>>>>>> on 6.1+ kernels. The specific tests that fail are kms_async_flips and
>>>>>> kms_plane_alpha_blend; excluding Raven from this sharing of GTT and
>>>>>> VRAM buffers resolves the issue.
>>>>>
>>>>> + Harry and Leo
>>>>>
>>>>> This sounds like the memory placement issue we discussed last week.
>>>>> In that case, the issue is related to where the buffer ends up when we
>>>>> try to do an async flip.  In that case, we can't do an async flip
>>>>> without a full modeset if the buffer locations are different than the
>>>>> last modeset because we need to update more than just the buffer base
>>>>> addresses.  This change works around that limitation by always forcing
>>>>> display buffers into VRAM or GTT.  Adding raven to this case may fix
>>>>> those tests but will make the overall experience worse because we'll
>>>>> end up effectively not being able to fully utilize both GTT and
>>>>> VRAM for display, which would reintroduce all of the problems fixed by
>>>>> 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)").
>>>>
>>>> Thanks Alex, the thing is, we only observe this on Raven boards. Why
>>>> would only Raven be impacted by this? It would seem that all devices
>>>> would have this issue, no? Also, I'm not familiar with how
>>>
>>> It depends on memory pressure and available memory in each pool.
>>> E.g., initially the display buffer is in VRAM when the initial mode
>>> set happens.  The watermarks, etc. are set for that scenario.  One of
>>> the next frames ends up in a pool different than the original.  Now
>>> the buffer is in GTT.  The async flip interface does a fast validation
>>> to try and flip as soon as possible, but that validation fails because
>>> the watermarks need to be updated which requires a full modeset.

Huh, I'm not sure if this actually is an issue for APUs. The fix that introduced
a check for same memory placement on async flips was on a system with a DGPU,
for which VRAM placement does matter:
https://github.com/torvalds/linux/commit/a7c0cad0dc060bb77e9c9d235d68441b0fc69507

Looking around in DM/DML, for APUs, I don't see any logic that changes DCN
bandwidth validation depending on memory placement. There's a gpuvm_enable flag
for SG, but it's statically set to 1 on APU DCN versions. It sounds like for
APUs specifically, we *should* be able to ignore the mem placement check. I can
spin up a patch to test this out.
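
To make the idea concrete, here is a minimal sketch of what relaxing the
placement check could look like. It is a standalone model, not the DM code:
`struct flip_request`, `enum mem_domain`, and `async_flip_placement_ok()` are
hypothetical names invented for illustration, and the APU exemption encodes
the untested assumption above that placement does not affect DCN bandwidth
validation on APUs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the real amdgpu/DM types. */
enum mem_domain { MEM_DOMAIN_VRAM, MEM_DOMAIN_GTT };

struct flip_request {
	bool is_apu;                /* integrated GPU, as opposed to a dGPU */
	enum mem_domain old_domain; /* buffer placement at the last full modeset */
	enum mem_domain new_domain; /* placement of the buffer being flipped to */
};

/* Returns true if the async flip may take the fast path. */
static bool async_flip_placement_ok(const struct flip_request *req)
{
	/*
	 * Assumption to be tested: on APUs, memory placement does not change
	 * bandwidth validation, so a placement change needs no full modeset.
	 */
	if (req->is_apu)
		return true;

	/* On a dGPU, a placement change still forces the slow path. */
	return req->old_domain == req->new_domain;
}
```

With this model, a VRAM-to-GTT change is accepted on an APU and rejected on a
dGPU, while a flip that keeps the same placement is accepted on both.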

Thanks,
Leo

>>>
>>> It's tricky to fix because you don't want to use the worst case
>>> watermarks all the time because that will limit the number of available
>>> display options and you don't want to force everything to a particular
>>> memory pool because that will limit the amount of memory that can be
>>> used for display (which is what the patch in question fixed).  Ideally
>>> the caller would do a test commit before the page flip to determine
>>> whether or not it would succeed before issuing it and then we'd have
>>> some feedback mechanism to tell the caller that the commit would fail
>>> due to buffer placement so it would do a full modeset instead.  We
>>> discussed this feedback mechanism last week at the display hackfest.
>>>
>>>
>>>> kms_plane_alpha_blend works, but does this also support that test
>>>> failing as the cause?
>>>
>>> That may be related.  I'm not too familiar with that test either, but
>>> Leo or Harry can provide some guidance.
>>>
>>> Alex
>>
>> Thanks everyone for the input so far. I have a question for the
>> maintainers, given that it seems that this is functionally broken for
>> ASICs which are iGPUs, and there does not seem to be an easy fix, does
>> it make sense to extend this proposed patch to all iGPUs until a more
>> permanent fix can be identified? At the end of the day I'll take
>> functional correctness over performance.
> 
> It's not functional correctness, it's usability.  All that is
> potentially broken is async flips (which depend on memory pressure and
> buffer placement), while if you effectively revert the patch, you end
> up limiting all display buffers to either VRAM or GTT, which may end
> up causing the inability to display anything because there is not
> enough memory in that pool for the next modeset.  We'll start getting
> bug reports about blank screens and failure to set modes because of
> memory pressure.  I think if we want a short term fix, it would be to
> always set the worst case watermarks.  The downside to that is that it
> would possibly cause some working display setups to stop working if
> they were on the margins to begin with.
> 
> Alex
> 
>>
>> Brian
>>
>>>
>>>>
>>>> Thanks again,
>>>> Brian
>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>>
>>>>>>>> Fixes: 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)")
>>>>>>>> Cc: Luben Tuikov <luben.tuikov@....com>
>>>>>>>> Cc: Christian König <christian.koenig@....com>
>>>>>>>> Cc: Alex Deucher <alexander.deucher@....com>
>>>>>>>> Cc: stable@...r.kernel.org # 6.1+
>>>>>>>> Tested-by: Thadeu Lima de Souza Cascardo <cascardo@...lia.com>
>>>>>>>> Signed-off-by: Brian Geffon <bgeffon@...gle.com>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> index 73403744331a..5d7f13e25b7c 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> @@ -1545,7 +1545,8 @@ uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
>>>>>>>>                                             uint32_t domain)
>>>>>>>>  {
>>>>>>>>         if ((domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) &&
>>>>>>>> -           ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY))) {
>>>>>>>> +           ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY) ||
>>>>>>>> +            (adev->asic_type == CHIP_RAVEN))) {
>>>>>>>>                 domain = AMDGPU_GEM_DOMAIN_VRAM;
>>>>>>>>                 if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD)
>>>>>>>>                         domain = AMDGPU_GEM_DOMAIN_GTT;
>>>>>>>> --
>>>>>>>> 2.50.0.727.gbf7dc18ff4-goog
>>>>>>>>
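
For readers following the hunk above, the domain selection it proposes can be
modelled outside the kernel as a rough sketch. Everything here is a stand-in:
`struct device_model`, `enum asic`, `preferred_domain()`, and the `DOMAIN_*`
and `SG_THRESHOLD` values are assumptions for illustration, not the kernel's
definitions. The final `return domain` is also an assumption, since the end of
the function is not shown in the hunk.

```c
#include <stdint.h>

/* Illustrative values for this sketch only; the kernel defines its own. */
#define DOMAIN_GTT   0x2u
#define DOMAIN_VRAM  0x4u
#define SG_THRESHOLD (256ull * 1024 * 1024)

/* Hypothetical stand-ins for adev->asic_type and adev->gmc.real_vram_size. */
enum asic { ASIC_CARRIZO, ASIC_STONEY, ASIC_RAVEN, ASIC_OTHER };

struct device_model {
	enum asic asic_type;
	uint64_t real_vram_size; /* bytes */
};

static uint32_t preferred_domain(const struct device_model *dev, uint32_t domain)
{
	/* Only a request that allows both pools is narrowed to a single pool. */
	if (domain == (DOMAIN_VRAM | DOMAIN_GTT) &&
	    (dev->asic_type == ASIC_CARRIZO || dev->asic_type == ASIC_STONEY ||
	     dev->asic_type == ASIC_RAVEN)) {
		domain = DOMAIN_VRAM;
		/* Small VRAM: place display buffers in GTT instead. */
		if (dev->real_vram_size <= SG_THRESHOLD)
			domain = DOMAIN_GTT;
	}
	return domain; /* assumed tail of the function, not shown in the hunk */
}
```

In this model a listed ASIC never gets a mixed placement: it gets GTT when
VRAM is at or below the threshold and VRAM otherwise. Any other ASIC keeps the
mixed request unchanged, which is the behaviour the thread argues should be
preserved for Raven as well.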