Message-ID: <01414c31-82c8-4de7-920f-87020610580b@nvidia.com>
Date: Fri, 21 Mar 2025 22:05:15 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Bert Karwatzki <spasswolf@....de>, Alex Deucher <alexdeucher@...il.com>,
Kees Cook <kees@...nel.org>, Bjorn Helgaas <bhelgaas@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>, Andy Lutomirski <luto@...nel.org>,
linux-kernel@...r.kernel.org, amd-gfx@...ts.freedesktop.org
Subject: Re: commit 7ffb791423c7 breaks steam game
On 3/21/25 21:24, Ingo Molnar wrote:
>
> * Balbir Singh <balbirs@...dia.com> wrote:
>
>> On 3/20/25 20:01, Ingo Molnar wrote:
>>>
>>> * Balbir Singh <balbirs@...dia.com> wrote:
>>>
>>>> On 3/17/25 00:09, Bert Karwatzki wrote:
>>>>> This is related to the amdgpu.gttsize. My laptop has the maximum amount
>>>>> of memory (64G) and usually gttsize is half of main memory size. I just
>>>>> tested with cmdline="nokaslr amdgpu.gttsize=2048" and the problem does
>>>>> not occur. So I did some more testing with varying gttsize and got this
>>>>> for the built-in GPU
>>>>>
>>>>> 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
>>>>> Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
>>>>>
>>>>> (nokaslr is always enabled)
>>>>> gttsize   input behaviour
>>>>> 2048      GOOD
>>>>> 2064      GOOD
>>>>> 2080      SEMIBAD (i.e. noticeable input lag but not as bad as below)
>>>>> 3072      BAD
>>>>> 4096      BAD
>>>>> 8192      BAD
>>>>> 16384     BAD
>>>>>
>>>>> As the built-in GPU has ~512 VRAM there seem to be problems when gttsize >
>>>>> 4*VRAM so I tested for the discrete GPU with 8G of VRAM
>>>>> gttsize   input behaviour
>>>>> 49152     GOOD
>>>>> 64000     GOOD
>>>>>
>>>>> So for the discrete GPU increasing gttsize does not reproduce the bug.
>>>>>
>>>>
>>>> Very interesting, I am not a GTT expert, but with these experiments do you
>>>> find anything interesting in
>>>>
>>>> /sys/kernel/debug/x86/pat_memtype_list?
>>>>
>>>> It's weird that you don't see any issues in Xorg (Xfce), just the games.
>>>> Maybe we should get help from the amd-gfx experts to further diagnose/debug
>>>> the interaction of nokaslr with the game.
>>>
>>> So basically your commit:
>>>
>>> 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
>>>
>>> inflicts part of the effects of a 'nokaslr' boot command line option,
>>> and triggers the regression due to that?
>>>
>>> Or is there some other cause?
>>>
>>
>> You are right in your assessment of the root cause. Just to reiterate
>>
>> - nokaslr does not work with the iGPU, specifically for the games
>> mentioned
>>
>> - There is a workaround for the problem, which involves reducing the
>> amdgpu.gttsize
>>
>> - The patch exposes the system to nokaslr situation (effect) when
>> PCI_P2PDMA is enabled
>
> Note that every major x86 distro I checked enables CONFIG_PCI_P2PDMA=y
> and also keeps KASLR enabled, so the above qualifiers are immaterial in
> terms of user impact: it's a 100% certainty that distro kernels on
> these systems will regress under these games, right?
From what I understand, the impact is limited to the machine's integrated
GPU. The discrete card works without any problem. The impact scope is
described in
1. https://lore.kernel.org/lkml/146277bb0ecbb392d490683c424b8ae0dfa82838.camel@web.de/
2. https://lore.kernel.org/lkml/6b0c9a4d840757ee54b141ed26f4e81c3e4eaacf.camel@web.de/
>
> What is the importance of the original fix? I should have insisted on a
> fuller changelog, because it's rather thin on details:
>
> If the BAR address is beyond this limit, PCI peer to peer DMA
> mappings fail.
>
The issue I ran into was the following:

- On systems with PCI_P2PDMA enabled, the PCI BAR space for the devices
  was above the 10TiB limit set up by the direct physmap reduction in KASLR.
  Basically pci_p2pdma_add_resource() fails, because
  pci_resource_start(pdev, bar) lies outside of the direct physmap region
  and devm_memremap_pages() fails.
- The only way to enable PCI peer to peer DMA is to disable KASLR. This
  leads to a less than desirable security posture, so using PCI peer to peer
  DMA becomes a security tradeoff discussion.

This patch keeps KASLR enabled, with less entropy, but allows the
PCI peer to peer DMA mappings to succeed.
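
For illustration only, the range condition described above can be sketched
in plain userspace C. This is not the kernel's code: range_below_limit() and
DIRECT_MAP_LIMIT are hypothetical names, and the 10TiB value is simply the
limit quoted in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIB (1ULL << 40)

/* Assumption: the 10TiB direct physmap limit mentioned in this thread. */
#define DIRECT_MAP_LIMIT (10 * TIB)

/*
 * Hypothetical helper: returns true when the physical range
 * [start, start + len) lies entirely below 'limit'.
 * Written so that start + len cannot overflow.
 */
static bool range_below_limit(uint64_t start, uint64_t len, uint64_t limit)
{
	return len <= limit && start <= limit - len;
}

/* Hypothetical wrapper: would a BAR at 'start' fit under the limit? */
static bool bar_fits_direct_map(uint64_t start, uint64_t len)
{
	return range_below_limit(start, len, DIRECT_MAP_LIMIT);
}
```

With a made-up 256MiB BAR, bar_fits_direct_map(1 * TIB, 256ULL << 20) is
true, while the same BAR placed at 12 * TIB is rejected, which corresponds
to the failing case described above.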
> How frequently does this happen and what is the impact to users if this
> happens?
>
It happens 100% of the time when the BAR space lies beyond the 10TiB
limit.
> We might be forced to revert this change if it regresses other systems.
>
I am quite surprised that only a select few games are impacted and that
the rest of the ecosystem is not (from what I understand).

The issue is also exposed with nokaslr, so I think we should get the
underlying bug fixed: it exists on current kernels as well, but is hidden
by KASLR.

Looking at the dmesg logs shared by the user, there were no warnings/errors
that caught my attention.

Thanks,
Balbir