linux-kernel - Re: commit 7ffb791423c7 breaks steam game

Message-ID: <2eb17574-01e3-4608-a16e-56678781e32a@nvidia.com>
Date: Tue, 25 Mar 2025 08:43:54 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Bert Karwatzki <spasswolf@....de>
Cc: Ingo Molnar <mingo@...nel.org>, Kees Cook <kees@...nel.org>,
 Bjorn Helgaas <bhelgaas@...gle.com>,
 Linus Torvalds <torvalds@...ux-foundation.org>,
 Peter Zijlstra <peterz@...radead.org>, Andy Lutomirski <luto@...nel.org>,
 Christian König <christian.koenig@....com>,
 Alex Deucher <alexander.deucher@....com>, linux-kernel@...r.kernel.org,
 amd-gfx@...ts.freedesktop.org
Subject: Re: commit 7ffb791423c7 breaks steam game

On 3/24/25 22:23, Bert Karwatzki wrote:
> On Sunday, 23.03.2025 at 17:51 +1100, Balbir Singh wrote:
>> On 3/22/25 23:23, Bert Karwatzki wrote:
>>> The problem occurs in this part of ttm_tt_populate(). In the nokaslr case
>>> the loop is entered and run repeatedly because ttm_dma32_pages_allocated exceeds
>>> ttm_dma32_pages_limit, which leads to lots of calls to ttm_global_swapout().
>>>
>>> if (!strcmp(get_current()->comm, "stellaris"))
>>> 	printk(KERN_INFO "%s: ttm_pages_allocated=0x%llx ttm_pages_limit=0x%lx ttm_dma32_pages_allocated=0x%llx ttm_dma32_pages_limit=0x%lx\n",
>>> 			__func__, ttm_pages_allocated.counter, ttm_pages_limit, ttm_dma32_pages_allocated.counter, ttm_dma32_pages_limit);
>>> while (atomic_long_read(&ttm_pages_allocated) > ttm_pages_limit ||
>>>        atomic_long_read(&ttm_dma32_pages_allocated) >
>>>        ttm_dma32_pages_limit) {
>>>
>>> 	if (!strcmp(get_current()->comm, "stellaris"))
>>> 		printk(KERN_INFO "%s: count=%d ttm_pages_allocated=0x%llx ttm_pages_limit=0x%lx ttm_dma32_pages_allocated=0x%llx ttm_dma32_pages_limit=0x%lx\n",
>>> 				__func__, count++, ttm_pages_allocated.counter, ttm_pages_limit, ttm_dma32_pages_allocated.counter, ttm_dma32_pages_limit);
>>> 	ret = ttm_global_swapout(ctx, GFP_KERNEL);
>>> 	if (ret == 0)
>>> 		break;
>>> 	if (ret < 0)
>>> 		goto error;
>>> }
>>>
>>> Without nokaslr, ttm_dma32_pages_allocated is 0 because
>>> use_dma32 == false in that case.
>>>
>>> So why is use_dma32 enabled with nokaslr? Some more printk()s give this result:
>>>
>>> The GPUs:
>>> built-in:
>>> 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
>>> discrete:
>>> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
>>>
>>> With nokaslr:
>>> [    1.266517] [    T328] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>> [    1.266519] [    T328] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [    1.266520] [    T328] dma_direct_all_ram_mapped: returning true
>>> [    1.266521] [    T328] dma_addressing_limited: returning ret = 0
>>> [    1.266521] [    T328] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [    1.266525] [    T328] entering ttm_device_init, use_dma32 = 0
>>> [    1.267115] [    T328] entering ttm_pool_init, use_dma32 = 0
>>>
>>> [    3.965669] [    T328] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0x3fffffffffff
>>> [    3.965671] [    T328] dma_addressing_limited: returning true
>>> [    3.965672] [    T328] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 1
>>> [    3.965674] [    T328] entering ttm_device_init, use_dma32 = 1
>>> [    3.965747] [    T328] entering ttm_pool_init, use_dma32 = 1
>>>
>>> Without nokaslr:
>>> [    1.300907] [    T351] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffff
>>> [    1.300909] [    T351] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [    1.300910] [    T351] dma_direct_all_ram_mapped: returning true
>>> [    1.300910] [    T351] dma_addressing_limited: returning ret = 0
>>> [    1.300911] [    T351] amdgpu 0000:03:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [    1.300915] [    T351] entering ttm_device_init, use_dma32 = 0
>>> [    1.301210] [    T351] entering ttm_pool_init, use_dma32 = 0
>>>
>>> [    4.000602] [    T351] dma_addressing_limited: mask = 0xfffffffffff bus_dma_limit = 0x0 required_mask = 0xfffffffffff
>>> [    4.000603] [    T351] dma_addressing_limited: ops = 0000000000000000 use_dma_iommu(dev) = 0
>>> [    4.000604] [    T351] dma_direct_all_ram_mapped: returning true
>>> [    4.000605] [    T351] dma_addressing_limited: returning ret = 0
>>> [    4.000606] [    T351] amdgpu 0000:08:00.0: amdgpu: amdgpu_ttm_init: calling ttm_device_init() with use_dma32 = 0
>>> [    4.000610] [    T351] entering ttm_device_init, use_dma32 = 0
>>> [    4.000687] [    T351] entering ttm_pool_init, use_dma32 = 0
>>>
>>> So with nokaslr the required mask for the built-in GPU changes from 0xfffffffffff
>>> to 0x3fffffffffff. This causes dma_addressing_limited() to return true, which in turn
>>> causes ttm_device_init() to be called with use_dma32 = true.
>>
>> Thanks, this is really the root cause, from what I understand.
>>
>>> It also shows that nothing changes for the discrete GPU, so the bug does not occur
>>> there.
>>>
>>> I also was able to work around the bug by calling ttm_device_init() with use_dma32=false
>>> from amdgpu_ttm_init()  (drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c) but I'm not sure if this
>>> has unwanted side effects.
>>>
>>> int amdgpu_ttm_init(struct amdgpu_device *adev)
>>> {
>>> 	uint64_t gtt_size;
>>> 	int r;
>>>
>>> 	mutex_init(&adev->mman.gtt_window_lock);
>>>
>>> 	dma_set_max_seg_size(adev->dev, UINT_MAX);
>>> 	/* No others user of address space so set it to 0 */
>>> 	dev_info(adev->dev, "%s: calling ttm_device_init() with use_dma32 = 0 ignoring %d\n", __func__, dma_addressing_limited(adev->dev));
>>> 	r = ttm_device_init(&adev->mman.bdev, &amdgpu_bo_driver, adev->dev,
>>> 			       adev_to_drm(adev)->anon_inode->i_mapping,
>>> 			       adev_to_drm(adev)->vma_offset_manager,
>>> 			       adev->need_swiotlb,
>>> 			       false /* use_dma32 */);
>>> 	if (r) {
>>> 		DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
>>> 		return r;
>>> 	}
>>>
>>
>> I think this brings us really close. Instead of forcing use_dma32 to false, I wonder if we need something like
>>
>> uint64_t dma_bits = fls64(dma_get_mask(adev->dev));
>>
>> and then, in the call to ttm_device_init(), pass the last argument (use_dma32) as dma_bits < 32?
>>
>>
>> Thanks,
>> Balbir Singh
>>
> 
> Do these address bits have to shift when using nokaslr or PCI_P2PDMA? I think
> this shift causes the increase of the required_dma_mask to 0x3fffffffffff.
> 

That depends on the DMA ops. As per dma-api.rst:

"
	dma_get_required_mask(struct device *dev)

This API returns the mask that the platform requires to
operate efficiently.  Usually this means the returned mask
is the minimum required to cover all of memory."

I think the assumption that dma_addressing_limited() returning true
(because the device's dma_mask is shorter than required_mask) implies
use_dma32 = true is incorrect.
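
To make the distinction concrete, here is a minimal userspace sketch in C
using the mask values from the logs quoted above. It is only a model, not
kernel code: fls64_sketch() stands in for the kernel's fls64(),
addressing_limited_sketch() reproduces only the mask comparison visible in
the logs (it ignores the dma ops and dma_direct_all_ram_mapped() paths), and
use_dma32_from_bits() is a hypothetical helper that implements the
"dma_bits < 32" idea from earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's fls64(): 1-based position of the
 * highest set bit, or 0 when x == 0. */
static int fls64_sketch(uint64_t x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Smaller of two values, ignoring zeroes (0 means "no limit" here). */
static uint64_t min_not_zero_u64(uint64_t a, uint64_t b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

/* Simplified model (assumption): "limited" only means the effective device
 * mask is smaller than the mask the platform asks for. */
static bool addressing_limited_sketch(uint64_t mask, uint64_t bus_dma_limit,
				      uint64_t required_mask)
{
	return min_not_zero_u64(mask, bus_dma_limit) < required_mask;
}

/* Hypothetical helper: decide use_dma32 from the width of the device mask,
 * exactly as written in the proposal (dma_bits < 32). */
static bool use_dma32_from_bits(uint64_t mask)
{
	return fls64_sketch(mask) < 32;
}

/* Values taken from the built-in GPU (0000:08:00.0) log lines. */
static void check_builtin_gpu_values(void)
{
	/* nokaslr: 44-bit mask vs. 46-bit required mask, so "limited" */
	assert(addressing_limited_sketch(0xfffffffffffULL, 0,
					 0x3fffffffffffULL));
	/* without nokaslr: equal masks, not limited by this comparison */
	assert(!addressing_limited_sketch(0xfffffffffffULL, 0,
					  0xfffffffffffULL));
	/* a 44-bit mask does not call for DMA32 under the proposal */
	assert(!use_dma32_from_bits(0xfffffffffffULL));
}
```

With the nokaslr values, the comparison reports "limited" while the
width-based check still says DMA32 is not needed, which is the gap described
above. Note that a mask of exactly 0xffffffff gives a width of 32, so whether
the test should be "< 32" or "<= 32" is left open in this sketch.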



> @@ -104,4 +104,4 @@
>        fe30300000-fe303fffff : 0000:04:00.0
>      fe30400000-fe30403fff : 0000:04:00.0
>      fe30404000-fe30404fff : 0000:04:00.0
> -afe00000000-affffffffff : 0000:03:00.0
> +3ffe00000000-3fffffffffff : 0000:03:00.0
> 
> And what memory is this? It's 8G in size so it could be the RAM of the discrete
> GPU (which is at PCI 0000:03:00.0), but that is already here (part of
> /proc/iomem):
> 
> 

I think the mask is independent of what is mapped there; all it says is
that it needs to address up to 46 bits.
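
For reference, the bit widths involved can be checked with a few lines of
userspace C. fls64_sketch() is a stand-in for the kernel's fls64(), and
covering_mask() is a hypothetical helper that returns the smallest all-ones
mask covering a given address. Applying it to the end addresses from the
/proc/iomem diff quoted above is purely arithmetic; it does not show that the
required mask is derived from that region.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's fls64(): 1-based position of the
 * highest set bit, or 0 when x == 0. */
static int fls64_sketch(uint64_t x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Hypothetical helper: smallest mask of the form 2^n - 1 that covers addr. */
static uint64_t covering_mask(uint64_t addr)
{
	int bits = fls64_sketch(addr);

	if (bits == 0)
		return 0;
	/* (1 << (bits - 1)) * 2 - 1 avoids an undefined 64-bit shift
	 * when bits == 64; unsigned wrap-around then yields all ones. */
	return ((UINT64_C(1) << (bits - 1)) * 2) - 1;
}

/* Values taken from the logs and the /proc/iomem diff in this thread. */
static void check_thread_values(void)
{
	assert(fls64_sketch(0xfffffffffffULL) == 44);   /* device mask */
	assert(fls64_sketch(0x3fffffffffffULL) == 46);  /* required mask */
	/* old and new end address of the 0000:03:00.0 region */
	assert(covering_mask(0xaffffffffffULL) == 0xfffffffffffULL);
	assert(covering_mask(0x3fffffffffffULL) == 0x3fffffffffffULL);
}
```

So a 44-bit device mask falls two bits short of a 46-bit required mask, while
still being far wider than 32 bits.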

Balbir Singh