linux-kernel - Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fb652517-1391-4d89-fe12-72d3748ec49e@amd.com>
Date:   Tue, 30 Mar 2021 15:02:38 +0200
From:   Christian König <christian.koenig@....com>
To:     Dan Horák <dan@...ny.cz>
Cc:     Xi Ruoyao <xry111@...gyan1223.wang>,
        Christian König <ckoenig.leichtzumerken@...il.com>,
        Alex Deucher <alexander.deucher@....com>,
        David Airlie <airlied@...ux.ie>,
        Felix Kuehling <Felix.Kuehling@....com>,
        linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
        amd-gfx@...ts.freedesktop.org, Daniel Vetter <daniel@...ll.ch>,
        stable@...r.kernel.org
Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

Am 30.03.21 um 15:00 schrieb Dan Horák:
> On Tue, 30 Mar 2021 14:55:01 +0200
> Christian König <christian.koenig@....com> wrote:
>
>> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
>>> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>>>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>>>> Hi Christian,
>>>>>>
>>>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:
>>>>>>
>>>>>>            /* validate the parameters */
>>>>>>            if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>>>                size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>>>                    return -EINVAL;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>>            saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>            eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>>            mapping->start = saddr;
>>>>>>            mapping->last = eaddr;
>>>>>>
>>>>>>
>>>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>>>> "AMDGPU_GPU_PAGE_MASK"
>>>>>> in "validate the parameters" with "PAGE_MASK".
>>> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
>>> "AMDGPU_GPU_PAGE_MASK" :(.
>>>
>>>>> Yeah, good point.
>>>>>
>>>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>>>> this
>>>>>> change.
>>>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>>>> basis while the native page size of the CPU is larger. Let me double
>>>>> check the code.
>>> On my platform, there are two issues both causing the VM corruption.  One is
>>> fixed by agd5f/linux@...01e7.
>> What is fe001e7? A commit id? If yes then that is to short and I can't
>> find it.
> it's a gitlab shortcut for
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux%2F-%2Fcommit%2Ffe001e70a55d0378328612be1fdc3abfc68b9ccc&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cd16d123aaa01420ebd0808d8f37bbf2f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527060812278536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=5rFVLxSRnfHUGjhoiqx1e6SeROqbg4ZPef%2BxEvgv%2BTg%3D&amp;reserved=0

Ah! Yes I would expect that this patch is fixing something in this use case.

Thanks,
Christian.

>
>
> 		Dan
>>>     Another is in Mesa from userspace:  it uses
>>> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
>>> expects it to use `dev_info->virtual_address_alignment`.
>> Mhm, looking at the kernel code I would rather say Mesa is correct and
>> the kernel driver is broken.
>>
>> The gart_page_size is limited by the CPU page size, but the
>> virtual_address_alignment isn't.
>>
>>> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
>>> and make virtual_address_alignment = 4K.  Otherwise, we should fortify the
>>> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then the
>>> userspace will just get an EINVAL, instead of a slient VM corruption.  And
>>> someone should tell Mesa developers to fix the code in this case.
>> I rather see this as a kernel bug. Can you test if this code fragment
>> fixes your issue:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 64beb3399604..e1260b517e1b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *filp)
>>                   }
>>                   dev_info->virtual_address_alignment =
>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>>                   dev_info->pte_fragment_size = (1 <<
>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>> +               dev_info->gart_page_size =
>> dev_info->virtual_address_alignment;
>>                   dev_info->cu_active_number = adev->gfx.cu_info.number;
>>                   dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>>                   dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>
>>
>> Thanks,
>> Christian.
>>
>>> --
>>> Xi Ruoyao <xry111@...gyan1223.wang>
>>> School of Aerospace Science and Technology, Xidian University
>>>