Message-ID: <730e85c9-33b5-9c57-7123-057b75cbbddf@nvidia.com>
Date: Mon, 22 Jun 2020 18:42:00 -0700
From: Ralph Campbell <rcampbell@...dia.com>
To: John Hubbard <jhubbard@...dia.com>,
<nouveau@...ts.freedesktop.org>, <linux-kernel@...r.kernel.org>
CC: Jerome Glisse <jglisse@...hat.com>, Christoph Hellwig <hch@....de>,
"Jason Gunthorpe" <jgg@...lanox.com>,
Ben Skeggs <bskeggs@...hat.com>
Subject: Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private
page migration
On 6/22/20 5:30 PM, John Hubbard wrote:
> On 2020-06-22 16:38, Ralph Campbell wrote:
>> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
>> migrate memory in the given address range to device private memory. The
>> source pages might already have been migrated to device private memory.
>> In that case, the source struct page is not checked to see if it is
>> a device private page, so the GPU's physical address of local memory
>> is computed incorrectly, leading to data corruption.
>> Fix this by checking the source struct page and computing the correct
>> physical address.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@...dia.com>
>> ---
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index cc9993837508..f6a806ba3caa 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>          if (!(src & MIGRATE_PFN_MIGRATE))
>>                  goto out;
>>
>> +        if (spage && is_device_private_page(spage)) {
>> +                paddr = nouveau_dmem_page_addr(spage);
>> +                *dma_addr = DMA_MAPPING_ERROR;
>> +                goto done;
>> +        }
>> +
>>          dpage = nouveau_dmem_page_alloc_locked(drm);
>>          if (!dpage)
>>                  goto out;
>> @@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>>                  goto out_free_page;
>>          }
>>
>> +done:
>>          *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
>>                  ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>>          if (src & MIGRATE_PFN_WRITE)
>> @@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
>>          struct migrate_vma args = {
>>                  .vma = vma,
>>                  .start = start,
>> +                .src_owner = drm->dev,
>
> Hi Ralph,
>
> This .src_owner setting does look like a required fix, but it seems like
> a completely separate fix from what is listed in this patch's commit
> description, right? (It feels like a casualty of rearranging the patches.)
>
>
> thanks,
It's a bit more complex. There is a catch-22 here with the change to
mm/migrate.c.

Without this patch or the mm/migrate.c change, a second call to
clEnqueueSVMMigrateMem() for the same address range will invalidate the
GPU mapping to device private memory created by the first call.

With this patch but not the mm/migrate.c change, the first call to
clEnqueueSVMMigrateMem() will fail to migrate normal anonymous memory
to device private memory.

Without this patch but with the mm/migrate.c change, a second call to
clEnqueueSVMMigrateMem() will crash the kernel because dma_map_page()
will be called with the device private PFN, which is not a valid CPU
physical address.

With both changes, a range of anonymous and device private pages can be
migrated to the GPU and the GPU page tables are updated properly.
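
To make the third case concrete, here is a minimal sketch of the
ordering constraint. The helper name and its return-value convention
are made up for illustration (this is not the actual
nouveau_dmem_migrate_copy_one() code); it only shows that the device
private check must run before the source page can ever be handed to
dma_map_page().

/* Hypothetical helper for discussion only -- not part of the patch. */
static int nouveau_sketch_map_source(struct nouveau_drm *drm,
                                     struct page *spage,
                                     dma_addr_t *dma_addr,
                                     unsigned long *paddr)
{
        struct device *dev = drm->dev->dev;

        /*
         * The source may already live in device private memory (VRAM)
         * from an earlier migration. Its PFN is not a valid CPU
         * physical address, so it must never reach dma_map_page();
         * reuse its GPU physical address instead.
         */
        if (spage && is_device_private_page(spage)) {
                *paddr = nouveau_dmem_page_addr(spage);
                *dma_addr = DMA_MAPPING_ERROR;
                return 0;
        }

        /*
         * System memory (or a hole with no source page): only a real
         * page may be handed to dma_map_page().
         */
        *dma_addr = DMA_MAPPING_ERROR;
        if (spage) {
                *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE,
                                         DMA_BIDIRECTIONAL);
                if (dma_mapping_error(dev, *dma_addr))
                        return -EFAULT;
        }
        return 1;       /* caller allocates VRAM, then copies or clears */
}

With that ordering, a second clEnqueueSVMMigrateMem() call sees the
already-migrated pages, skips the CPU-side DMA mapping entirely, and
only rebuilds the GPU page table entry, which is what the done: label
in the hunk above short-circuits to.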