Message-ID: <e5781df5-5244-465e-b986-c1802e1262db@gmail.com>
Date: Wed, 6 Mar 2024 16:04:56 +0100
From: Christian König <ckoenig.leichtzumerken@...il.com>
To: Alex Deucher <alexdeucher@...il.com>, "Khatri, Sunil" <sukhatri@....com>
Cc: Christian König <christian.koenig@....com>,
Sunil Khatri <sunil.khatri@....com>, Alex Deucher
<alexander.deucher@....com>, Shashank Sharma <shashank.sharma@....com>,
amd-gfx@...ts.freedesktop.org, Pan@...-sunil-navi33.amd.com,
Xinhui <Xinhui.Pan@....com>, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Mukul Joshi <mukul.joshi@....com>,
Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@....com>
Subject: Re: [PATCH] drm/amdgpu: cache in more vm fault information
On 06.03.24 at 15:29, Alex Deucher wrote:
> On Wed, Mar 6, 2024 at 8:04 AM Khatri, Sunil <sukhatri@....com> wrote:
>>
>> On 3/6/2024 6:12 PM, Christian König wrote:
>>> On 06.03.24 at 11:40, Khatri, Sunil wrote:
>>>> On 3/6/2024 3:37 PM, Christian König wrote:
>>>>> On 06.03.24 at 10:04, Sunil Khatri wrote:
>>>>>> When a page fault interrupt is raised there
>>>>>> is a lot more information that is useful for
>>>>>> developers to analyse the pagefault.
>>>>> Well, actually that information is not that interesting because
>>>>> it is hw generation specific.
>>>>>
>>>>> You should probably rather use the decoded strings here, e.g. hub,
>>>>> client, xcc_id, node_id etc...
>>>>>
>>>>> See gmc_v9_0_process_interrupt() for an example.
>>>> I saw that v9 provides more information than v10 and v11, such as
>>>> node_id and which die the fault came from, but that is again very
>>>> specific to IP_VERSION(9, 4, 3). I don't know why that information
>>>> is not there in v10 and v11.
>>>>
>>>> I agree with your point, but as of now we already dump this
>>>> information during a pagefault, and it is useful: which client
>>>> generated the interrupt, for which src, and other information
>>>> such as the address. So I think we should provide similar
>>>> information in the devcoredump.
>>>>
>>>> Currently we do not have all this information from either the job or
>>>> the vm derived from the job during a reset. We could surely add more
>>>> relevant information later on request, but this information is
>>>> useful, as eventually it is only developers who would use the dump
>>>> file provided by a customer to debug.
>>>>
>>>> Below is the information that I dump in the devcoredump. I feel it is
>>>> good information, and new information could be added later.
>>>>
>>>>> Page fault information
>>>>> [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32773)
>>>>> in page starting at address 0x0000000000000000 from client 0x1b (UTCL2)
>>> This is a perfect example of what I mean. What you record in the patch
>>> is the client_id, but this is basically meaningless unless you have
>>> access to the AMD internal hw documentation.
>>>
>>> What you really need is the client in decoded form, in this case
>>> UTCL2. You can keep the client_id additionally, but I think the
>>> decoded client string is mandatory to have.
>>>
>> Sure, I am capturing that information. I am trying to keep memory
>> interaction to a minimum since we are still in interrupt context
>> here. That is why I recorded the integer information instead of
>> decoding and writing strings right there, and postponed the decoding
>> until we dump.
>>
>> For example, decoding to gfxhub/mmhub based on vmhub/vmid_src, and the
>> client string from the client id. So are we good to go with sharing
>> the above details in the devcoredump, using the additional
>> information cached from the pagefault?
> I think amdgpu_vm_fault_info() has everything you need already (vmhub,
> status, and addr). client_id and src_id are just tokens in the
> interrupt cookie so we know which IP to route the interrupt to. We
> know what they will be because otherwise we'd be in the interrupt
> handler for a different IP. I don't think ring_id has any useful
> information in this context and vmid and pasid are probably not too
> useful because they are just tokens to associate the fault with a
> process. It would be better to have the process name.

The decoded client name would be really useful I think, since the fault
handler is a catch-all and handles a whole bunch of different clients.

But ideally that should be passed in as a const string instead of the hw
generation specific client_id.

As long as it's only a pointer, we also don't run into the trouble of
needing to allocate memory for it.

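Roughly something like the following, as an untested user-space sketch
of the idea. The struct, the table and the function names are made up
for illustration and are not the actual amdgpu code; only the
0x1b -> UTCL2 mapping is taken from the dump quoted above. The point is
that the cache only stores a pointer into a static table, so nothing
has to be allocated or copied in interrupt context:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct amdgpu_vm_fault_info. */
struct example_fault_info {
	uint64_t addr;
	uint32_t status;
	unsigned int vmhub;
	/* Points into a static const table; never allocated, never freed. */
	const char *client_name;
};

/*
 * Illustrative per-generation lookup table. Only the 0x1b entry comes
 * from the dump in this thread; a real table would be hw specific.
 */
static const char *const example_client_names[] = {
	[0x1b] = "UTCL2",
};

/* Decode a client id to a constant string, without touching the heap. */
static const char *example_decode_client(unsigned int cid)
{
	size_t n = sizeof(example_client_names) /
		   sizeof(example_client_names[0]);

	if (cid < n && example_client_names[cid])
		return example_client_names[cid];
	return "unknown";
}

/* The caller passes the decoded name; the cache just keeps the pointer. */
static void example_update_fault_cache(struct example_fault_info *info,
				       uint64_t addr, uint32_t status,
				       unsigned int vmhub,
				       const char *client_name)
{
	/* As in the existing code, a 0 status must not overwrite the cache. */
	if (!status)
		return;

	info->addr = addr;
	info->status = status;
	info->vmhub = vmhub;
	info->client_name = client_name;
}
```

Each hw generation would do its own decoding in its interrupt handler
and hand over the resulting string, so the common code never needs to
know what a given client_id means.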
Christian.
>
> Alex
>
>> regards
>> sunil
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards
>>>> Sunil Khatri
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Add all such information in the last cached
>>>>>> pagefault from an interrupt handler.
>>>>>>
>>>>>> Signed-off-by: Sunil Khatri <sunil.khatri@....com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 9 +++++++--
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 7 ++++++-
>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +-
>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 2 +-
>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 2 +-
>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 2 +-
>>>>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 2 +-
>>>>>> 7 files changed, 18 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> index 4299ce386322..b77e8e28769d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>>> @@ -2905,7 +2905,7 @@ void amdgpu_debugfs_vm_bo_info(struct
>>>>>> amdgpu_vm *vm, struct seq_file *m)
>>>>>> * Cache the fault info for later use by userspace in debugging.
>>>>>> */
>>>>>> void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
>>>>>> - unsigned int pasid,
>>>>>> + struct amdgpu_iv_entry *entry,
>>>>>> uint64_t addr,
>>>>>> uint32_t status,
>>>>>> unsigned int vmhub)
>>>>>> @@ -2915,7 +2915,7 @@ void amdgpu_vm_update_fault_cache(struct
>>>>>> amdgpu_device *adev,
>>>>>> xa_lock_irqsave(&adev->vm_manager.pasids, flags);
>>>>>> - vm = xa_load(&adev->vm_manager.pasids, pasid);
>>>>>> + vm = xa_load(&adev->vm_manager.pasids, entry->pasid);
>>>>>> /* Don't update the fault cache if status is 0. In the multiple
>>>>>> * fault case, subsequent faults will return a 0 status which is
>>>>>> * useless for userspace and replaces the useful fault
>>>>>> status, so
>>>>>> @@ -2924,6 +2924,11 @@ void amdgpu_vm_update_fault_cache(struct
>>>>>> amdgpu_device *adev,
>>>>>> if (vm && status) {
>>>>>> vm->fault_info.addr = addr;
>>>>>> vm->fault_info.status = status;
>>>>>> + vm->fault_info.client_id = entry->client_id;
>>>>>> + vm->fault_info.src_id = entry->src_id;
>>>>>> + vm->fault_info.vmid = entry->vmid;
>>>>>> + vm->fault_info.pasid = entry->pasid;
>>>>>> + vm->fault_info.ring_id = entry->ring_id;
>>>>>> if (AMDGPU_IS_GFXHUB(vmhub)) {
>>>>>> vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
>>>>>> vm->fault_info.vmhub |=
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> index 047ec1930d12..c7782a89bdb5 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>>> @@ -286,6 +286,11 @@ struct amdgpu_vm_fault_info {
>>>>>> uint32_t status;
>>>>>> /* which vmhub? gfxhub, mmhub, etc. */
>>>>>> unsigned int vmhub;
>>>>>> + unsigned int client_id;
>>>>>> + unsigned int src_id;
>>>>>> + unsigned int ring_id;
>>>>>> + unsigned int pasid;
>>>>>> + unsigned int vmid;
>>>>>> };
>>>>>> struct amdgpu_vm {
>>>>>> @@ -605,7 +610,7 @@ static inline void
>>>>>> amdgpu_vm_eviction_unlock(struct amdgpu_vm *vm)
>>>>>> }
>>>>>> void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
>>>>>> - unsigned int pasid,
>>>>>> + struct amdgpu_iv_entry *entry,
>>>>>> uint64_t addr,
>>>>>> uint32_t status,
>>>>>> unsigned int vmhub);
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>>>>> index d933e19e0cf5..6b177ce8db0e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>>>>>> @@ -150,7 +150,7 @@ static int gmc_v10_0_process_interrupt(struct
>>>>>> amdgpu_device *adev,
>>>>>> status = RREG32(hub->vm_l2_pro_fault_status);
>>>>>> WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
>>>>>> - amdgpu_vm_update_fault_cache(adev, entry->pasid, addr,
>>>>>> status,
>>>>>> + amdgpu_vm_update_fault_cache(adev, entry, addr, status,
>>>>>> entry->vmid_src ? AMDGPU_MMHUB0(0) :
>>>>>> AMDGPU_GFXHUB(0));
>>>>>> }
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
>>>>>> index 527dc917e049..bcf254856a3e 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
>>>>>> @@ -121,7 +121,7 @@ static int gmc_v11_0_process_interrupt(struct
>>>>>> amdgpu_device *adev,
>>>>>> status = RREG32(hub->vm_l2_pro_fault_status);
>>>>>> WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
>>>>>> - amdgpu_vm_update_fault_cache(adev, entry->pasid, addr,
>>>>>> status,
>>>>>> + amdgpu_vm_update_fault_cache(adev, entry, addr, status,
>>>>>> entry->vmid_src ? AMDGPU_MMHUB0(0) :
>>>>>> AMDGPU_GFXHUB(0));
>>>>>> }
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>>>>> index 3da7b6a2b00d..e9517ebbe1fd 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
>>>>>> @@ -1270,7 +1270,7 @@ static int gmc_v7_0_process_interrupt(struct
>>>>>> amdgpu_device *adev,
>>>>>> if (!addr && !status)
>>>>>> return 0;
>>>>>> - amdgpu_vm_update_fault_cache(adev, entry->pasid,
>>>>>> + amdgpu_vm_update_fault_cache(adev, entry,
>>>>>> ((u64)addr) << AMDGPU_GPU_PAGE_SHIFT,
>>>>>> status, AMDGPU_GFXHUB(0));
>>>>>> if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_FIRST)
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>>> index d20e5f20ee31..a271bf832312 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>>> @@ -1438,7 +1438,7 @@ static int gmc_v8_0_process_interrupt(struct
>>>>>> amdgpu_device *adev,
>>>>>> if (!addr && !status)
>>>>>> return 0;
>>>>>> - amdgpu_vm_update_fault_cache(adev, entry->pasid,
>>>>>> + amdgpu_vm_update_fault_cache(adev, entry,
>>>>>> ((u64)addr) << AMDGPU_GPU_PAGE_SHIFT,
>>>>>> status, AMDGPU_GFXHUB(0));
>>>>>> if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_FIRST)
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>>> index 47b63a4ce68b..dc9fb1fb9540 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>>> @@ -666,7 +666,7 @@ static int gmc_v9_0_process_interrupt(struct
>>>>>> amdgpu_device *adev,
>>>>>> rw = REG_GET_FIELD(status, VM_L2_PROTECTION_FAULT_STATUS, RW);
>>>>>> WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
>>>>>> - amdgpu_vm_update_fault_cache(adev, entry->pasid, addr,
>>>>>> status, vmhub);
>>>>>> + amdgpu_vm_update_fault_cache(adev, entry, addr, status, vmhub);
>>>>>> dev_err(adev->dev,
>>>>>> "VM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",