linux-kernel - Re: [PATCH v2] drm/panfrost:report the full raw fault information instead

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ebd984f1-19bd-4943-cbe6-2541aaa64946@arm.com>
Date:   Mon, 28 Jun 2021 15:17:51 +0100
From:   Robin Murphy <robin.murphy@....com>
To:     Steven Price <steven.price@....com>,
        Chunyou Tang <tangchunyou@....com>
Cc:     tomeu.vizoso@...labora.com, airlied@...ux.ie,
        linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
        alyssa.rosenzweig@...labora.com,
        ChunyouTang <tangchunyou@...becorp.cn>
Subject: Re: [PATCH v2] drm/panfrost:report the full raw fault information
 instead

On 2021-06-28 11:48, Steven Price wrote:
> On 25/06/2021 10:49, Chunyou Tang wrote:
>> Hi Steve,
>> 	Thinks for your reply.
>> 	When I only set the pte |= ARM_LPAE_PTE_SH_NS;there have no "GPU
>> Fault",When I set the pte |= ARM_LPAE_PTE_SH_IS(or
>> ARM_LPAE_PTE_SH_OS);there have "GPU Fault".I don't know how the pte
>> effect this issue?
>> 	Can you give me some suggestions again?
>>
>> Thinks.
>>
>> Chunyou
> 
> Hi Chunyou,
> 
> You haven't given me much context so I'm not entirely sure which PTE you
> are talking about (GPU or CPU), or indeed where you are changing the PTE
> values.
> 
> The PTEs control whether a page is shareable or not, the GPU requires
> that accesses are consistent (i.e. either all accesses to a page are
> shareable or all are non-shareable) and will race a fault if it detects
> this isn't the case. Mali also has a quirk for its version of 'LPAE'
> where inner shareable actually means only within the GPU and outer
> shareable means outside the GPU (which I think usually means Inner
> Shareable on the external bus).

Furthermore, the way the io-pgtable code works for ARM_MALI_LPAE format 
  means that *all* GPU mappings are unconditionally outer-shareable, so 
it's not clear how the GPU could observe a mismatch in the first place 
(other than major integration issues causing data corruption).

Robin.

> 
> Steve
> 
>> 于 Thu, 24 Jun 2021 14:22:04 +0100
>> Steven Price <steven.price@....com> 写道:
>>
>>> On 22/06/2021 02:40, Chunyou Tang wrote:
>>>> Hi Steve,
>>>> 	I will send a new patch with suitable subject/commit
>>>> message. But I send a V3 or a new patch?
>>>
>>> Send a V3 - it is a new version of this patch.
>>>
>>>> 	I met a bug about the GPU,I have no idea about how to fix
>>>> it, If you can give me some suggestion,it is perfect.
>>>>
>>>> You can see such kernel log:
>>>>
>>>> Jun 20 10:20:13 icube kernel: [  774.566760] mvp_gpu 0000:05:00.0:
>>>> GPU Fault 0x00000088 (SHAREABILITY_FAULT) at 0x000000000310fd00 Jun
>>>> 20 10:20:13 icube kernel: [  774.566764] mvp_gpu 0000:05:00.0:
>>>> There were multiple GPU faults - some have not been reported Jun 20
>>>> 10:20:13 icube kernel: [  774.667542] mvp_gpu 0000:05:00.0:
>>>> AS_ACTIVE bit stuck Jun 20 10:20:13 icube kernel: [  774.767900]
>>>> mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck Jun 20 10:20:13 icube
>>>> kernel: [  774.868546] mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck
>>>> Jun 20 10:20:13 icube kernel: [  774.968910] mvp_gpu 0000:05:00.0:
>>>> AS_ACTIVE bit stuck Jun 20 10:20:13 icube kernel: [  775.069251]
>>>> mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck Jun 20 10:20:22 icube
>>>> kernel: [  783.693971] mvp_gpu 0000:05:00.0: gpu sched timeout,
>>>> js=1, config=0x7300, status=0x8, head=0x362c900, tail=0x362c100,
>>>> sched_job=000000003252fb84
>>>>
>>>> In
>>>> https://lore.kernel.org/dri-devel/20200510165538.19720-1-peron.clem@gmail.com/
>>>> there had a same bug like mine,and I found you at the mail list,I
>>>> don't know how it fixed?
>>>
>>> The GPU_SHAREABILITY_FAULT error means that a cache line has been
>>> accessed both as shareable and non-shareable and therefore coherency
>>> cannot be guaranteed. Although the "multiple GPU faults" means that
>>> this may not be the underlying cause.
>>>
>>> The fact that your dmesg log has PCI style identifiers
>>> ("0000:05:00.0") suggests this is an unusual platform - I've not
>>> previously been aware of a Mali device behind PCI. Is this device
>>> working with the kbase/DDK proprietary driver? It would be worth
>>> looking at the kbase kernel code for the platform to see if there is
>>> anything special done for the platform.
>>>
>>>  From the dmesg logs all I can really tell is that the GPU seems
>>> unhappy about the memory system.
>>>
>>> Steve
>>>
>>>> I need your help!
>>>>
>>>> thinks very much!
>>>>
>>>> Chunyou
>>>>
>>>> 于 Mon, 21 Jun 2021 11:45:20 +0100
>>>> Steven Price <steven.price@....com> 写道:
>>>>
>>>>> On 19/06/2021 04:18, Chunyou Tang wrote:
>>>>>> Hi Steve,
>>>>>> 	1,Now I know how to write the subject
>>>>>> 	2,the low 8 bits is the exception type in spec.
>>>>>>
>>>>>> and you can see prnfrost_exception_name()
>>>>>>
>>>>>> switch (exception_code) {
>>>>>>                  /* Non-Fault Status code */
>>>>>> case 0x00: return "NOT_STARTED/IDLE/OK";
>>>>>> case 0x01: return "DONE";
>>>>>> case 0x02: return "INTERRUPTED";
>>>>>> case 0x03: return "STOPPED";
>>>>>> case 0x04: return "TERMINATED";
>>>>>> case 0x08: return "ACTIVE";
>>>>>> ........
>>>>>> ........
>>>>>> case 0xD8: return "ACCESS_FLAG";
>>>>>> case 0xD9 ... 0xDF: return "ACCESS_FLAG";
>>>>>> case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
>>>>>> case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
>>>>>> }
>>>>>> return "UNKNOWN";
>>>>>> }
>>>>>>
>>>>>> the exception_code in case is only 8 bits,so if fault_status
>>>>>> in panfrost_gpu_irq_handler() don't & 0xFF,it can't get correct
>>>>>> exception reason,it will be always UNKNOWN.
>>>>>
>>>>> Yes, I'm happy with the change - I just need a patch that I can
>>>>> apply. At the moment this patch only changes the first '0x%08x'
>>>>> output rather than the call to panfrost_exception_name() as well.
>>>>> So we just need a patch which does:
>>>>>
>>>>> - fault_status & 0xFF, panfrost_exception_name(pfdev,
>>>>> fault_status),
>>>>> + fault_status, panfrost_exception_name(pfdev, fault_status &
>>>>> 0xFF),
>>>>>
>>>>> along with a suitable subject/commit message describing the
>>>>> change. If you can send me that I can apply it.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Steve
>>>>>
>>>>> PS. Sorry for going round in circles here - I'm trying to help you
>>>>> get setup so you'll be able to contribute patches easily in
>>>>> future. An important part of that is ensuring you can send a
>>>>> properly formatted patch to the list.
>>>>>
>>>>> PPS. I'm still not receiving your emails directly. I don't think
>>>>> it's a problem at my end because I'm receiving other emails, but
>>>>> if you can somehow fix the problem you're likely to receive a
>>>>> faster response.
>>>>>
>>>>>> 于 Fri, 18 Jun 2021 13:43:24 +0100
>>>>>> Steven Price <steven.price@....com> 写道:
>>>>>>
>>>>>>> On 17/06/2021 07:20, ChunyouTang wrote:
>>>>>>>> From: ChunyouTang <tangchunyou@...becorp.cn>
>>>>>>>>
>>>>>>>> of the low 8 bits.
>>>>>>>
>>>>>>> Please don't split the subject like this. The first line of the
>>>>>>> commit should be a (very short) summary of the patch. Then a
>>>>>>> blank line and then a longer description of what the purpose of
>>>>>>> the patch is and why it's needed.
>>>>>>>
>>>>>>> Also you previously had this as part of a series (the first part
>>>>>>> adding the "& 0xFF" in the panfrost_exception_name() call). I'm
>>>>>>> not sure we need two patches for the single line, but as it
>>>>>>> stands this patch doesn't apply.
>>>>>>>
>>>>>>> Also I'm still not receiving any emails from you directly (only
>>>>>>> via the list), so it's possible I might have missed something
>>>>>>> you sent.
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>>>
>>>>>>>> Signed-off-by: ChunyouTang <tangchunyou@...becorp.cn>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/panfrost/panfrost_gpu.c | 2 +-
>>>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c
>>>>>>>> b/drivers/gpu/drm/panfrost/panfrost_gpu.c index
>>>>>>>> 1fffb6a0b24f..d2d287bbf4e7 100644 ---
>>>>>>>> a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++
>>>>>>>> b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -33,7 +33,7 @@
>>>>>>>> static irqreturn_t panfrost_gpu_irq_handler(int irq, void
>>>>>>>> *data) address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
>>>>>>>>   		dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at
>>>>>>>> 0x%016llx\n",
>>>>>>>> -			 fault_status & 0xFF,
>>>>>>>> panfrost_exception_name(pfdev, fault_status & 0xFF),
>>>>>>>> +			 fault_status,
>>>>>>>> panfrost_exception_name(pfdev, fault_status & 0xFF), address);
>>>>>>>>   
>>>>>>>>   		if (state & GPU_IRQ_MULTIPLE_FAULT)
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>