linux-kernel - Re: [PATCH] drm/radeon: add a force flush to delay work when radeon

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2f38b94b-0965-80f2-5bae-840765ffc4da@kylinos.cn>
Date:   Wed, 17 Aug 2022 15:31:32 +0800
From:   李真能 <lizhenneng@...inos.cn>
To:     Christian König <christian.koenig@....com>,
        Christian König <ckoenig.leichtzumerken@...il.com>,
        Alex Deucher <alexander.deucher@....com>
Cc:     David Airlie <airlied@...ux.ie>, Pan Xinhui <Xinhui.Pan@....com>,
        linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
        amd-gfx@...ts.freedesktop.org, Daniel Vetter <daniel@...ll.ch>
Subject: Re: [PATCH] drm/radeon: add a force flush to delay work when radeon


在 2022/8/15 21:12, Christian König 写道:
> Am 15.08.22 um 09:34 schrieb 李真能:
>>
>> 在 2022/8/12 18:55, Christian König 写道:
>>> Am 11.08.22 um 09:25 schrieb Zhenneng Li:
>>>> Although radeon card fence and wait for gpu to finish processing 
>>>> current batch rings,
>>>> there is still a corner case that radeon lockup work queue may not 
>>>> be fully flushed,
>>>> and meanwhile the radeon_suspend_kms() function has called 
>>>> pci_set_power_state() to
>>>> put device in D3hot state.
>>>
>>> If I'm not completely mistaken the reset worker uses the 
>>> suspend/resume functionality as well to get the hardware into a 
>>> working state again.
>>>
>>> So if I'm not completely mistaken this here would lead to a 
>>> deadlock, please double check that.
>>
>> We have tested many times, there are no deadlock.
>
> Testing doesn't tells you anything, you need to audit the call paths.
>
>> In which situation, there would lead to a deadlock?
>
> GPU resets.

Although flush_delayed_work(&rdev->fence_drv[i].lockup_work) will wait 
for a lockup_work to finish executing the last queueing,  but this 
kernel func haven't get any lock, and lockup_work will run in another 
kernel thread, so I think flush_delayed_work could not lead to a deadlock.

Therefor if radeon_gpu_reset is called in another thread when 
radeon_suspend_kms is blocked on flush_delayed_work, there could not 
lead to a deadlock.

>
> Regards,
> Christian.
>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Per PCI spec rev 4.0 on 5.3.1.4.1 D3hot State.
>>>>> Configuration and Message requests are the only TLPs accepted by a 
>>>>> Function in
>>>>> the D3hot state. All other received Requests must be handled as 
>>>>> Unsupported Requests,
>>>>> and all received Completions may optionally be handled as 
>>>>> Unexpected Completions.
>>>> This issue will happen in following logs:
>>>> Unable to handle kernel paging request at virtual address 
>>>> 00008800e0008010
>>>> CPU 0 kworker/0:3(131): Oops 0
>>>> pc = [<ffffffff811bea5c>]  ra = [<ffffffff81240844>]  ps = 0000 
>>>> Tainted: G        W
>>>> pc is at si_gpu_check_soft_reset+0x3c/0x240
>>>> ra is at si_dma_is_lockup+0x34/0xd0
>>>> v0 = 0000000000000000  t0 = fff08800e0008010  t1 = 0000000000010000
>>>> t2 = 0000000000008010  t3 = fff00007e3c00000  t4 = fff00007e3c00258
>>>> t5 = 000000000000ffff  t6 = 0000000000000001  t7 = fff00007ef078000
>>>> s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 = fff00007e3c00018
>>>> s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 = 0000000000000000
>>>> s6 = fff00007ef07bd98
>>>> a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 = 0000000000000008
>>>> a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 = ffffffff810f4338
>>>> t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 = ff6769c5d964b800
>>>> t11= 000000000000b886  pv = ffffffff811bea20  at = 0000000000000000
>>>> gp = ffffffff81d89690  sp = 00000000aa814126
>>>> Disabling lock debugging due to kernel taint
>>>> Trace:
>>>> [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
>>>> [<ffffffff81119610>] radeon_fence_check_lockup+0xd0/0x290
>>>> [<ffffffff80977010>] process_one_work+0x280/0x550
>>>> [<ffffffff80977350>] worker_thread+0x70/0x7c0
>>>> [<ffffffff80977410>] worker_thread+0x130/0x7c0
>>>> [<ffffffff80982040>] kthread+0x200/0x210
>>>> [<ffffffff809772e0>] worker_thread+0x0/0x7c0
>>>> [<ffffffff80981f8c>] kthread+0x14c/0x210
>>>> [<ffffffff80911658>] ret_from_kernel_thread+0x18/0x20
>>>> [<ffffffff80981e40>] kthread+0x0/0x210
>>>>   Code: ad3e0008  43f0074a  ad7e0018  ad9e0020  8c3001e8 40230101
>>>>   <88210000> 4821ed21
>>>> So force lockup work queue flush to fix this problem.
>>>>
>>>> Signed-off-by: Zhenneng Li <lizhenneng@...inos.cn>
>>>> ---
>>>>   drivers/gpu/drm/radeon/radeon_device.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>> index 15692cb241fc..e608ca26780a 100644
>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>> @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct drm_device 
>>>> *dev, bool suspend,
>>>>           if (r) {
>>>>               /* delay GPU reset to resume */
>>>>               radeon_fence_driver_force_completion(rdev, i);
>>>> +        } else {
>>>> +            /* finish executing delayed work */
>>>> + flush_delayed_work(&rdev->fence_drv[i].lockup_work);
>>>>           }
>>>>       }
>>>
>