linux-kernel - Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d321918e-6f3b-4984-9163-427b579dc57e@amd.com>
Date:   Mon, 21 Aug 2023 20:22:12 +0200
From:   Christian König <christian.koenig@....com>
To:     Mirsad Todorovac <mirsad.todorovac@....unizg.hr>,
        linux-kernel@...r.kernel.org
Cc:     Luben Tuikov <luben.tuikov@....com>,
        David Airlie <airlied@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        dri-devel@...ts.freedesktop.org, linux-media@...r.kernel.org,
        linaro-mm-sig@...ts.linaro.org
Subject: Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] /
 drm_sched_entity_push_job [gpu_sched]

I'm not sure about that.

On the one hand it might generate some noise. I know tons of cases where 
logic is: Ok if we see the updated value immediately it will optimize 
things, but if not it's unproblematic because there is another check 
after the next memory barrier.

On the other hand we probably have cases where this is not correctly 
implemented. So double checking those would most like be good idea.

Regards,
Christian.

Am 21.08.23 um 16:28 schrieb Mirsad Todorovac:
> Hi Christian,
>
> Thank you for the update.
>
> Should I continue reporting what KCSAN gives? I will try to filter 
> these to save your time for
> evaluation ...
>
> Kind regards,
> Mirsad
>
> On 8/21/23 15:20, Christian König wrote:
>> Hi Mirsad,
>>
>> well this is a false positive.
>>
>> That drm_sched_entity_is_ready() doesn't see the data written by 
>> drm_sched_entity_push_job() is part of the logic here.
>>
>> Regards,
>> Christian.
>>
>> Am 18.08.23 um 15:44 schrieb Mirsad Todorovac:
>>> On 8/17/23 21:54, Mirsad Todorovac wrote:
>>>> Hi,
>>>>
>>>> This is your friendly bug reporter.
>>>>
>>>> The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS 
>>>> and a Ryzen 7950X box.
>>>>
>>>> Please find attached the complete dmesg output from the ring buffer 
>>>> and lshw output.
>>>>
>>>> NOTE: The kernel reports tainted kernel, but to my knowledge there 
>>>> are no proprietary (G) modules,
>>>>        but this taint is turned on by the previous bugs.
>>>>
>>>> dmesg excerpt:
>>>>
>>>> [ 8791.864576] 
>>>> ==================================================================
>>>> [ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready 
>>>> [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
>>>>
>>>> [ 8791.864776] write (marked) to 0xffff9b74491b7c40 of 8 bytes by 
>>>> task 3807 on cpu 18:
>>>> [ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
>>>> [ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
>>>> [ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
>>>> [ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
>>>> [ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
>>>> [ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
>>>> [ 8791.873676]  do_syscall_64+0x58/0x90
>>>> [ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
>>>>
>>>> [ 8791.873710] read to 0xffff9b74491b7c40 of 8 bytes by task 1119 
>>>> on cpu 27:
>>>> [ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
>>>> [ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
>>>> [ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
>>>> [ 8791.873912]  kthread+0x18b/0x1d0
>>>> [ 8791.873924]  ret_from_fork+0x43/0x70
>>>> [ 8791.873939]  ret_from_fork_asm+0x1b/0x30
>>>>
>>>> [ 8791.873955] value changed: 0x0000000000000000 -> 0xffff9b750ebcfc00
>>>>
>>>> [ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
>>>> [ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: 
>>>> G             L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
>>>> [ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG 
>>>> Lightning, BIOS 1.21 04/26/2023
>>>> [ 8791.874002] 
>>>> ==================================================================
>>>
>>> P.S.
>>>
>>> According to Mr. Heo's instructions, I am adding the unwound trace 
>>> here:
>>>
>>> [ 1879.706518] 
>>> ==================================================================
>>> [ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready 
>>> [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
>>>
>>> [ 1879.706737] write (marked) to 0xffff8f3672748c40 of 8 bytes by 
>>> task 4087 on cpu 10:
>>> [ 1879.706748] drm_sched_entity_push_job 
>>> (./include/drm/spsc_queue.h:74 
>>> drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched
>>> [ 1879.706808] amdgpu_cs_ioctl 
>>> (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu
>>> [ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm
>>> [ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm
>>> [ 1879.711551] amdgpu_drm_ioctl 
>>> (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) amdgpu
>>> [ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 
>>> fs/ioctl.c:856 fs/ioctl.c:856)
>>> [ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 
>>> arch/x86/entry/common.c:80)
>>> [ 1879.715345] entry_SYSCALL_64_after_hwframe 
>>> (arch/x86/entry/entry_64.S:120)
>>>
>>> [ 1879.715365] read to 0xffff8f3672748c40 of 8 bytes by task 1098 on 
>>> cpu 11:
>>> [ 1879.715376] drm_sched_entity_is_ready 
>>> (drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched
>>> [ 1879.715435] drm_sched_select_entity 
>>> (drivers/gpu/drm/scheduler/sched_main.c:248 
>>> drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched
>>> [ 1879.715495] drm_sched_main 
>>> (drivers/gpu/drm/scheduler/sched_main.c:1019) gpu_sched
>>> [ 1879.715554] kthread (kernel/kthread.c:389)
>>> [ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145)
>>> [ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>>>
>>> [ 1879.715590] value changed: 0x0000000000000000 -> 0xffff8f360663dc00
>>>
>>> [ 1879.715604] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: 
>>> G             L     6.5.0-rc6+ #47
>>> [ 1879.715624] Hardware name: ASRock X670E PG Lightning/X670E PG 
>>> Lightning, BIOS 1.21 04/26/2023
>>> [ 1879.715631] 
>>> ==================================================================
>>>
>>> It seems that the line in question might be:
>>>
>>>     first = spsc_queue_push(&entity->job_queue, 
>>> &sched_job->queue_node);
>>>
>>> which expands to:
>>>
>>> static inline bool spsc_queue_push(struct spsc_queue *queue, struct 
>>> spsc_node *node)
>>> {
>>>     struct spsc_node **tail;
>>>
>>>     node->next = NULL;
>>>
>>>     preempt_disable();
>>>
>>>     tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, 
>>> (long)&node->next);
>>>     WRITE_ONCE(*tail, node);
>>>     atomic_inc(&queue->job_count);
>>>
>>>     /*
>>>      * In case of first element verify new node will be visible to 
>>> the consumer
>>>      * thread when we ping the kernel thread that there is new work 
>>> to do.
>>>      */
>>>     smp_wmb();
>>>
>>>     preempt_enable();
>>>
>>>     return tail == &queue->head;
>>> }
>>>
>>> According to the manual, preempt_disable() only guaranteed exclusion 
>>> on a single CPU/core/thread, so
>>> we might be plagued with the slow, old fashioned locking unless 
>>> anyone had a better idea.
>>>
>>> Best regards,
>>> Mirsad Todorovac