linux-kernel - Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b74a5cc3-8174-67f3-17ab-2e8a7d8fa1a6@amd.com>
Date:   Mon, 21 Aug 2023 15:20:54 +0200
From:   Christian König <christian.koenig@....com>
To:     Mirsad Todorovac <mirsad.todorovac@....unizg.hr>,
        linux-kernel@...r.kernel.org
Cc:     Luben Tuikov <luben.tuikov@....com>,
        David Airlie <airlied@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        dri-devel@...ts.freedesktop.org, linux-media@...r.kernel.org,
        linaro-mm-sig@...ts.linaro.org
Subject: Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] /
 drm_sched_entity_push_job [gpu_sched]

Hi Mirsad,

well this is a false positive.

That drm_sched_entity_is_ready() doesn't see the data written by 
drm_sched_entity_push_job() is part of the logic here.

Regards,
Christian.

Am 18.08.23 um 15:44 schrieb Mirsad Todorovac:
> On 8/17/23 21:54, Mirsad Todorovac wrote:
>> Hi,
>>
>> This is your friendly bug reporter.
>>
>> The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS 
>> and a Ryzen 7950X box.
>>
>> Please find attached the complete dmesg output from the ring buffer 
>> and lshw output.
>>
>> NOTE: The kernel reports tainted kernel, but to my knowledge there 
>> are no proprietary (G) modules,
>>        but this taint is turned on by the previous bugs.
>>
>> dmesg excerpt:
>>
>> [ 8791.864576] 
>> ==================================================================
>> [ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready 
>> [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
>>
>> [ 8791.864776] write (marked) to 0xffff9b74491b7c40 of 8 bytes by 
>> task 3807 on cpu 18:
>> [ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
>> [ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
>> [ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
>> [ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
>> [ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
>> [ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
>> [ 8791.873676]  do_syscall_64+0x58/0x90
>> [ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
>>
>> [ 8791.873710] read to 0xffff9b74491b7c40 of 8 bytes by task 1119 on 
>> cpu 27:
>> [ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
>> [ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
>> [ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
>> [ 8791.873912]  kthread+0x18b/0x1d0
>> [ 8791.873924]  ret_from_fork+0x43/0x70
>> [ 8791.873939]  ret_from_fork_asm+0x1b/0x30
>>
>> [ 8791.873955] value changed: 0x0000000000000000 -> 0xffff9b750ebcfc00
>>
>> [ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
>> [ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: 
>> G             L     6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
>> [ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG 
>> Lightning, BIOS 1.21 04/26/2023
>> [ 8791.874002] 
>> ==================================================================
>
> P.S.
>
> According to Mr. Heo's instructions, I am adding the unwound trace here:
>
> [ 1879.706518] 
> ==================================================================
> [ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready 
> [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
>
> [ 1879.706737] write (marked) to 0xffff8f3672748c40 of 8 bytes by task 
> 4087 on cpu 10:
> [ 1879.706748] drm_sched_entity_push_job 
> (./include/drm/spsc_queue.h:74 
> drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched
> [ 1879.706808] amdgpu_cs_ioctl 
> (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu
> [ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm
> [ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm
> [ 1879.711551] amdgpu_drm_ioctl 
> (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) amdgpu
> [ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 
> fs/ioctl.c:856 fs/ioctl.c:856)
> [ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80)
> [ 1879.715345] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:120)
>
> [ 1879.715365] read to 0xffff8f3672748c40 of 8 bytes by task 1098 on 
> cpu 11:
> [ 1879.715376] drm_sched_entity_is_ready 
> (drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched
> [ 1879.715435] drm_sched_select_entity 
> (drivers/gpu/drm/scheduler/sched_main.c:248 
> drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched
> [ 1879.715495] drm_sched_main 
> (drivers/gpu/drm/scheduler/sched_main.c:1019) gpu_sched
> [ 1879.715554] kthread (kernel/kthread.c:389)
> [ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145)
> [ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>
> [ 1879.715590] value changed: 0x0000000000000000 -> 0xffff8f360663dc00
>
> [ 1879.715604] Reported by Kernel Concurrency Sanitizer on:
> [ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: 
> G             L     6.5.0-rc6+ #47
> [ 1879.715624] Hardware name: ASRock X670E PG Lightning/X670E PG 
> Lightning, BIOS 1.21 04/26/2023
> [ 1879.715631] 
> ==================================================================
>
> It seems that the line in question might be:
>
>     first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>
> which expands to:
>
> static inline bool spsc_queue_push(struct spsc_queue *queue, struct 
> spsc_node *node)
> {
>     struct spsc_node **tail;
>
>     node->next = NULL;
>
>     preempt_disable();
>
>     tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, 
> (long)&node->next);
>     WRITE_ONCE(*tail, node);
>     atomic_inc(&queue->job_count);
>
>     /*
>      * In case of first element verify new node will be visible to the 
> consumer
>      * thread when we ping the kernel thread that there is new work to 
> do.
>      */
>     smp_wmb();
>
>     preempt_enable();
>
>     return tail == &queue->head;
> }
>
> According to the manual, preempt_disable() only guaranteed exclusion 
> on a single CPU/core/thread, so
> we might be plagued with the slow, old fashioned locking unless anyone 
> had a better idea.
>
> Best regards,
> Mirsad Todorovac