linux-kernel - Re: [BUG][v6.9-rc6] Deadlock with: Revert "drm/qxl: simplify qxl_fence

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240504043957.417aa98c@rorschach.local.home>
Date: Sat, 4 May 2024 04:39:57 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: LKML <linux-kernel@...r.kernel.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, Alex Constantino
 <dreaming.about.electric.sheep@...il.com>, Maxime Ripard
 <mripard@...nel.org>, Timo Lindfors <timo.lindfors@....fi>, Dave Airlie
 <airlied@...hat.com>, Gerd Hoffmann <kraxel@...hat.com>, Maarten Lankhorst
 <maarten.lankhorst@...ux.intel.com>, Thomas Zimmermann
 <tzimmermann@...e.de>, Daniel Vetter <daniel@...ll.ch>
Subject: Re: [BUG][v6.9-rc6] Deadlock with: Revert "drm/qxl: simplify
 qxl_fence_wait"


Did anyone see this?

-- Steve


On Thu, 2 May 2024 08:16:41 -0400
Steven Rostedt <rostedt@...dmis.org> wrote:

> I went to run my tests on my VMs and the tests hung on boot up.
> Unfortunately, the most I ever got out was:
> 
> [   93.607888] Testing event system initcall: OK
> [   93.667730] Running tests on all trace events:
> [   93.669757] Testing all events: OK
> [   95.631064] ------------[ cut here ]------------
> Timed out after 60 seconds
> 
> I ran a bisect and it came up with:
> 
>  # first bad commit: [07ed11afb68d94eadd4ffc082b97c2331307c5ea] Revert "drm/qxl: simplify qxl_fence_wait"
> 
> I checked out 07ed11afb68d94eadd4ffc082b97c2331307c5ea~1 and it booted
> fine. Added back that commit, it failed to boot. I did this twice, and got
> the same results.
> 
> But the last time I ran it, it did trigger this:
> 
>  ------------[ cut here ]------------
>  
>  ======================================================
>  WARNING: possible circular locking dependency detected
>  6.9.0-rc1-test-00021-g07ed11afb68d #5 Not tainted
>  ------------------------------------------------------
>  kworker/u24:3/119 is trying to acquire lock:
>  ffffffff95aa4600 (console_owner){....}-{0:0}, at: console_flush_all+0x1f5/0x530
>  
>  but task is already holding lock:
>  ffff93c4bbd37218 (&pool->lock){-.-.}-{2:2}, at: __flush_work+0xc1/0x440
>  
>  which lock already depends on the new lock.
>  
>  
>  the existing dependency chain (in reverse order) is:
>  
>  -> #1 (&pool->lock){-.-.}-{2:2}:  
>         _raw_spin_lock+0x33/0x40
>         __queue_work+0xd6/0x610
>         queue_work_on+0x8a/0x90
>         soft_cursor+0x1a0/0x230
>         bit_cursor+0x386/0x5f0
>         hide_cursor+0x27/0xb0
>         vt_console_print+0x474/0x490
>         console_flush_all+0x22e/0x530
>         console_unlock+0x56/0x160
>         vprintk_emit+0x160/0x390
>         dev_printk_emit+0xa5/0xd0
>         _dev_info+0x79/0xa0
>         __drm_fb_helper_initial_config_and_unlock+0x3a9/0x5f0
>         drm_fbdev_generic_client_hotplug+0x69/0xc0
>         drm_client_register+0x7b/0xc0
>         qxl_pci_probe+0x107/0x1a0
>         local_pci_probe+0x45/0xa0
>         pci_device_probe+0xc7/0x240
>         really_probe+0xd6/0x390
>         __driver_probe_device+0x78/0x150
>         driver_probe_device+0x1f/0x90
>         __driver_attach+0xd6/0x1d0
>         bus_for_each_dev+0x8f/0xe0
>         bus_add_driver+0x119/0x220
>         driver_register+0x59/0x100
>         do_one_initcall+0x76/0x3c0
>         kernel_init_freeable+0x3a5/0x5b0
>         kernel_init+0x1a/0x1c0
>         ret_from_fork+0x34/0x50
>         ret_from_fork_asm+0x1a/0x30
>  
>  -> #0 (console_owner){....}-{0:0}:  
>         __lock_acquire+0x13e7/0x2180
>         lock_acquire+0xd9/0x300
>         console_flush_all+0x212/0x530
>         console_unlock+0x56/0x160
>         vprintk_emit+0x160/0x390
>         _printk+0x64/0x80
>         __warn_printk+0x8e/0x180
>         check_flush_dependency+0xfd/0x120
>         __flush_work+0xfa/0x440
>         qxl_queue_garbage_collect+0x83/0x90
>         qxl_fence_wait+0xa4/0x1a0
>         dma_fence_wait_timeout+0x98/0x1e0
>         dma_resv_wait_timeout+0x7f/0xe0
>         ttm_bo_delayed_delete+0x2b/0x90
>         process_one_work+0x228/0x740
>         worker_thread+0x1dc/0x3c0
>         kthread+0xf2/0x120
>         ret_from_fork+0x34/0x50
>         ret_from_fork_asm+0x1a/0x30
>  
>  other info that might help us debug this:
>  
>   Possible unsafe locking scenario:
>  
>         CPU0                    CPU1
>         ----                    ----
>    lock(&pool->lock);
>                                 lock(console_owner);
>                                 lock(&pool->lock);
>    lock(console_owner);
>  
>   *** DEADLOCK ***
>  
>  6 locks held by kworker/u24:3/119:
>   #0: ffff93c440245948 ((wq_completion)ttm){+.+.}-{0:0}, at: process_one_work+0x43a/0x740
>   #1: ffffa01380d83e60 ((work_completion)(&bo->delayed_delete)){+.+.}-{0:0}, at: process_one_work+0x1e2/0x740
>   #2: ffffffff95b17880 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x86/0x440
>   #3: ffff93c4bbd37218 (&pool->lock){-.-.}-{2:2}, at: __flush_work+0xc1/0x440
>   #4: ffffffff95b149c0 (console_lock){+.+.}-{0:0}, at: _printk+0x64/0x80
>   #5: ffffffff95b14a10 (console_srcu){....}-{0:0}, at: console_flush_all+0x7b/0x530
>  
>  stack backtrace:
>  CPU: 2 PID: 119 Comm: kworker/u24:3 Not tainted 6.9.0-rc1-test-00021-g07ed11afb68d #5
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  Workqueue: ttm ttm_bo_delayed_delete
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x77/0xb0
>   check_noncircular+0x148/0x160
>   __lock_acquire+0x13e7/0x2180
>   lock_acquire+0xd9/0x300
>   ? console_flush_all+0x1f5/0x530
>   ? lock_release+0x147/0x2c0
>   ? console_flush_all+0x1f5/0x530
>   console_flush_all+0x212/0x530
>   ? console_flush_all+0x1f5/0x530
>   console_unlock+0x56/0x160
>   vprintk_emit+0x160/0x390
>   _printk+0x64/0x80
>   ? __pfx_ttm_bo_delayed_delete+0x10/0x10
>   ? __pfx_qxl_gc_work+0x10/0x10
>   __warn_printk+0x8e/0x180
>   ? __pfx_ttm_bo_delayed_delete+0x10/0x10
>   ? __pfx_qxl_gc_work+0x10/0x10
>   ? __pfx_qxl_gc_work+0x10/0x10
>   check_flush_dependency+0xfd/0x120
>   __flush_work+0xfa/0x440
>   qxl_queue_garbage_collect+0x83/0x90
>   qxl_fence_wait+0xa4/0x1a0
>   dma_fence_wait_timeout+0x98/0x1e0
>   dma_resv_wait_timeout+0x7f/0xe0
>   ttm_bo_delayed_delete+0x2b/0x90
>   process_one_work+0x228/0x740
>   worker_thread+0x1dc/0x3c0
>   ? __pfx_worker_thread+0x10/0x10
>   kthread+0xf2/0x120
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x34/0x50
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete is flushing !WQ_MEM_RECLAIM events:qxl_gc_work
>  WARNING: CPU: 2 PID: 119 at kernel/workqueue.c:3728 check_flush_dependency+0xfd/0x120
>  Modules linked in:
>  CPU: 2 PID: 119 Comm: kworker/u24:3 Not tainted 6.9.0-rc1-test-00021-g07ed11afb68d #5
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>  Workqueue: ttm ttm_bo_delayed_delete
>  RIP: 0010:check_flush_dependency+0xfd/0x120
>  Code: 8b 45 18 48 8d b2 70 01 00 00 49 89 e8 48 8d 8b 70 01 00 00 48 c7 c7 60 46 7b 95 c6 05 48 67 d2 01 01 48 89 c2 e8 63 40 fd ff <0f> 0b e9 1e ff ff ff 80 3d 33 67 d2 01 00 75 93 e9 4a ff ff ff 66
>  RSP: 0000:ffffa01380d83c28 EFLAGS: 00010086
>  RAX: 0000000000000000 RBX: ffff93c44004ee00 RCX: 0000000000000000
>  RDX: 0000000080000003 RSI: 00000000ffffefff RDI: 0000000000000001
>  RBP: ffffffff9497b100 R08: 0000000000000000 R09: 0000000000000003
>  R10: ffffa01380d83ab8 R11: ffffffff95b14828 R12: ffff93c443980000
>  R13: ffff93c440fbe300 R14: 0000000000000001 R15: ffff93c44004ee00
>  FS:  0000000000000000(0000) GS:ffff93c4bbd00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000000 CR3: 000000007c864001 CR4: 0000000000170ef0
>  Call Trace:
>   <TASK>
>   ? __warn+0x8c/0x180
>   ? check_flush_dependency+0xfd/0x120
>   ? report_bug+0x191/0x1c0
>   ? prb_read_valid+0x1b/0x30
>   ? handle_bug+0x3c/0x80
>   ? exc_invalid_op+0x17/0x70
>   ? asm_exc_invalid_op+0x1a/0x20
>   ? __pfx_qxl_gc_work+0x10/0x10
>   ? check_flush_dependency+0xfd/0x120
>   ? check_flush_dependency+0xfd/0x120
>   __flush_work+0xfa/0x440
>   qxl_queue_garbage_collect+0x83/0x90
>   qxl_fence_wait+0xa4/0x1a0
>   dma_fence_wait_timeout+0x98/0x1e0
>   dma_resv_wait_timeout+0x7f/0xe0
>   ttm_bo_delayed_delete+0x2b/0x90
>   process_one_work+0x228/0x740
>   worker_thread+0x1dc/0x3c0
>   ? __pfx_worker_thread+0x10/0x10
>   kthread+0xf2/0x120
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x34/0x50
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  irq event stamp: 58
>  hardirqs last  enabled at (57): [<ffffffff93fede30>] queue_work_on+0x60/0x90
>  hardirqs last disabled at (58): [<ffffffff94ea7f66>] _raw_spin_lock_irq+0x56/0x60
>  softirqs last  enabled at (0): [<ffffffff93fbae27>] copy_process+0xc07/0x2c60
>  softirqs last disabled at (0): [<0000000000000000>] 0x0
>  ---[ end trace 0000000000000000 ]---
> 
> So there's an issue with dma_fence and a workqueue.
> 
> -- Steve
>