Message-ID: <5e70658f-d071-bcec-74ce-214f12da396e@suse.com>
Date:   Wed, 10 Jun 2020 14:00:22 +0300
From:   Nikolay Borisov <nborisov@...e.com>
To:     Michał Mirosław <mirq-linux@...e.qmqm.pl>,
        linux-btrfs@...r.kernel.org, linux-raid@...r.kernel.org,
        linux-fsdevel@...r.kernel.org
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Lockdep warning after `mdadm -S`



On 10.06.20 at 10:19, Michał Mirosław wrote:
> Dear Developers,
> 
> I found a lockdep warning in dmesg some time after doing 'mdadm -S' while
> also having btrfs mounted (light to no I/O load).  The disks under MD and
> btrfs are unrelated.

Huhz, I think that's genuine, because btrfs and md share at least bd_mutex
and the workqueue. So the scenario could go something along the lines of:

					   
 					   						 
T1: initiates the process of stopping the md device; holds bd_mutex and
    open_mutex and blocks on flushing md_misc_wq (callstack #6).

T2: should run mddev_delayed_delete but never does, because the workqueue
    is blocked due to T3 being blocked.

T3: wants to begin a new transaction but blocks on callstack #0 due to the
    transaction commit running in T4. NB: this happens from writeback
    context, i.e. the same workqueue as T2.

T4: transaction commit, blocked on device_list_mutex (callstack #3) while
    holding sb_internal.

T5: user tries to (erroneously) mount the mddev but blocks at callstack #4,
    since bd_mutex is held by T1; T5 itself is holding device_list_mutex.

So T5 blocks T4, which blocks T3, which blocks the shared writeback workqueue;
this prevents T2 from running, and T2 finishing is what would let T1 unlock
bd_mutex and so unblock T5. I think this is generally possible but highly unlikely.
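
For reference, this is how I read the btrfs side of the cycle (T5), paraphrased
from callstack #4 below rather than copied verbatim from the 5.7 sources; the
_sketch name and the elided details are mine:

/* Edge #3 -> #4: the mount path holds fs_devices->device_list_mutex
 * while it opens every member device, and __blkdev_get() takes
 * bdev->bd_mutex for each of them - including the mddev that T1 is
 * busy stopping. */
static int btrfs_open_devices_sketch(struct btrfs_fs_devices *fs_devices,
                                     fmode_t flags, void *holder)
{
        int ret = 0;

        mutex_lock(&fs_devices->device_list_mutex);     /* #3 */
        if (!fs_devices->opened)
                /*
                 * open_fs_devices() -> btrfs_get_bdev_and_sb() ->
                 * blkdev_get_by_path() -> __blkdev_get(), which takes
                 * bdev->bd_mutex (#4) and, for an md device, calls
                 * md_open().
                 */
                ret = open_fs_devices(fs_devices, flags, holder);
        else
                fs_devices->opened++;
        mutex_unlock(&fs_devices->device_list_mutex);
        return ret;
}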

Also, looking at the code in T5 (callstack #4 below), it seems the same could
happen if a scan ioctl is sent to the mddev. I discussed this with peterz and
he proposed the following half-baked patch: https://paste.debian.net/1151314/

The idea is to remove the md_open mutex, which would break the dependency chain
between #4 and #6. What do the mdraid people think?
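
To make that concrete, here is how I read the two md-side edges the open_mutex
removal is aimed at. This is paraphrased from callstacks #5 and #6 below (the
_sketch names and the simplified mddev lookup are mine), not the actual patch
from the paste:

/* Edge #4 -> #5: __blkdev_get() holds bdev->bd_mutex when it calls the
 * driver's ->open(), so md_open() ends up taking mddev->open_mutex
 * nested under bd_mutex. */
static int md_open_sketch(struct block_device *bdev, fmode_t mode)
{
        struct mddev *mddev = bdev->bd_disk->private_data; /* lookup simplified */

        mutex_lock(&mddev->open_mutex);         /* #5, under #4 */
        /* openers accounting elided */
        mutex_unlock(&mddev->open_mutex);
        return 0;
}

/* Edge #5 -> #6: the stop path (T1) flushes md_misc_wq while still
 * holding mddev->open_mutex, so it cannot finish until
 * mddev_delayed_delete and friends have run. */
static int do_md_stop_sketch(struct mddev *mddev)
{
        mutex_lock(&mddev->open_mutex);         /* #5 */
        flush_workqueue(md_misc_wq);            /* #6, via __md_stop_writes() */
        /* actual teardown elided */
        mutex_unlock(&mddev->open_mutex);
        return 0;
}

With open_mutex gone from md_open(), bd_mutex (#4) would no longer sit above
open_mutex (#5), which is the link that currently chains it to the md_misc
workqueue flush (#6) in the report above; that is the #4 -> #6 dependency the
proposal tries to break, as far as I can tell.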
											
> 
> Best Regards,
> Michał Mirosław
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.7.1mq+ #383 Tainted: G           O     
> ------------------------------------------------------
> kworker/u16:3/8175 is trying to acquire lock:
> ffff8882f19556a0 (sb_internal#3){.+.+}-{0:0}, at: start_transaction+0x37e/0x550 [btrfs]
> 
> but task is already holding lock:
> ffffc900087c7e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x235/0x620
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
>        __flush_work+0x331/0x490
>        wb_shutdown+0x8f/0xb0
>        bdi_unregister+0x72/0x1f0
>        del_gendisk+0x2b0/0x2c0
>        md_free+0x28/0x90
>        kobject_put+0xa6/0x1b0
>        process_one_work+0x2b6/0x620
>        worker_thread+0x35/0x3e0
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #7 ((work_completion)(&mddev->del_work)){+.+.}-{0:0}:
>        process_one_work+0x28d/0x620
>        worker_thread+0x35/0x3e0
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #6 ((wq_completion)md_misc){+.+.}-{0:0}:
>        flush_workqueue+0xa9/0x4e0
>        __md_stop_writes+0x18/0x100
>        do_md_stop+0x165/0x2d0
>        md_ioctl+0xa52/0x1d60
>        blkdev_ioctl+0x1cc/0x2a0
>        block_ioctl+0x3a/0x40
>        ksys_ioctl+0x81/0xc0
>        __x64_sys_ioctl+0x11/0x20
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #5 (&mddev->open_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        md_open+0x43/0xc0
>        __blkdev_get+0xea/0x560
>        blkdev_get+0x60/0x130
>        do_dentry_open+0x147/0x3e0
>        path_openat+0x84f/0xa80
>        do_filp_open+0x8e/0x100
>        do_sys_openat2+0x225/0x2e0
>        do_sys_open+0x46/0x80
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #4 (&bdev->bd_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        __blkdev_get+0x77/0x560
>        blkdev_get+0x60/0x130
>        blkdev_get_by_path+0x41/0x80
>        btrfs_get_bdev_and_sb+0x16/0xb0 [btrfs]
>        open_fs_devices+0x9d/0x240 [btrfs]
>        btrfs_open_devices+0x89/0x90 [btrfs]
>        btrfs_mount_root+0x26a/0x4b0 [btrfs]
>        legacy_get_tree+0x2b/0x50
>        vfs_get_tree+0x23/0xc0
>        fc_mount+0x9/0x40
>        vfs_kern_mount.part.40+0x57/0x80
>        btrfs_mount+0x148/0x3f0 [btrfs]
>        legacy_get_tree+0x2b/0x50
>        vfs_get_tree+0x23/0xc0
>        do_mount+0x712/0xa40
>        __x64_sys_mount+0xbf/0xe0
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #3 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        btrfs_run_dev_stats+0x44/0x470 [btrfs]
>        commit_cowonly_roots+0xac/0x2a0 [btrfs]
>        btrfs_commit_transaction+0x511/0xa70 [btrfs]
>        transaction_kthread+0x13c/0x160 [btrfs]
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #2 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        btrfs_commit_transaction+0x4b6/0xa70 [btrfs]
>        transaction_kthread+0x13c/0x160 [btrfs]
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> -> #1 (&fs_info->reloc_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x93/0x9c0
>        btrfs_record_root_in_trans+0x3e/0x60 [btrfs]
>        start_transaction+0xcb/0x550 [btrfs]
>        btrfs_mkdir+0x5c/0x1e0 [btrfs]
>        vfs_mkdir+0x107/0x1d0
>        do_mkdirat+0xe7/0x110
>        do_syscall_64+0x4f/0x210
>        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> -> #0 (sb_internal#3){.+.+}-{0:0}:
>        __lock_acquire+0x11f9/0x1aa0
>        lock_acquire+0x9e/0x380
>        __sb_start_write+0x13a/0x270
>        start_transaction+0x37e/0x550 [btrfs]
>        cow_file_range_inline.constprop.74+0xe4/0x640 [btrfs]
>        cow_file_range+0xe5/0x3f0 [btrfs]
>        btrfs_run_delalloc_range+0x128/0x620 [btrfs]
>        writepage_delalloc+0xe2/0x140 [btrfs]
>        __extent_writepage+0x1a3/0x370 [btrfs]
>        extent_write_cache_pages+0x2b8/0x470 [btrfs]
>        extent_writepages+0x3f/0x90 [btrfs]
>        do_writepages+0x3c/0xe0
>        __writeback_single_inode+0x4f/0x650
>        writeback_sb_inodes+0x1f7/0x560
>        __writeback_inodes_wb+0x58/0xa0
>        wb_writeback+0x33b/0x4b0
>        wb_workfn+0x428/0x5b0
>        process_one_work+0x2b6/0x620
>        worker_thread+0x35/0x3e0
>        kthread+0x143/0x160
>        ret_from_fork+0x3a/0x50
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   sb_internal#3 --> (work_completion)(&mddev->del_work) --> (work_completion)(&(&wb->dwork)->work)
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock((work_completion)(&(&wb->dwork)->work));
>                                lock((work_completion)(&mddev->del_work));
>                                lock((work_completion)(&(&wb->dwork)->work));
>   lock(sb_internal#3);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kworker/u16:3/8175:
>  #0: ffff88840baa6948 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x235/0x620
>  #1: ffffc900087c7e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x235/0x620
>  #2: ffff8882f19550e8 (&type->s_umount_key#52){++++}-{3:3}, at: trylock_super+0x11/0x50
> 
> stack backtrace:
> CPU: 1 PID: 8175 Comm: kworker/u16:3 Tainted: G           O      5.7.1mq+ #383
> Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
> Workqueue: writeback wb_workfn (flush-btrfs-1)
> Call Trace:
>  dump_stack+0x71/0xa0
>  check_noncircular+0x165/0x180
>  ? stack_trace_save+0x46/0x70
>  __lock_acquire+0x11f9/0x1aa0
>  lock_acquire+0x9e/0x380
>  ? start_transaction+0x37e/0x550 [btrfs]
>  __sb_start_write+0x13a/0x270
>  ? start_transaction+0x37e/0x550 [btrfs]
>  start_transaction+0x37e/0x550 [btrfs]
>  ? kmem_cache_alloc+0x1b0/0x2c0
>  cow_file_range_inline.constprop.74+0xe4/0x640 [btrfs]
>  ? lock_acquire+0x9e/0x380
>  ? test_range_bit+0x3d/0x130 [btrfs]
>  cow_file_range+0xe5/0x3f0 [btrfs]
>  btrfs_run_delalloc_range+0x128/0x620 [btrfs]
>  ? find_lock_delalloc_range+0x1f3/0x220 [btrfs]
>  writepage_delalloc+0xe2/0x140 [btrfs]
>  __extent_writepage+0x1a3/0x370 [btrfs]
>  extent_write_cache_pages+0x2b8/0x470 [btrfs]
>  ? __lock_acquire+0x3fc/0x1aa0
>  extent_writepages+0x3f/0x90 [btrfs]
>  do_writepages+0x3c/0xe0
>  ? find_held_lock+0x2d/0x90
>  __writeback_single_inode+0x4f/0x650
>  writeback_sb_inodes+0x1f7/0x560
>  __writeback_inodes_wb+0x58/0xa0
>  wb_writeback+0x33b/0x4b0
>  wb_workfn+0x428/0x5b0
>  ? sched_clock_cpu+0xe/0xd0
>  process_one_work+0x2b6/0x620
>  ? worker_thread+0xc7/0x3e0
>  worker_thread+0x35/0x3e0
>  ? process_one_work+0x620/0x620
>  kthread+0x143/0x160
>  ? kthread_park+0x80/0x80
>  ret_from_fork+0x3a/0x50
> 
