Message-ID: <22E823DBB7698E489DC113638F7470729C1AF0@DGGEMM506-MBX.china.huawei.com>
Date: Tue, 15 Aug 2017 02:16:11 +0000
From: "Tangchen (UVP)" <tang.chen@...wei.com>
To: Bart Van Assche <Bart.VanAssche@....com>,
"lduncan@...e.com" <lduncan@...e.com>,
"cleech@...hat.com" <cleech@...hat.com>,
"axboe@...nel.dk" <axboe@...nel.dk>
CC: "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
guijianfeng <guijianfeng@...wei.com>,
zhengchuan <zhengchuan@...wei.com>,
"Tangchen (UVP)" <tang.chen@...wei.com>
Subject: RE: [iscsi] Deadlock occurred when network is in error
Hi, Bart,
Thank you very much for the quick response.
But I'm not using mq; I ran into these two problems on a non-mq system.
The patch you pointed out is a fix for mq, so I don't think it can resolve this problem.
IIUC, mq is mainly meant for SSDs? I'm not using an SSD, so mq is disabled.
On Mon, 2017-08-14 at 11:23 +0000, Tangchen (UVP) wrote:
> Problem 2:
>
> ***************
> [What it looks like]
> ***************
> When removing a SCSI device while a network error is happening, __blk_drain_queue() can hang forever.
>
> # cat /proc/19160/stack
> [<ffffffff8005886d>] msleep+0x1d/0x30
> [<ffffffff80201a84>] __blk_drain_queue+0xe4/0x160
> [<ffffffff80202766>] blk_cleanup_queue+0x106/0x2e0
> [<ffffffffa000fb02>] __scsi_remove_device+0x52/0xc0 [scsi_mod]
> [<ffffffffa000fb9b>] scsi_remove_device+0x2b/0x40 [scsi_mod]
> [<ffffffffa000fbc0>] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
> [<ffffffff801a4e75>] sysfs_schedule_callback_work+0x15/0x80
> [<ffffffff80062d69>] process_one_work+0x169/0x340
> [<ffffffff800667e3>] worker_thread+0x183/0x490
> [<ffffffff8006a526>] kthread+0x96/0xa0
> [<ffffffff8041ebb4>] kernel_thread_helper+0x4/0x10
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The request queue of this device was stopped. So the following check will be true forever:
> __blk_run_queue()
> {
> 	if (unlikely(blk_queue_stopped(q)))
> 		return;
>
> 	__blk_run_queue_uncond(q);
> }
>
> So __blk_run_queue_uncond() will never be called, and the process hangs.
>
> [ ... ]
>
> ****************
> [How to reproduce]
> ****************
> Unfortunately I cannot reproduce it in the latest kernel.
> The script below will help to reproduce, but not very often.
>
> # create network error
> tc qdisc add dev eth1 root netem loss 60%
>
> # restart iscsid and rescan scsi bus again and again
> while [ 1 ]
> do
>     systemctl restart iscsid
>     rescan-scsi-bus (http://manpages.ubuntu.com/manpages/trusty/man8/rescan-scsi-bus.8.html)
> done
This should have been fixed by commit 36e3cf273977 ("scsi: Avoid that SCSI queues get stuck"). The first mainline kernel that includes this commit is kernel v4.11.
> void __blk_run_queue(struct request_queue *q)
> {
> -	if (unlikely(blk_queue_stopped(q)))
> +	if (unlikely(blk_queue_stopped(q)) &&
> +	    unlikely(!blk_queue_dying(q)))
> 		return;
>
> 	__blk_run_queue_uncond(q);
> }
Are you aware that the single queue block layer is on its way out and will be removed sooner or later? Please focus your testing on scsi-mq.
Regarding the above patch: it is wrong because it will cause lockups during path removal for other block drivers. Please drop this patch.
Bart.