[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <201806131038.02793.randrianasulu@gmail.com>
Date: Wed, 13 Jun 2018 10:38:02 +0300
From: Andrew Randrianasulu <randrianasulu@...il.com>
To: "jianchao.wang" <jianchao.w.wang@...cle.com>
Cc: Bart Van Assche <Bart.VanAssche@....com>,
"rdunlap@...radead.org" <rdunlap@...radead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
"hch@....de" <hch@....de>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>
Subject: Re: kernel BUG at drivers/scsi/scsi_error.c:197! - git 4.17.0-x64-08428-g7d3bf613e99a
В сообщении от Wednesday 13 June 2018 07:03:47 jianchao.wang написал(а):
> On 06/12/2018 11:28 PM, Bart Van Assche wrote:
> > On Sat, 2018-06-09 at 08:02 -0700, Randy Dunlap wrote:
> >> On 06/09/2018 06:06 AM, Andrew Randrianasulu wrote:
> >>> Hello, everyone!
> >>>
> >>> I was trying to test not-yet-completed 4.18-rc1, and run into this bug,
> >>> probably due to HAL (yes, I still use this) trying to poll empty DVD-RW
> >>> drive:
> >>>
> >>> [ 35.040167] ------------[ cut here ]------------
> >>> [ 35.040171] kernel BUG at drivers/scsi/scsi_error.c:197!
> >>> [ 35.040196] invalid opcode: 0000 [#1] SMP NOPTI
> >>> [ 35.040210] CPU: 2 PID: 117 Comm: kworker/2:1H Not tainted
> >>> 4.17.0-x64-08428-g7d3bf613e99a-dirty #18
> >>> [ 35.040225] Hardware name: MSI MS-7693/970A-G46 (MS-7693), BIOS V2.8
> >>> 01/08/2016
> >>> [ 35.040244] Workqueue: kblockd blk_mq_timeout_work
> >>> [ 35.040283] RIP: 0010:scsi_times_out+0xe9/0x1a6 [scsi_mod]
> >>> [ 35.040287] Code: 03 76 18 48 c7 c2 f9 c2 01 a0 48 89 ee 48 c7 c7 14
> >>> be 01 a0 31 c0 e8 76 ad 00 00 48 8b 83 98 01 00 00 a8 01 0f 84 a5 00 00
> >>> 00 <0f> 0b 49 8b 7c 24 38 e8 8f ba 4b e1 41 83 bc 24 2c 01
> >>> 00 00 ff 74
> >>> [ 35.040312] RSP: 0018:ffffc90000293d98 EFLAGS: 00010202
> >>> [ 35.040318] RAX: ffff88031f240605 RBX: ffff88031eed8500 RCX:
> >>> 00000000fffefd40 [ 35.040323] RDX: ffffc90000293e80 RSI:
> >>> 0000000000002007 RDI: ffff88031eed8638 [ 35.040327] RBP:
> >>> ffff88031eed8638 R08: ffff88031eed85d8 R09: 0000000000000000 [
> >>> 35.040332] R10: 0000000000000040 R11: fefefefefefefeff R12:
> >>> ffff88031f2a8800 [ 35.040336] R13: ffff88031f2a8800 R14:
> >>> ffff88031f29e910 R15: 0000000000000001 [ 35.040341] FS:
> >>> 0000000000000000(0000) GS:ffff88032ed00000(0000) knlGS:0000000000000000
> >>> [ 35.040346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [ 35.040350] CR2: 00000000f7ea83c0 CR3: 000000031e194000 CR4:
> >>> 00000000000406e0 [ 35.040354] Call Trace:
> >>> [ 35.040360] blk_mq_check_expired+0xdc/0x10c
> >>> [ 35.040365] bt_iter+0x42/0x45
> >>> [ 35.040369] __sbitmap_for_each_set.constprop.12+0x83/0xac
> >>> [ 35.040374] ? blk_mq_update_nr_requests+0xad/0xad
> >>> [ 35.040595] ? blk_mq_exit_hctx+0xda/0xda
> >>> [ 35.041595] blk_mq_queue_tag_busy_iter+0xa3/0xb4
> >>> [ 35.042578] ? blk_mq_exit_hctx+0xda/0xda
> >>> [ 35.043550] blk_mq_timeout_work+0x6b/0xca
> >>> [ 35.044505] process_one_work+0x17c/0x2a6
> >>> [ 35.045445] worker_thread+0x19f/0x243
> >>> [ 35.046393] ? rescuer_thread+0x262/0x262
> >>> [ 35.047346] kthread+0x100/0x108
> >>> [ 35.048290] ? kthread_destroy_worker+0x3e/0x3e
> >>> [ 35.049242] ret_from_fork+0x27/0x50
> >>> [ 35.050192] Modules linked in: bridge stp llc bnep rfcomm hidp
> >>> snd_aloop ipv6
>
> .....
>
> > Is this something that happens systematically or sporadically?
> > Additionally, have you verified whether or not this can be triggered with
> > kernel v4.17? I'm wondering whether or not this is a regression that was
> > introduced in the merge window.
>
> I suspect this is due to we could expire a same request twice or even more.
> For scsi mid-layer, it return BLK_EH_DONE from .timeout, in fact, the
> request is not completed there, but just queue a delayed abort_work
> (HZ/100). If the blk_mq_timeout_work runs again before the abort_work, the
> request will be timed out again, because there is not any mark on it to
> identify this request has been timed out.
>
> Would please try the patch attached on to see whether this issue could be
> fixed ? (this patch only works for scsi device currently)
with this patch on top of 4.17.0-x64-11730-gf5b7769eb040-dirty I can
suspend/resume my machine again, no errors in dmesg (attached)
Thanks!
>
> Thanks
> Jianchao
View attachment "dmesg_4.17-git_patched.log" of type "text/x-log" (61458 bytes)
Powered by blists - more mailing lists