Message-ID: <s5h37vi40be.wl-tiwai@suse.de>
Date:	Fri, 04 Dec 2015 17:59:17 +0100
From:	Takashi Iwai <tiwai@...e.de>
To:	Hannes Reinecke <hare@...e.de>
Cc:	Mike Snitzer <snitzer@...hat.com>, emilne@...hat.com,
	Christoph Hellwig <hch@...radead.org>,
	Michael Ellerman <mpe@...erman.id.au>,
	Mark Salter <msalter@...hat.com>,
	"James E. J. Bottomley" <JBottomley@...n.com>,
	brking <brking@...ibm.com>, linux-scsi@...r.kernel.org,
	linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
	linux-block@...r.kernel.org,
	Jun'ichi Nomura <j-nomura@...jp.nec.com>,
	Jens Axboe <axboe@...com>
Subject: Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!

On Wed, 25 Nov 2015 20:01:47 +0100,
Hannes Reinecke wrote:
> 
> On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> > On Wed, Nov 25 2015 at  4:04am -0500,
> > Hannes Reinecke <hare@...e.de> wrote:
> >
> >> On 11/20/2015 04:28 PM, Ewan Milne wrote:
> >>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >>>> Can't we have a joint effort here?
> >>>> I've been spending a _LOT_ of time trying to debug things here, but
> >>>> none of the ideas I've come up with have been able to fix anything.
> >>>
> >>> Yes.  I'm not the one primarily looking at it, and we don't have a
> >>> reproducer in-house.  We just have the one dump right now.
> >>>
> >>>>
> >>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >>>> by one and be done with ...
> >>>>
> >>>
> >>> That might not fix it if it is a problem with the merge code, though.
> >>>
> >> And indeed, it doesn't.
> >
> > How did you arrive at that?  Do you have a reproducer now?
> >
> Not a reproducer, but several dumps for analysis.
> 
> >> Seems I finally found the culprit.
> >>
> >> What happens is this:
> >> We have two paths, with these seg_boundary_masks:
> >>
> >> path-1:    seg_boundary_mask = 65535,
> >> path-2:    seg_boundary_mask = 4294967295,
> >>
> >> consequently the DM request queue has this:
> >>
> >> md-1:    seg_boundary_mask = 65535,
> >>
> >> What happens now is that a request is being built and sent
> >> to path 2. During submission req->nr_phys_segments is calculated
> >> with the limits of path 2, arriving at a count of 3.
> >> Now the request gets retried on path 1, but as the NOMERGE request
> >> flag is set req->nr_phys_segments is never updated.
> >> But blk_rq_map_sg() ignores all counters, and just uses the
> >> bi_vec directly, resulting in a count of 4 -> boom.
> >>
> >> So the culprit here is the NOMERGE flag,
> >
> > NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
> >
> Yes.
> 
> >> which is evaluated via
> >> ->dm_dispatch_request()
> >>    ->blk_insert_cloned_request()
> >>      ->blk_rq_check_limits()
> >
> > blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> > anyway after reading your mail I'm still left wondering if your proposed
> > patch is correct.
> >
> >> If the above assessment is correct, the following patch should
> >> fix it:
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 801ced7..12cccd6 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> >>    */
> >>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> >>   {
> >> -       if (!rq_mergeable(rq))
> >> +       if (rq->cmd_type != REQ_TYPE_FS)
> >>                  return 0;
> >>
> >>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> >> rq->cmd_flags)) {
> >>
> >>
> >> Mike? Jens?
> >> Can you comment on it?
> >
> > You're not explaining the actual change in the patch very well; I think
> > you're correct but you're leaving the justification as an exercise to
> > the reviewer:
> >
> > blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> > !rq_mergeable() check but you're saying for this case in question we
> > never get there -- due to the cloned request having NOMERGE set.
> >
> > So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> > open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
> >
> > I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> > the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> > sense for cloned requests that always have NOMERGE set.
> >
> > So you're saying that by having blk_rq_check_limits() go on to call
> > blk_recalc_rq_segments() this bug will be fixed?
> >
> That is the idea.
> 
> I've already established that in all instances I have seen so far
> req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.
> 
> As it turns out, req->nr_phys_segments _would_ have been updated in
> blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
> for the cloned request.
> So each cloned request inherits the values from the original request,
> despite the fact that req->nr_phys_segments _has_ to be evaluated in
> the final request_queue context, as the queue limits _might_ be 
> different from the original (merged) queue limits of the multipath
> request queue.
> 
> > BTW, I think blk_rq_check_limits()'s export should be removed and the
> > function made static and renamed to blk_clone_rq_check_limits(), again:
> > blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
> >
> Actually, seeing Jens' last comment the check for REQ_TYPE_FS is 
> pointless, too, so we might as well remove the entire if-clause.
> 
> > Seems prudent to make that change now to be clear that this code is only
> > used by cloned requests.
> >
> Yeah, that would make sense. I'll be preparing a patch.
> With a more detailed description :-)

Do we already have a fix?  I just hit (most likely) this kernel BUG()
on a nearly current Linus tree (commit 25364a9e54fb8296).  It
happened while I was starting a KVM guest right after a fresh boot.  The
machine had panicked even before that, so I hit this twice today.

I have a crash dump, so if anything is needed, let me know.
(But the system had to be restarted since it's my main workstation.)


thanks,

Takashi

===
[  137.211255] ------------[ cut here ]------------
[  137.211265] WARNING: CPU: 0 PID: 391 at block/blk-merge.c:435 blk_rq_map_sg+0x430/0x4d0()
[  137.211267] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[  137.211326]  mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[  137.211342] CPU: 0 PID: 391 Comm: kworker/0:1H Not tainted 4.4.0-rc3-test2+ #183
[  137.211344] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[  137.211349] Workqueue: kblockd bfq_kick_queue
[  137.211352]  ffffffff81a6b805 ffff880214307ab8 ffffffff813b1cc9 0000000000000000
[  137.211357]  ffff880214307af0 ffffffff8106d9a2 ffff88019e7d49c8 0000000000000000
[  137.211362]  0000000000000000 0000000000001000 0000000000000020 ffff880214307b00
[  137.211367] Call Trace:
[  137.211373]  [<ffffffff813b1cc9>] dump_stack+0x4b/0x72
[  137.211378]  [<ffffffff8106d9a2>] warn_slowpath_common+0x82/0xc0
[  137.211383]  [<ffffffff8106da9a>] warn_slowpath_null+0x1a/0x20
[  137.211387]  [<ffffffff81381340>] blk_rq_map_sg+0x430/0x4d0
[  137.211393]  [<ffffffff81514a60>] ? scsi_init_cmd_errh+0x90/0x90
[  137.211398]  [<ffffffff81514fc4>] scsi_init_sgtable+0x44/0x80
[  137.211402]  [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[  137.211408]  [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[  137.211413]  [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[  137.211416]  [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[  137.211420]  [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[  137.211424]  [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[  137.211428]  [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[  137.211433]  [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[  137.211438]  [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[  137.211442]  [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[  137.211447]  [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[  137.211451]  [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[  137.211456]  [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[  137.211460]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.211464]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.211468]  [<ffffffff8109206e>] kthread+0xfe/0x120
[  137.211473]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.211478]  [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[  137.211482]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.211485] ---[ end trace b39a62f95b1cad19 ]---
[  137.211499] ------------[ cut here ]------------
[  137.211506] kernel BUG at drivers/scsi/scsi_lib.c:1096!
[  137.211511] invalid opcode: 0000 [#1] PREEMPT SMP 
[  137.211519] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[  137.211630]  mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[  137.211660] CPU: 0 PID: 391 Comm: kworker/0:1H Tainted: G        W       4.4.0-rc3-test2+ #183
[  137.211668] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[  137.211675] Workqueue: kblockd bfq_kick_queue
[  137.211681] task: ffff880214754a40 ti: ffff880214304000 task.ti: ffff880214304000
[  137.211713] RIP: 0010:[<ffffffff81514ff2>]  [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[  137.211747] RSP: 0018:ffff880214307b90  EFLAGS: 00010097
[  137.211777] RAX: 000000000000000e RBX: ffff8802147e68b0 RCX: 0000000000000006
[  137.211807] RDX: 0000000000000024 RSI: 0000000000000000 RDI: 0000000000000009
[  137.211838] RBP: ffff880214307ba8 R08: 0000000000000001 R09: 0000000000000000
[  137.211868] R10: ffff8802147e68b0 R11: 0000000000000001 R12: ffff88003711c320
[  137.211899] R13: 0000000000000000 R14: ffff8800370be000 R15: ffff8802147e68b0
[  137.211930] FS:  0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
[  137.211961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.211991] CR2: 000000000041bee0 CR3: 00000000d85ac000 CR4: 00000000001426f0
[  137.212021] Stack:
[  137.212048]  ffff88003711c200 000000000003b000 0000000000000000 ffff880214307bf0
[  137.212082]  ffffffff81515045 ffff880214307c28 0000000000000092 ffff8802147e68b0
[  137.212115]  000000000003b000 ffff8800d8a90800 ffff8800370be000 0000000000000020
[  137.212148] Call Trace:
[  137.212178]  [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[  137.212205]  [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[  137.212232]  [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[  137.212258]  [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[  137.212285]  [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[  137.212312]  [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[  137.212338]  [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[  137.212365]  [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[  137.212392]  [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[  137.212418]  [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[  137.212444]  [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[  137.212471]  [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[  137.212498]  [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[  137.212525]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.212552]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.212578]  [<ffffffff8109206e>] kthread+0xfe/0x120
[  137.212605]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.212631]  [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[  137.212657]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.212683] Code: 24 08 72 27 41 89 44 24 08 8b 43 5c 5b 41 89 44 24 10 44 89 e8 41 5c 41 5d 5d c3 41 bd 02 00 00 00 5b 44 89 e8 41 5c 41 5d 5d c3 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 
[  137.212798] RIP  [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[  137.212826]  RSP <ffff880214307b90>