linux-kernel - Re: blkdev_issue_discard() hangs forever if the underlying storage device is removed

Date:	Tue, 30 Aug 2011 12:01:00 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Bart Van Assche <bvanassche@....org>
Cc:	Lukas Czerner <lczerner@...hat.com>,
	Jens Axboe <jaxboe@...ionio.com>,
	Mike Snitzer <snitzer@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: blkdev_issue_discard() hangs forever if the underlying storage
 device is removed

On Mon, Aug 29, 2011 at 07:56:33PM +0200, Bart Van Assche wrote:
> On Mon, Aug 29, 2011 at 1:56 PM, Lukas Czerner <lczerner@...hat.com> wrote:
> > On Sat, 27 Aug 2011, Bart Van Assche wrote:
> >
> >> Apparently blkdev_issue_discard() never times out, not even if the
> >> device has been removed. This is what appeared in the kernel log after
> >> device removal (triggered by running mkfs.ext4 on an SRP SCSI device
> >> node):
> >>
> >> sd 15:0:0:0: [sdb] Attached SCSI disk
> >> scsi host15: SRP abort called
> >> scsi host15: SRP reset_device called
> >> scsi host15: ib_srp: SRP reset_host called
> >> scsi host15: ib_srp: connection closed
> >> scsi host15: ib_srp: Got failed path rec status -110
> >> scsi host15: ib_srp: Path record query failed
> >> scsi host15: ib_srp: reconnect failed (-110), removing target port.
> >> sd 15:0:0:0: Device offlined - not ready after error recovery
> >> INFO: task mkfs.ext4:4304 blocked for more than 120 seconds.
> >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> mkfs.ext4       D 0000000000000000     0 4304 3649 0x00000000
> >>  ffff88006c313b98 0000000000000046 ffffffff813e3038 ffffffff81e6b580
> >>  0000000000000082 000000010003cfdc ffff88006c313fd8 ffff880070fbcbc0
> >>  00000000001d1f40 ffff88006c313fd8 ffff88006c312000 ffff88006c312000
> >> Call Trace:
> >>  [<ffffffff813e3038>] ? schedule+0x628/0x830
> >>  [<ffffffff813e3835>] schedule_timeout+0x1d5/0x310
> >>  [<ffffffff810805de>] ? put_lock_stats+0xe/0x40
> >>  [<ffffffff81080e05>] ? lock_release_holdtime+0xb5/0x160
> >>  [<ffffffff813e6ac0>] ? _raw_spin_unlock_irq+0x30/0x60
> >>  [<ffffffff8103f7d9>] ? sub_preempt_count+0xa9/0xe0
> >>  [<ffffffff813e28e0>] wait_for_common+0x110/0x160
> >>  [<ffffffff810425f0>] ? try_to_wake_up+0x2c0/0x2c0
> >>  [<ffffffff813e2a0d>] wait_for_completion+0x1d/0x20
> >>  [<ffffffff811de93a>] blkdev_issue_discard+0x27a/0x2c0
> >>  [<ffffffff813e2806>] ? wait_for_common+0x36/0x160
> >>  [<ffffffff811df371>] blkdev_ioctl+0x701/0x760
> >>  [<ffffffff8112b7bf>] ? kmem_cache_free+0x6f/0x160
> >>  [<ffffffff811755b7>] block_ioctl+0x47/0x50
> >>  [<ffffffff81151b78>] do_vfs_ioctl+0x98/0x570
> >>  [<ffffffff813e76dc>] ? sysret_check+0x27/0x62
> >>  [<ffffffff8115209f>] sys_ioctl+0x4f/0x80
> >>  [<ffffffff813e76ab>] system_call_fastpath+0x16/0x1b
> >> no locks held by mkfs.ext4/4304.
> >>
> >> The above message kept repeating forever until system reboot.
> >>
> >> Kernel version:
> >> $ git show | head -n 1
> >> commit ed8f37370d83e695c0a4fa5d5fc7a83ecb947526
> >> $ git describe
> >> v3.0-7216-ged8f373
> >>
> >> I'm considering this as a bug because the state described above makes it
> >> impossible to kill the mkfs process and also makes it impossible to remove the
> >> kernel module ib_srp. That's why I also reported this as
> >> https://bugzilla.kernel.org/show_bug.cgi?id=40472.
> >
> > I am trying to find some race condition that would cause the problem in
> > blkdev_issue_discard(), however I can not see anything.
> 
> I'm not sure why you are looking for a race condition - this looks
> like a plain deadlock to me.
> 
> > The only reason for this to happen I can see is that the last bio was
> > not completed yet (e.g. the bio_batch_end_io() callback has not been
> > called by the last submitted bio). Does bios have some sort of timeout
> > after it dies out? Is it possible that we can lose bio like that ?
> 
> A key fact here is that the block device to which the discard request
> was issued is gone, so the discard request will never finish
> successfully. Do all relevant error paths guarantee that
> blkdev_issue_discard() will finish in a finite time ?

The underlying block device driver is supposed to handle timing out
of lost IOs and causing them to be completed with an error.
blkdev_issue_discard() is simply waiting for that error to be
delivered.

If the driver has not detected that an outstanding request has not
completed then that is a driver bug, not a bug in
blkdev_issue_discard(). IOWs, you should be asking the ib_srp people
why the in flight bio was not timed out or errored out when the
block device abort was run....
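
As a rough illustration of that contract, here is a user-space sketch in Python. It is not kernel code: Completion, issue_discard(), good_driver_submit(), buggy_driver_submit() and run_with_watchdog() are hypothetical names made up for this model, and the timeouts are arbitrary. It only shows that a waiter with no timeout of its own returns if, and only if, the driver side completes the request, even if just with an error.

```python
import errno
import threading


class Completion:
    """User-space stand-in for the kernel's struct completion (assumption)."""

    def __init__(self):
        self._done = threading.Event()
        self.status = None

    def complete(self, status):
        self.status = status
        self._done.set()

    def wait(self):
        # Like wait_for_completion(): no timeout of its own.
        self._done.wait()
        return self.status


def issue_discard(submit):
    """Hypothetical model of the submitter: hand the request to the
    driver, then block until the driver reports a result."""
    done = Completion()
    submit(done)
    return done.wait()


def good_driver_submit(done, timeout=0.05):
    """The device is gone, so the request never finishes on its own.
    A well-behaved driver times it out and completes it with an error."""
    timer = threading.Timer(timeout, done.complete, args=(-errno.EIO,))
    timer.daemon = True
    timer.start()


def buggy_driver_submit(done):
    """Loses the request: never completes it, not even with an error."""


def run_with_watchdog(submit, limit=0.5):
    """Run issue_discard() in a daemon thread so the demo itself cannot
    hang. Returns (finished, status)."""
    result = {}
    worker = threading.Thread(
        target=lambda: result.update(status=issue_discard(submit)),
        daemon=True,
    )
    worker.start()
    worker.join(limit)
    return (not worker.is_alive(), result.get("status"))
```

With good_driver_submit the waiter returns -EIO shortly after the simulated timeout. With buggy_driver_submit it is still blocked when the watchdog gives up, which is the model's analogue of the hung mkfs.ext4 task in the report.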

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com