[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <3f9a5e9d-4485-aa46-be31-f561ba5363c8@kernel.dk>
Date: Mon, 20 Apr 2020 10:36:22 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Douglas Anderson <dianders@...omium.org>, jejb@...ux.ibm.com,
martin.petersen@...cle.com
Cc: linux-scsi@...r.kernel.org, Gwendal Grignou <gwendal@...omium.org>,
sqazi@...gle.com, groeck@...omium.org,
Ming Lei <ming.lei@...hat.com>, linux-block@...r.kernel.org,
paolo.valente@...aro.org,
André Almeida <andrealmeid@...labora.com>,
Bart Van Assche <bvanassche@....org>,
Damien Le Moal <damien.lemoal@....com>,
John Garry <john.garry@...wei.com>,
Pavel Begunkov <asml.silence@...il.com>,
Sagi Grimberg <sagi@...mberg.me>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 0/4] blk-mq: Fix two causes of IO stalls found in
reboot testing
On 4/20/20 10:24 AM, Douglas Anderson wrote:
> While doing reboot testing, I found that occasionally my device would
> trigger the hung task detector. Many tasks were stuck waiting for the
> a blkdev mutex, but at least one task in the system was always sitting
> waiting for IO to complete (and holding the blkdev mutex). One
> example of a task that was just waiting for its IO to complete on one
> reboot:
>
> udevd D 0 2177 306 0x00400209
> Call trace:
> __switch_to+0x15c/0x17c
> __schedule+0x6e0/0x928
> schedule+0x8c/0xbc
> schedule_timeout+0x9c/0xfc
> io_schedule_timeout+0x24/0x48
> do_wait_for_common+0xd0/0x160
> wait_for_completion_io_timeout+0x54/0x74
> blk_execute_rq+0x9c/0xd8
> __scsi_execute+0x104/0x198
> scsi_test_unit_ready+0xa0/0x154
> sd_check_events+0xb4/0x164
> disk_check_events+0x58/0x154
> disk_clear_events+0x74/0x110
> check_disk_change+0x28/0x6c
> sd_open+0x5c/0x130
> __blkdev_get+0x20c/0x3d4
> blkdev_get+0x74/0x170
> blkdev_open+0x94/0xa8
> do_dentry_open+0x268/0x3a0
> vfs_open+0x34/0x40
> path_openat+0x39c/0xdf4
> do_filp_open+0x90/0x10c
> do_sys_open+0x150/0x3c8
> ...
>
> I've reproduced this on two systems: one boots from an internal UFS
> disk and one from eMMC. Each has a card reader attached via USB with
> an SD card plugged in. On the USB-attached SD card is a disk with 12
> partitions (a Chrome OS test image), if it matters. The system
> doesn't do much with the USB disk other than probe it (it's plugged in
> my system to help me recover).
>
> From digging, I believe that there are two separate but related
> issues. Both issues relate to the SCSI code saying that there is no
> budget.
>
> I have done testing with only one or the other of the two patches in
> this series and found that I could still encounter hung tasks if only
> one of the two patches was applied. This deserves a bit of
> explanation. To me, it's fairly obvious that the first fix wouldn't
> fix the problems talked about in the second patch. However, it's less
> obvious why the second patch doesn't fix the problems in
> blk_mq_dispatch_rq_list(). It turns out that it _almost_ does
> (problems become much more rare), but I did manage to get a single
> trace where the "kick" scheduled by the second patch happened really
> quickly. The scheduled kick then ran and found nothing to do. This
> happened in parallel to a task running in blk_mq_dispatch_rq_list()
> which hadn't gotten around to splicing the list back into
> hctx->dispatch. This is why we need both fixes.
>
> Most of my testing has been atop Chrome OS 5.4's kernel tree which
> currently has v5.4.30 merged in. The Chrome OS 5.4 tree also has a
> patch by Salman Qazi, namely ("block: Limit number of items taken from
> the I/O scheduler in one go"). Reverting that patch didn't make the
> hung tasks go away, so I kept it in for most of my testing.
>
> I have also done some testing on mainline Linux (most on what git
> describe calls v5.6-rc7-227-gf3e69428b5e2) even without Salman's
> patch. I found that I could reproduce the problems there and that
> traces looked about the same as I saw on the downstream branch. These
> patches were also confirmed to fix the problems on mainline.
>
> Chrome OS is currently setup to use the BFQ scheduler and I found that
> I couldn't reproduce the problems without BFQ. As discussed in the
> second patch this is believed to be because BFQ sometimes returns
> "true" from has_work() but then NULL from dispatch_request().
>
> I'll insert my usual caveat that I'm sending patches to code that I
> know very little about. If I'm making a total bozo patch here, please
> help me figure out how I should fix the problems I found in a better
> way.
>
> If you want to see a total ridiculous amount of chatter where I
> stumbled around a whole bunch trying to figure out what was wrong and
> how to fix it, feel free to read <https://crbug.com/1061950>. I
> promise it will make your eyes glaze over right away if this cover
> letter didn't already do that. Specifically comment 79 in that bug
> includes a link to my ugly prototype of making BFQ's has_work() more
> exact (I only managed it by actually defining _both_ an exact and
> inexact function to avoid circular locking problems when it was called
> directly from blk_mq_hctx_has_pending()). Comment 79 also has more
> thoughts about alternatives considered.
>
> I don't know if these fixes represent a regression of some sort or are
> new. As per above I could only reproduce with BFQ enabled which makes
> it nearly impossible to go too far back with this. I haven't listed
> any "Fixes" tags here, but if someone felt it was appropriate to
> backport this to some stable trees that seems like it'd be nice.
> Presumably at least 5.4 stable would make sense.
>
> Thanks to Salman Qazi, Paolo Valente, and Guenter Roeck who spent a
> bunch of time helping me trawl through some of this code and reviewing
> early versions of this patch.
Applied, thanks.
--
Jens Axboe
Powered by blists - more mailing lists