Message-ID: <112210b5-8f6b-c039-8f5f-b24ebc078204@univention.de>
Date: Thu, 19 Oct 2017 16:34:13 +0200
From: Philipp Hahn <hahn@...vention.de>
To: Ankur Arora <ankur.a.arora@...cle.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
xen-devel <xen-devel@...ts.xenproject.org>
Subject: Re: [Xen-devel] 4.9.52: INFO: task X blocked for more than 300
seconds.
Hello Ankur,
Am 05.10.2017 um 19:59 schrieb Ankur Arora:
> On 2017-10-05 06:20 AM, Konrad Rzeszutek Wilk wrote:
>> On Wed, Oct 04, 2017 at 08:26:27PM +0200, Philipp Hahn wrote:
...
>> Adding Ankur to this as I think he saw something similar.
>>
>> But in the meantime - do you see this with the latest version of Linux?
>>> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed
>>> several stuck processes: Here is one (truncated) dump of the Linux
>>> kernel messages:
>>>
>>>> [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0
>>>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>> [<ffffffff8160e032>] ? schedule+0x32/0x80
>>>> [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360
>>>> [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0* see below
>>>> [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20
>>>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>> [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130
>>>> [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80
>>>> [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60
>>>> [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90
>>>> [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>> [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0
>>>> [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40
>>>> [<ffffffffc00abd18>] ? jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2]
>>>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>> [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80
>>>> [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2]
>>>> [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30
>>>> [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2]
>>>> [<ffffffff81097660>] ? kthread+0xf0/0x110
>>>> [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>> [<ffffffff81097570>] ? kthread_park+0x60/0x60
>>>> [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> This looks like this race: https://patchwork.kernel.org/patch/9853443/
I built a new kernel with that patch applied on top of 4.9.56. We are
currently testing it, but it crashed again yesterday evening. Here is
the dmesg output:
> INFO: task systemd:1 blocked for more than 120 seconds.
> Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201710161640
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> systemd D 0 1 0 0x00000000
> ffff8801f1797c00 0000000000000000 ffffffff81c0e540 ffff8801f4956040
> ffff8801f5a187c0 ffffc90040c4f880 ffffffff8160ebbd ffffffff81186295
> ffffffff81350b39 0000000002cebf80 ffffffff811863c8 ffff8801f4956040
> Call Trace:
> [<ffffffff8160ebbd>] ? __schedule+0x23d/0x6d0
> [<ffffffff81186295>] ? move_freepages+0x95/0xd0
> [<ffffffff81350b39>] ? list_del+0x9/0x20
> [<ffffffff811863c8>] ? __rmqueue+0x88/0x3e0
> [<ffffffff8160f082>] ? schedule+0x32/0x80
> [<ffffffff8161255c>] ? schedule_timeout+0x1ec/0x360
> [<ffffffff81187530>] ? get_page_from_freelist+0x350/0xad0
> [<ffffffff8160e904>] ? io_schedule_timeout+0xb4/0x130
> [<ffffffff81364fd4>] ? __sbitmap_queue_get+0x24/0x90
> [<ffffffff813114f9>] ? bt_get.isra.6+0x129/0x1c0
> [<ffffffff81350b39>] ? list_del+0x9/0x20
> [<ffffffff810bba00>] ? wake_up_atomic_t+0x30/0x30
> [<ffffffff81311843>] ? blk_mq_get_tag+0x23/0x90
> [<ffffffff8130d1aa>] ? __blk_mq_alloc_request+0x1a/0x220
> [<ffffffff8130dffd>] ? blk_mq_map_request+0xcd/0x170
> [<ffffffff8131016a>] ? blk_sq_make_request+0xca/0x4c0
> [<ffffffff81303e1a>] ? generic_make_request_checks+0x22a/0x4f0
> [<ffffffff81304761>] ? generic_make_request+0x121/0x2c0
> [<ffffffff8117e823>] ? __add_to_page_cache_locked+0x183/0x230
> [<ffffffff81304976>] ? submit_bio+0x76/0x150
> [<ffffffff8117e964>] ? add_to_page_cache_lru+0x84/0xe0
> [<ffffffffc017b9d9>] ? ext4_mpage_readpages+0x2b9/0x8b0 [ext4]
> [<ffffffff811d94fa>] ? alloc_pages_current+0x8a/0x110
> [<ffffffff8118e745>] ? __do_page_cache_readahead+0x195/0x240
> [<ffffffff8117f627>] ? pagecache_get_page+0x27/0x2b0
> [<ffffffff811818e6>] ? filemap_fault+0x276/0x590
> [<ffffffffc0138491>] ? ext4_filemap_fault+0x31/0x50 [ext4]
> [<ffffffff811b23a4>] ? __do_fault+0x84/0x190
> [<ffffffff811b783e>] ? handle_mm_fault+0xede/0x1680
> [<ffffffff8124d25e>] ? ep_poll+0x13e/0x360
> [<ffffffff8105feda>] ? __do_page_fault+0x26a/0x500
> [<ffffffff81205ba2>] ? SyS_read+0x52/0xc0
> [<ffffffff81614d58>] ? page_fault+0x28/0x30
I haven't yet been able to get the address of the queue object to check
its state (I ran out of time for today).
> Can you dump the output of: cat /sys/block/$xen-frontend-device/mq/*/tags
>
> If you've hit this bug, one or more of the MQs would be wedged and
> the nr_free in one or more of the queues would be 0 and will not
> change.
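Something like the following small script could automate that check on a
live guest. This is an untested sketch: it assumes the sysfs layout above
and that each tags file contains a line like "nr_free=NN, nr_reserved=NN",
as you describe.

#!/usr/bin/env python
# Untested sketch: scan every blk-mq "tags" file and flag queues whose
# nr_free is 0, which per the description above would indicate a wedged MQ.
# Assumes files at /sys/block/<dev>/mq/<hctx>/tags with an "nr_free=" line.
import glob
import re

for path in sorted(glob.glob('/sys/block/*/mq/*/tags')):
    with open(path) as f:
        m = re.search(r'nr_free=(\d+)', f.read())
    if not m:
        continue
    nr_free = int(m.group(1))
    suffix = '  <-- possibly wedged' if nr_free == 0 else ''
    print('%s: nr_free=%d%s' % (path, nr_free, suffix))

Run in a loop, that would also show whether the value ever changes again.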
Unfortunately, as soon as the bug occurs we can no longer access the VM via
ssh or the Xen (serial) console: the connection stalls after entering 2-3
characters.
I do have a Xen crash-dump file of one such crash, but manually following
/sys/block/xvd?/mq/*/tags to the kobject address using "crash" (gdb) is
very time consuming. So far I have only done that for xvda, but I would
have to repeat it for the other 15 block devices as well to find the
culprit.
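Starting from "dev -d" instead of walking the sysfs kobjects might shorten
the per-device walk to roughly the following; the member names are what I
expect for 4.9, so treat this as a sketch rather than an exact recipe:

crash> dev -d                                   # gendisk and request_queue address per disk
crash> struct request_queue.nr_hw_queues,queue_hw_ctx <request_queue>
crash> rd <queue_hw_ctx> <nr_hw_queues>         # array of blk_mq_hw_ctx pointers
crash> struct blk_mq_hw_ctx.tags <hctx>
crash> struct blk_mq_tags <tags>                # nr_tags, bitmap_tags (sbitmap), ...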
Philipp Hahn
--
Philipp Hahn
Open Source Software Engineer
Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@...vention.de
http://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-0287
View attachment "blk_sq_make_request.txt" of type "text/plain" (9721 bytes)
View attachment "blk_sq_make_request.s" of type "text/plain" (26172 bytes)