Message-ID: <CAFz9U7TAWtpQPfgG0u=6_jtVUhK5ahPGGnGy6GK8RZ-0XM2pfg@mail.gmail.com>
Date: Thu, 25 Aug 2016 23:52:07 -0700
From: Roy Yang <roy@...esity.com>
To: linux-ext4@...r.kernel.org
Subject: Ext4 stuck at wait_transaction_locked
Hi Ext4 experts,

I need your help debugging an ext4 issue. We consistently see ext4 get
stuck in wait_transaction_locked after another process is killed by the
cgroup OOM killer. We have two processes continuously writing data to
the same disk; when one of them is OOM-killed, the other soon stalls on
all of its I/O operations.
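In case it is useful, here is a simplified, hypothetical sketch of the
workload shape in C (the paths, write sizes and memory-hog behaviour are
placeholders, not our actual code); the second writer is the one meant to
run inside a memory-limited cgroup so the OOM killer takes it down:

/*
 * Hypothetical repro sketch: two processes streaming writes to files on
 * the same ext4 disk; one of them also keeps allocating memory so that,
 * when confined to a memory cgroup, it eventually gets OOM-killed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void write_loop(const char *path, int hog_memory)
{
        static char buf[1 << 20];       /* 1 MiB write buffer */
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
                perror("open");
                exit(1);
        }
        memset(buf, 'x', sizeof(buf));

        for (;;) {
                if (write(fd, buf, sizeof(buf)) < 0) {
                        perror("write");
                        exit(1);
                }
                if (hog_memory) {
                        /* leak and touch memory so the cgroup OOM killer
                         * eventually picks this process */
                        char *p = malloc(64 << 20);
                        if (p)
                                memset(p, 0, 64 << 20);
                }
        }
}

int main(void)
{
        /* writer 1: plain streaming writes, runs until killed */
        if (fork() == 0)
                write_loop("/mnt/sdb1/writer1.dat", 0);

        /* writer 2: streaming writes plus a growing memory footprint;
         * run under a memory-limited cgroup it gets OOM-killed, and
         * shortly afterwards writer 1 stalls in D state */
        if (fork() == 0)
                write_loop("/mnt/sdb1/writer2.dat", 1);

        for (;;)
                pause();
        return 0;
}

(The cgroup memory limit is applied externally in our setup, e.g. via the
v1 memory controller's limit_in_bytes; the sketch itself does not set it.)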
After searching online I found a couple of similar reports, but I'm not
sure whether this is a known issue.

Below is the system information and the relevant part of dmesg; I also
have a vmcore generated via sysrq+c.
[tmp]$ uname -a
Linux sedhaswell04-node-1 3.10.0-327.22.2.el7.cohesity.x86_64 #1 SMP
Tue Jul 5 12:41:09 PDT 2016 x86_64 x86_64 x86_64 GNU/Linux
[tmp]$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)
[96766.113014] Killed process 81593 (java) total-vm:19626048kB, anon-rss:1424920kB, file-rss:5144kB
[97047.933383] INFO: task jbd2/sdb1-8:3229 blocked for more than 120 seconds.
[97047.941093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[97047.949879] jbd2/sdb1-8 D ffff88103a587000 0 3229 2 0x00000080
[97047.949883] ffff88104ad1bc90 0000000000000046 ffff88085129d080 ffff88104ad1bfd8
[97047.949886] ffff88104ad1bfd8 ffff88104ad1bfd8 ffff88085129d080 ffff88104ad1bda0
[97047.949888] ffff881040d2e0c0 ffff88085129d080 ffff88104ad1bd88 ffff88103a587000
[97047.949891] Call Trace:
[97047.949899] [<ffffffff8163b7f9>] schedule+0x29/0x70
[97047.949924] [<ffffffffa0201118>] jbd2_journal_commit_transaction+0x248/0x19a0 [jbd2]
[97047.949932] [<ffffffff81013588>] ? __switch_to+0xf8/0x4b0
[97047.949935] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[97047.949941] [<ffffffff8108d6fe>] ? try_to_del_timer_sync+0x5e/0x90
[97047.949947] [<ffffffffa0206d79>] kjournald2+0xc9/0x260 [jbd2]
[97047.949949] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[97047.949953] [<ffffffffa0206cb0>] ? commit_timeout+0x10/0x10 [jbd2]
[97047.949957] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[97047.949959] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[97047.949963] [<ffffffff816467d8>] ret_from_fork+0x58/0x90
[97047.949965] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[97047.950005] INFO: task kworker/u386:2:69088 blocked for more than 120 seconds.
[97047.958096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[97047.966869] kworker/u386:2 D ffff88104ab27120 0 69088 2 0x00000080
[97047.966878] Workqueue: writeback bdi_writeback_workfn (flush-8:16)
[97047.966880] ffff880b8b6638e8 0000000000000046 ffff8810482be780 ffff880b8b663fd8
[97047.966897] ffff880b8b663fd8 ffff880b8b663fd8 ffff8810482be780 ffff881040d2e000
[97047.966900] ffff881040d2e078 00000000000fb086 ffff88103a587000 ffff88104ab27120
[97047.966902] Call Trace:
[97047.966905] [<ffffffff8163b7f9>] schedule+0x29/0x70
[97047.966911] [<ffffffffa01fe085>] wait_transaction_locked+0x85/0xd0 [jbd2]
[97047.966913] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[97047.966917] [<ffffffffa01fe400>] start_this_handle+0x2b0/0x5d0 [jbd2]
[97047.966925] [<ffffffffa02c8f03>] ? _scsih_qcmd+0x2d3/0x420 [mpt3sas]
[97047.966928] [<ffffffff811c153a>] ? kmem_cache_alloc+0x1ba/0x1d0
[97047.966932] [<ffffffffa01fe933>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
[97047.966946] [<ffffffffa0377fc4>] ? ext4_writepages+0x434/0xd60 [ext4]
[97047.966956] [<ffffffffa03a8ea9>] __ext4_journal_start_sb+0x69/0xe0 [ext4]
[97047.966962] [<ffffffffa0377fc4>] ext4_writepages+0x434/0xd60 [ext4]
[97047.966967] [<ffffffff81174c38>] ? generic_writepages+0x58/0x80
[97047.966970] [<ffffffff81175cde>] do_writepages+0x1e/0x40
[97047.966972] [<ffffffff81208a20>] __writeback_single_inode+0x40/0x220
[97047.966974] [<ffffffff8120948e>] writeback_sb_inodes+0x25e/0x420
[97047.966976] [<ffffffff812096ef>] __writeback_inodes_wb+0x9f/0xd0
[97047.966978] [<ffffffff81209f33>] wb_writeback+0x263/0x2f0
[97047.966980] [<ffffffff810a0d96>] ? set_worker_desc+0x86/0xb0
[97047.966982] [<ffffffff8120c005>] bdi_writeback_workfn+0x115/0x460
[97047.966985] [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[97047.966986] [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[97047.966988] [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[97047.966990] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[97047.966992] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[97047.966994] [<ffffffff816467d8>] ret_from_fork+0x58/0x90
[97047.966996] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[97047.967003] INFO: task bridge_exec:79205 blocked for more than 120 seconds.
[97047.974801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
From the crash dump, all the processes stuck in the UN state have a similar backtrace:
PID: 69088 TASK: ffff8810482be780 CPU: 13 COMMAND: "kworker/u386:2"
#0 [ffff880b8b663888] __schedule at ffffffff8163b15d
#1 [ffff880b8b6638f0] schedule at ffffffff8163b7f9
#2 [ffff880b8b663900] wait_transaction_locked at ffffffffa01fe085 [jbd2]
#3 [ffff880b8b663958] start_this_handle at ffffffffa01fe400 [jbd2]
#4 [ffff880b8b6639f8] jbd2__journal_start at ffffffffa01fe933 [jbd2]
#5 [ffff880b8b663a40] __ext4_journal_start_sb at ffffffffa03a8ea9 [ext4]
#6 [ffff880b8b663a80] ext4_writepages at ffffffffa0377fc4 [ext4]
#7 [ffff880b8b663bb8] do_writepages at ffffffff81175cde
#8 [ffff880b8b663bc8] __writeback_single_inode at ffffffff81208a20
#9 [ffff880b8b663c08] writeback_sb_inodes at ffffffff8120948e
#10 [ffff880b8b663cb0] __writeback_inodes_wb at ffffffff812096ef
#11 [ffff880b8b663cf8] wb_writeback at ffffffff81209f33
#12 [ffff880b8b663d70] bdi_writeback_workfn at ffffffff8120c005
#13 [ffff880b8b663e20] process_one_work at ffffffff8109d5fb
#14 [ffff880b8b663e68] worker_thread at ffffffff8109e3cb
#15 [ffff880b8b663ec8] kthread at ffffffff810a5aef
#16 [ffff880b8b663f50] ret_from_fork at ffffffff816467d8
PID: 79204 TASK: ffff88084b050b80 CPU: 21 COMMAND: "bridge_exec"
#0 [ffff880d7795fc70] __schedule at ffffffff8163b15d
#1 [ffff880d7795fcd8] schedule at ffffffff8163b7f9
#2 [ffff880d7795fce8] do_exit at ffffffff81081504
#3 [ffff880d7795fd78] do_group_exit at ffffffff81081dff
#4 [ffff880d7795fda8] get_signal_to_deliver at ffffffff81092c10
#5 [ffff880d7795fe40] do_signal at ffffffff81014417
#6 [ffff880d7795ff30] do_notify_resume at ffffffff81014adf
#7 [ffff880d7795ff50] int_signal at ffffffff81646b3d
RIP: 00007fb2e6b2a9b9 RSP: 00007fff14cbc3d8 RFLAGS: 00000246
RAX: fffffffffffffe00 RBX: 00000000048e2020 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000000489bcd8
RBP: 00007fff14cbc420 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000048e2258
R13: 000000000471d898 R14: 00000000048e2188 R15: 000000000489bcc0
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
Any advice or suggestions would be appreciated!
Best Regards,
Roy