Message-ID: <7fa5a43f-bdd6-9cf1-172a-b2af47239e96@windriver.com>
Date: Tue, 10 Nov 2020 09:57:39 -0600
From: Chris Friesen <chris.friesen@...driver.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org
Subject: Re: looking for assistance with jbd2 (and other processes) hung
trying to write to disk
On 11/10/2020 5:42 AM, Jan Kara wrote:
> On Mon 09-11-20 15:11:58, Chris Friesen wrote:
>> Can anyone give some suggestions on how to track down what's causing the
>> delay here? I suspect there's a race condition somewhere similar to what
>> happened with https://access.redhat.com/solutions/3226391, although that one
>> was specific to device-mapper and the root filesystem here is directly on
>> the nvme device.
>
> Sadly I don't have access to RH portal to be able to check what that hang
> was about...
They had exactly the same stack trace (different addresses) with
"jbd2/dm-16-8" trying to commit a journal transaction. In their case it
was apparently due to two problems: "the RAID1 code leaking the r1bio"
and "dm-raid not handling a needed retry scenario". They fixed it by
backporting upstream commits. The kernel we're running should already
have those fixes, and in our case we're operating directly on an nvme
device rather than on device-mapper.
>> crash> ps -m 930
>> [0 00:09:11.694] [UN]  PID: 930  TASK: ffffa14b5f9032c0  CPU: 1  COMMAND: "jbd2/nvme2n1p4-"
>>
>
> Are the tasks below the only ones hanging in D state (UN state in crash)?
> Because I can see processes are waiting for the locked buffer but it is
> unclear who is holding the buffer lock...
No, there are quite a few of them; I've included the full list below.
I agree that it's not clear who's holding the lock. Just to be sure,
I'm looking for whoever has the BH_Lock bit set in the buffer_head
"b_state" field, right? I don't see an owner field the way we have for
mutexes, so is there a way to track down who locked the buffer?
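
For reference, here's my rough reading of how the lock bit is encoded,
paraphrased from include/linux/buffer_head.h (so the exact bit values
below are my interpretation and worth double-checking against our
kernel tree):

enum bh_state_bits {
        BH_Uptodate,    /* 0: buffer contains valid data */
        BH_Dirty,       /* 1: buffer is dirty */
        BH_Lock,        /* 2: buffer is locked */
        BH_Req,         /* 3: buffer has been submitted for I/O */
        /* ... */
};

/*
 * As I understand it, buffer_locked(bh) just tests BH_Lock in
 * bh->b_state, and lock_buffer()/unlock_buffer() set and clear that
 * same bit, so a locked buffer in the dump should show bit 2 set in
 * b_state (b_state & 0x4) -- but nothing records *who* set it.
 */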
Do you think it would help at all to enable CONFIG_JBD2_DEBUG?
Processes in "UN" state in the crash dump:
crash> ps|grep UN
      1        0   1  ffffa14b687d8000  UN  0.0    193616    6620  systemd
    930        2   1  ffffa14b5f9032c0  UN  0.0         0       0  [jbd2/nvme2n1p4-]
   1489        2   1  ffffa14b641f0000  UN  0.0         0       0  [jbd2/dm-0-8]
   1494        2   1  ffffa14b641f2610  UN  0.0         0       0  [jbd2/dm-11-8]
   1523        2   1  ffffa14b64182610  UN  0.0         0       0  [jbd2/dm-1-8]
   1912        1   1  ffffa14b62dc2610  UN  0.0    117868   17568  syslog-ng
  86293        1   1  ffffa14ae4650cb0  UN  0.1   4618100  116664  containerd
  86314        1   1  ffffa14ae2639960  UN  0.1   4618100  116664  containerd
  88019        1   1  ffffa14ae26ad8d0  UN  0.2    651196  210260  safe_timer
  90539        1   1  ffffa13caca3bf70  UN  0.0     25868    2140  fsmond
  94006    93595   1  ffffa14ae31fe580  UN  0.1  13843140  113604  etcd
  95061    93508   1  ffffa14a913e8cb0  UN  0.1    721888  114652  log
  96367        1   1  ffffa14af53f9960  UN  0.0    119404   19084  python
 121292        1   1  ffffa14ae18932c0  UN  0.1   4618100  116664  containerd
 122042        1   1  ffffa14a950a6580  UN  0.0    111680    9496  containerd-shim
 126119   122328  23  ffffa14b3d76a610  UN  0.0         0       0  com.xcg
 126171   122328  47  ffffa14a91571960  UN  0.0         0       0  com.xcg
 126173   122328  23  ffffa14a91573f70  UN  0.0         0       0  com.xcg
 126177   122328  23  ffffa14a91888000  UN  0.0         0       0  com.xcg
 128049   124763  47  ffffa14a964e6580  UN  0.1   1817292   80388  confd
 136938   136924   1  ffffa14b5bb7d8d0  UN  0.0    146256   25672  coredns
 136972   136924   1  ffffa14a9aae2610  UN  0.0    146256   25672  coredns
 136978   136924   1  ffffa14ae2238000  UN  0.0    146256   25672  coredns
 143026   142739   1  ffffa14b035e0000  UN  0.0         0       0  cainjector
 166456   165537  44  ffffa14af3cb8000  UN  0.0    325468   10736  nronmd.xcg
 166712   165537  44  ffffa149a2fecc20  UN  0.0    200116    3728  vpms.xcg
 166725   165537  44  ffffa14962fb6580  UN  0.1   2108336   58176  vrlcb.xcg
 166860   165537  45  ffffa14afd22bf70  UN  0.0    848320   12180  gcci.xcg
 166882   165537  45  ffffa14aff3c58d0  UN  0.0    693256   11624  ndc.xcg
 167556   165537  44  ffffa14929a6cc20  UN  0.0    119604    2612  gcdm.xcg
 170732   122328  23  ffffa1492987bf70  UN  0.0    616660    4348  com.xcg
 170741   122328  46  ffffa1492987cc20  UN  0.0         0       0  com.xcg
 170745   122328  23  ffffa1492987e580  UN  0.0         0       0  com.xcg
 170750   122328  23  ffffa14924d4f230  UN  0.0         0       0  com.xcg
 170774   122328  23  ffffa14924d4bf70  UN  0.0         0       0  com.xcg
 189870   187717  46  ffffa14873591960  UN  0.1    881516   83840  filebeat
 332649   136924   1  ffffa147efd49960  UN  0.0    146256   25672  coredns
1036113  3779184  23  ffffa13c9317bf70  UN  0.9   6703644  878848  pool-3-thread-1
1793349        2   1  ffffa14ae2402610  UN  0.0         0       0  [kworker/1:0]
1850718   166101   0  ffffa14807448cb0  UN  0.0     18724    6068  exe
1850727  1850722   0  ffffa147e18dd8d0  UN  0.0     18724    6068  exe
1850733   120305   1  ffffa147e18da610  UN  0.0    135924    6512  runc
1850792   128006  46  ffffa14ae1948cb0  UN  0.0     21716    1280  logrotate
1850914  1850911   1  ffffa147086dbf70  UN  0.0     18724    6068  exe
1851274   127909  46  ffffa14703661960  UN  0.0     53344    3232  redis-server
1851474  1850787   1  ffffa1470026cc20  UN  0.0    115704    1244  ceph
1853422  1853340  44  ffffa146dfdc1960  UN  0.0     12396    2312  sh
1854005        1   1  ffffa146d7d8f230  UN  0.0    116872     812  mkdir
1854955  2847282   1  ffffa146c5d18cb0  UN  0.0     18724    6068  exe
1856515   166108   1  ffffa146aa071960  UN  0.0     18724    6068  exe
1856602    84624   1  ffffa146aa073f70  UN  0.0    184416    1988  crond
1859661  1859658   1  ffffa14672090000  UN  0.0    116872     812  mkdir
2232051   165443   7  ffffa147e1ac0000  UN  0.0         0       0  eal-intr-thread
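
If it would help, I can also collect the stack traces for everything
stuck in UN state; as I understand it, crash's "foreach" accepts a
task-state specifier, so something along these lines should work (the
output file name is just an example):

crash> foreach UN bt > un-task-stacks.txt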
Thanks,
Chris