Message-ID: <7fa5a43f-bdd6-9cf1-172a-b2af47239e96@windriver.com>
Date:   Tue, 10 Nov 2020 09:57:39 -0600
From:   Chris Friesen <chris.friesen@...driver.com>
To:     Jan Kara <jack@...e.cz>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: looking for assistance with jbd2 (and other processes) hung
 trying to write to disk

On 11/10/2020 5:42 AM, Jan Kara wrote:
> On Mon 09-11-20 15:11:58, Chris Friesen wrote:

>> Can anyone give some suggestions on how to track down what's causing the
>> delay here?  I suspect there's a race condition somewhere similar to what
>> happened with https://access.redhat.com/solutions/3226391, although that one
>> was specific to device-mapper and the root filesystem here is directly on
>> the nvme device.
> 
> Sadly I don't have access to RH portal to be able to check what that hang
> was about...

They had exactly the same stack trace (different addresses) with 
"jbd2/dm-16-8" trying to commit a journal transaction.  In their case it 
was apparently due to two problems, "the RAID1 code leaking the r1bio", 
and "dm-raid not handling a needed retry scenario".  They fixed it by 
backporting upstream commits.  The kernel we're running should have 
those fixes, and in our case we're operating directly on an nvme device.

>> crash> ps -m 930
>> [0 00:09:11.694] [UN]  PID: 930    TASK: ffffa14b5f9032c0  CPU: 1  COMMAND: "jbd2/nvme2n1p4-"
>>
> 
> Are the tasks below the only ones hanging in D state (UN state in crash)?
> Because I can see processes are waiting for the locked buffer but it is
> unclear who is holding the buffer lock...

No, there are quite a few of them.  I've included them below.  I agree, 
it's not clear who's holding the lock.  Is there a way to find that out?

Just to be sure, I'm looking for whoever has the BH_Lock bit set on the 
buffer_head "b_state" field, right?  I don't see any ownership field the 
way we have for mutexes.  Is there some way to find out who would have 
locked the buffer?
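
For reference, here is a minimal sketch of the check I have in mind on a "b_state" value read out of the dump (for example with crash's "struct buffer_head.b_state <address>"). The bit positions are my assumption from enum bh_state_bits in include/linux/buffer_head.h and should be verified against the source of the kernel that produced the dump. The sample value is hypothetical, not taken from this crashdump.

```python
# Bit positions assumed from enum bh_state_bits in include/linux/buffer_head.h.
# Only the first four flags are listed; verify them against the running kernel.
BH_BITS = ["BH_Uptodate", "BH_Dirty", "BH_Lock", "BH_Req"]


def decode_b_state(b_state):
    """Return the names of the low-order flags set in a b_state value."""
    return [name for bit, name in enumerate(BH_BITS) if b_state & (1 << bit)]


def is_locked(b_state):
    """True if the BH_Lock bit (assumed to be bit 2) is set."""
    return bool(b_state & (1 << BH_BITS.index("BH_Lock")))


if __name__ == "__main__":
    sample = 0x5  # hypothetical value, not from the dump
    print(decode_b_state(sample), is_locked(sample))
```

This only shows that a buffer is locked, not who locked it, which is the part I can't figure out.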

Do you think it would help at all to enable CONFIG_JBD2_DEBUG?

Processes in "UN" state in crashdump:

crash> ps|grep UN
       1      0   1  ffffa14b687d8000  UN   0.0  193616   6620  systemd
     930      2   1  ffffa14b5f9032c0  UN   0.0       0      0  [jbd2/nvme2n1p4-]
    1489      2   1  ffffa14b641f0000  UN   0.0       0      0  [jbd2/dm-0-8]
    1494      2   1  ffffa14b641f2610  UN   0.0       0      0  [jbd2/dm-11-8]
    1523      2   1  ffffa14b64182610  UN   0.0       0      0  [jbd2/dm-1-8]
    1912      1   1  ffffa14b62dc2610  UN   0.0  117868  17568  syslog-ng
   86293      1   1  ffffa14ae4650cb0  UN   0.1 4618100 116664  containerd
   86314      1   1  ffffa14ae2639960  UN   0.1 4618100 116664  containerd
   88019      1   1  ffffa14ae26ad8d0  UN   0.2  651196 210260  safe_timer
   90539      1   1  ffffa13caca3bf70  UN   0.0   25868   2140  fsmond
   94006  93595   1  ffffa14ae31fe580  UN   0.1 13843140 113604  etcd
   95061  93508   1  ffffa14a913e8cb0  UN   0.1  721888 114652  log
   96367      1   1  ffffa14af53f9960  UN   0.0  119404  19084  python
   121292      1   1  ffffa14ae18932c0  UN   0.1 4618100 116664  containerd
   122042      1   1  ffffa14a950a6580  UN   0.0  111680   9496  containerd-shim
   126119  122328  23  ffffa14b3d76a610  UN   0.0       0      0  com.xcg
   126171  122328  47  ffffa14a91571960  UN   0.0       0      0  com.xcg
   126173  122328  23  ffffa14a91573f70  UN   0.0       0      0  com.xcg
   126177  122328  23  ffffa14a91888000  UN   0.0       0      0  com.xcg
   128049  124763  47  ffffa14a964e6580  UN   0.1 1817292  80388  confd
   136938  136924   1  ffffa14b5bb7d8d0  UN   0.0  146256  25672  coredns
   136972  136924   1  ffffa14a9aae2610  UN   0.0  146256  25672  coredns
   136978  136924   1  ffffa14ae2238000  UN   0.0  146256  25672  coredns
   143026  142739   1  ffffa14b035e0000  UN   0.0       0      0  cainjector
   166456  165537  44  ffffa14af3cb8000  UN   0.0  325468  10736  nronmd.xcg
   166712  165537  44  ffffa149a2fecc20  UN   0.0  200116   3728  vpms.xcg
   166725  165537  44  ffffa14962fb6580  UN   0.1 2108336  58176  vrlcb.xcg
   166860  165537  45  ffffa14afd22bf70  UN   0.0  848320  12180  gcci.xcg
   166882  165537  45  ffffa14aff3c58d0  UN   0.0  693256  11624  ndc.xcg
   167556  165537  44  ffffa14929a6cc20  UN   0.0  119604   2612  gcdm.xcg
   170732  122328  23  ffffa1492987bf70  UN   0.0  616660   4348  com.xcg
   170741  122328  46  ffffa1492987cc20  UN   0.0       0      0  com.xcg
   170745  122328  23  ffffa1492987e580  UN   0.0       0      0  com.xcg
   170750  122328  23  ffffa14924d4f230  UN   0.0       0      0  com.xcg
   170774  122328  23  ffffa14924d4bf70  UN   0.0       0      0  com.xcg
   189870  187717  46  ffffa14873591960  UN   0.1  881516  83840  filebeat
   332649  136924   1  ffffa147efd49960  UN   0.0  146256  25672  coredns
   1036113  3779184  23  ffffa13c9317bf70  UN   0.9 6703644 878848  pool-3-thread-1
   1793349      2   1  ffffa14ae2402610  UN   0.0       0      0  [kworker/1:0]
   1850718  166101   0  ffffa14807448cb0  UN   0.0   18724   6068  exe
   1850727  1850722   0  ffffa147e18dd8d0  UN   0.0   18724   6068  exe
   1850733  120305   1  ffffa147e18da610  UN   0.0  135924   6512  runc
   1850792  128006  46  ffffa14ae1948cb0  UN   0.0   21716   1280  logrotate
   1850914  1850911   1  ffffa147086dbf70  UN   0.0   18724   6068  exe
   1851274  127909  46  ffffa14703661960  UN   0.0   53344   3232  redis-server
   1851474  1850787   1  ffffa1470026cc20  UN   0.0  115704   1244  ceph
   1853422  1853340  44  ffffa146dfdc1960  UN   0.0   12396   2312  sh
   1854005      1   1  ffffa146d7d8f230  UN   0.0  116872    812  mkdir
   1854955  2847282   1  ffffa146c5d18cb0  UN   0.0   18724   6068  exe
   1856515  166108   1  ffffa146aa071960  UN   0.0   18724   6068  exe
   1856602  84624   1  ffffa146aa073f70  UN   0.0  184416   1988  crond
   1859661  1859658   1  ffffa14672090000  UN   0.0  116872    812  mkdir
   2232051  165443   7  ffffa147e1ac0000  UN   0.0       0      0  eal-intr-thread
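
In case it helps anyone triaging a list like this, the following is a rough sketch of how the output can be grouped by command name. It assumes the usual crash "ps" column order (PID PPID CPU TASK ST %MEM VSZ RSS COMM) and that each task sits on a single line, so any lines wrapped by the mailer need to be rejoined first. The function name is just something I made up for illustration.

```python
from collections import Counter


def summarize_un(ps_text):
    """Count tasks in UN state per command name from crash `ps` output.

    Assumes columns PID PPID CPU TASK ST %MEM VSZ RSS COMM, one task per line.
    """
    counts = Counter()
    for line in ps_text.splitlines():
        # crash prefixes the active task on each CPU with ">"; drop it.
        fields = line.lstrip().lstrip(">").split()
        if len(fields) >= 9 and fields[4] == "UN":
            counts[" ".join(fields[8:])] += 1
    return counts


SAMPLE = """\
       1      0   1  ffffa14b687d8000  UN   0.0  193616   6620  systemd
     930      2   1  ffffa14b5f9032c0  UN   0.0       0      0  [jbd2/nvme2n1p4-]
   86293      1   1  ffffa14ae4650cb0  UN   0.1 4618100 116664  containerd
   86314      1   1  ffffa14ae2639960  UN   0.1 4618100 116664  containerd
"""

if __name__ == "__main__":
    for comm, n in summarize_un(SAMPLE).most_common():
        print(n, comm)
```

Grouping this way makes it easier to see which workloads are stuck, but it says nothing about which task holds the buffer lock.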


Thanks,
Chris
