linux-ext4 - Re: Tasks stuck jbd2 for a long time

Message-ID: <20230816022851.GH2247938@mit.edu>
Date:   Tue, 15 Aug 2023 22:28:51 -0400
From:   "Theodore Ts'o" <tytso@....edu>
To:     "Bhatnagar, Rishabh" <risbhat@...zon.com>
Cc:     jack@...e.com, linux-ext4@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        "Park, SeongJae" <sjpark@...zon.com>
Subject: Re: Tasks stuck jbd2 for a long time

It would be helpful if you could translate the addresses in the stack
trace to line numbers.  See [1] and the script
./scripts/decode_stacktrace.sh in the kernel sources.  (It is
referenced in the web page at [1].)

[1] https://docs.kernel.org/admin-guide/bug-hunting.html
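
To make clear what each entry in a trace encodes, here is a minimal
Python sketch.  It is not part of the kernel tree and is no substitute
for decode_stacktrace.sh; the names FRAME_RE and parse_frame are made
up for this illustration.  It only splits an entry into symbol, offset
into the function, function size, and module.  Mapping the offset to a
source line still requires the matching vmlinux and module binaries
with debug info.

```python
import re

# Matches entries such as "jbd2_journal_commit_transaction+0x35d/0x1880 [jbd2]".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+(?P<offset>0x[0-9a-fA-F]+)"
    r"/(?P<size>0x[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frame(line):
    """Return the parts of one stack trace entry, or None if the line is not a frame."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),  # bytes from the start of the function
        "size": int(m.group("size"), 16),      # total size of the function in bytes
        "module": m.group("module"),           # None for code built into vmlinux
        "reliable": m.group("unreliable") is None,
    }
```

For the jbd2 frame quoted below, this would report an offset of 0x35d
within a function of size 0x1880 in the jbd2 module.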

Of course, in order to interpret the line numbers, we'll need a
pointer to the git repo of your kernel sources and the git commit ID
you were using that presumably corresponds to 5.10.184-175.731.amzn2.x86_64.

The stack trace I am particularly interested in is the one for the
jbd2/md0-8 task, e.g.:

>       Not tainted 5.10.184-175.731.amzn2.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:jbd2/md0-8      state:D stack:    0 pid: 8068 ppid:     2
> flags:0x00004080
> Call Trace:
> __schedule+0x1f9/0x660
>  schedule+0x46/0xb0
>  jbd2_journal_commit_transaction+0x35d/0x1880 [jbd2]  <--------- line #?
>  ? update_load_avg+0x7a/0x5d0
>  ? add_wait_queue_exclusive+0x70/0x70
>  ? lock_timer_base+0x61/0x80
>  ? kjournald2+0xcf/0x360 [jbd2]
>  kjournald2+0xcf/0x360 [jbd2]

Most of the other stack traces you referenced are tasks that are
waiting for the transaction commit to complete so they can proceed with
some file system operation.  The stack traces which have
start_this_handle() in them are examples of this going on.  Stack
traces of tasks that do *not* have start_this_handle() would be
especially interesting.
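
One way to pick those tasks out of a long hung-task log is sketched
below in Python.  This is only an illustration under some assumptions:
the function name tasks_not_waiting_on_handle is hypothetical, each
report is assumed to start with a "task:NAME ..." line as in the trace
quoted above, and task names are assumed not to contain spaces.

```python
def tasks_not_waiting_on_handle(log_text, marker="start_this_handle"):
    """Return names of tasks whose stack trace does not mention `marker`."""
    tasks = []  # list of (header line, [lines belonging to that task])
    for raw in log_text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace.
        line = raw.lstrip("> \t").rstrip()
        if line.startswith("task:"):
            tasks.append((line, []))
        elif tasks:
            tasks[-1][1].append(line)

    names = []
    for header, frames in tasks:
        if not any(marker in frame for frame in frames):
            # The header looks like "task:jbd2/md0-8   state:D ...".
            names.append(header.split()[0][len("task:"):])
    return names
```

Run over the full set of hung-task reports, this would list the tasks
that are blocked somewhere other than in start_this_handle(), which
are the ones worth looking at first.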

The question is why the commit thread is blocking, and on what.  It
could be blocking on some I/O; or on some memory allocation; or waiting
for some process with an open transaction handle to close it.  The line
number of the jbd2 thread in fs/jbd2/commit.c will give us at least a
partial answer to that question.  Of course, then we'll need to answer
the next question --- why is the I/O blocked?  Or why is the memory
allocation not completing?  Etc.

I could speculate (for example, perhaps some memory allocation is being
made without GFP_NOFS, and this is causing a deadlock: the memory
allocation code is trying to initiate writeback, but the writeback is
blocked on the transaction commit completing).  But without knowing
what jbd2_journal_commit_transaction() is blocking on at
jbd2_journal_commit_transaction+0x35d/0x1880, that would be just a
guess.

Cheers,

						- Ted