Message-ID: <d82df68eb8514951a7f7acc923132796@amazon.com>
Date:   Fri, 18 Aug 2023 01:31:35 +0000
From:   "Lu, Davina" <davinalu@...zon.com>
To:     "Bhatnagar, Rishabh" <risbhat@...zon.com>, Jan Kara <jack@...e.cz>
CC:     Theodore Ts'o <tytso@....edu>, "jack@...e.com" <jack@...e.com>,
        "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        "Park, SeongJae" <sjpark@...zon.com>
Subject: RE: Tasks stuck jbd2 for a long time

Hi Bhatnagar, 

This looks similar to an issue I saw before with an fio test (buffered IO with 100 threads): there too, the "ext4-rsv-conversion" workqueue took a lot of CPU and stalled journal updates. It was stuck at:
[<0>] do_get_write_access+0x291/0x350 [jbd2]
[<0>] jbd2_journal_get_write_access+0x67/0x90 [jbd2]
[<0>] __ext4_journal_get_write_access+0x44/0x90 [ext4]
[<0>] ext4_reserve_inode_write+0x83/0xc0 [ext4]
[<0>] __ext4_mark_inode_dirty+0x50/0x120 [ext4]
[<0>] ext4_convert_unwritten_extents+0x179/0x220 [ext4]
[<0>] ext4_convert_unwritten_io_end_vec+0x64/0xe0 [ext4]
[<0>] ext4_do_flush_completed_IO.isra.0+0xf5/0x190 [ext4]
[<0>] process_one_work+0x1b0/0x350
[<0>] worker_thread+0x49/0x310
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

And the lock statistics pointed at j_state_lock:
&journal->j_state_lock-W:  waittime-avg  178.75us, holdtime-avg  4.84us
&journal->j_state_lock-R:  waittime-avg 1269.72us, holdtime-avg 11.07us

Here is a patch; can you check whether it is the same issue? This is not the final patch, since Ted raised some concerns about it; I will forward that email to you in a separate thread. I didn't pursue this patch further at the time because we thought it might not be the real cause in RDS.

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 091db733834e..b3c7544798b8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5212,7 +5213,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
         * concurrency isn't really necessary.  Limit it to 1.
         */
        EXT4_SB(sb)->rsv_conversion_wq =
-               alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+               alloc_workqueue("ext4-rsv-conversion",
+                               WQ_MEM_RECLAIM | WQ_UNBOUND | __WQ_ORDERED,
+                               num_active_cpus() > 1 ? num_active_cpus() : 1);

Thanks
Davina
-----Original Message-----
From: Bhatnagar, Rishabh <risbhat@...zon.com> 
Sent: Friday, August 18, 2023 4:59 AM
To: Jan Kara <jack@...e.cz>
Cc: Theodore Ts'o <tytso@....edu>; jack@...e.com; linux-ext4@...r.kernel.org; linux-kernel@...r.kernel.org; gregkh@...uxfoundation.org; Park, SeongJae <sjpark@...zon.com>
Subject: Re: Tasks stuck jbd2 for a long time


On 8/17/23 3:49 AM, Jan Kara wrote:
>
> On Wed 16-08-23 15:53:05, Bhatnagar, Rishabh wrote:
>> I collected dump and looked at some processes that were stuck in 
>> uninterruptible sleep.These are from upstream stable tree:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree
>> /?h=linux-5.10.y
>> (5.10.191)
>>
>> One of them is the journal thread that is waiting for some other 
>> thread to close transaction handle.
>>
>> PID: 10642  TASK: ffff9768823f4000  CPU: 37  COMMAND: "jbd2/md0-8"
>>   #0 [ffffbd6c40c17c60] __schedule+617 at ffffffffbb912df9
>>   #1 [ffffbd6c40c17cf8] schedule+60 at ffffffffbb91330c
>>   #2 [ffffbd6c40c17d08] jbd2_journal_commit_transaction+877 at 
>> ffffffffc016b90d [jbd2] (/home/ec2-user/linux/fs/jbd2/commit.c:497)
>>   #3 [ffffbd6c40c17ea0] kjournald2+282 at ffffffffc01723ba [jbd2]
>> (/home/ec2-user/linux/fs/jbd2/journal.c:214)
>>   #4 [ffffbd6c40c17f10] kthread+279 at ffffffffbb0b9167
>>   #5 [ffffbd6c40c17f50] ret_from_fork+34 at ffffffffbb003802
> Yes, correct. This is waiting for transaction->t_updates to drop to 0.
>
>> One of threads that have started the handle and waiting for journal 
>> to commit and unlock the current transaction. This stack only shows 
>> ext4lazyinit but with lazyinit disabled we have seen other threads 
>> stuck in same place.
>>
>> PID: 10644  TASK: ffff976901010000  CPU: 37  COMMAND: "ext4lazyinit"
>>   #0 [ffffbd6c40c1fbe0] __schedule+617 at ffffffffbb912df9
>>   #1 [ffffbd6c40c1fc78] schedule+60 at ffffffffbb91330c
>>   #2 [ffffbd6c40c1fc88] wait_transaction_locked+137 at 
>> ffffffffc0168089 [jbd2] (/home/ec2-user/linux/fs/jbd2/transaction.c:184)
>>   #3 [ffffbd6c40c1fcd8] add_transaction_credits+62 at 
>> ffffffffc016813e [jbd2]
>> (/home/ec2-user/linux/fs/jbd2/transaction.c:241)
>>   #4 [ffffbd6c40c1fd30] start_this_handle+533 at ffffffffc0168615 
>> [jbd2]
>> (/home/ec2-user/linux/fs/jbd2/transaction.c:416)
>>   #5 [ffffbd6c40c1fdc0] jbd2__journal_start+244 at ffffffffc0168dc4 [jbd2]
>>   #6 [ffffbd6c40c1fe00] __ext4_journal_start_sb+250 at 
>> ffffffffc02ef65a [ext4]
>>   #7 [ffffbd6c40c1fe40] ext4_init_inode_table+190 at ffffffffc0302ace [ext4]
>>   #8 [ffffbd6c40c1feb0] ext4_lazyinit_thread+906 at ffffffffc033ec9a [ext4]
>>   #9 [ffffbd6c40c1ff10] kthread+279 at ffffffffbb0b9167
>> #10 [ffffbd6c40c1ff50] ret_from_fork+34 at ffffffffbb003802
> This thread actually didn't start a transaction. It is *trying* to 
> start a transaction but it has failed and we are now waiting for 
> transaction commit to proceed (i.e., for jbd2/md0-8 process). So this 
> isn't the process jbd2 is waiting for.
>
>> To replicate the download scenario i'm just using dd to copy random 
>> data to disk. I launch a bunch of threads and try to stress the 
>> system. Many of those threads seem to be stuck in 
>> balance_dirty_pages_ratelimited as can be seen below.
>>
>> PID: 10709  TASK: ffff9769016f8000  CPU: 25  COMMAND: "dd"
>>   #0 [ffffbd6c40dafa48] __schedule+617 at ffffffffbb912df9
>>   #1 [ffffbd6c40dafae0] schedule+60 at ffffffffbb91330c
>>   #2 [ffffbd6c40dafaf0] schedule_timeout+570 at ffffffffbb916a7a
>>   #3 [ffffbd6c40dafb68] io_schedule_timeout+25 at ffffffffbb913619 
>> ((inlined
>> by) io_schedule_finish at /home/ec2-user/linux/kernel/sched/core.c:6274)
>>   #4 [ffffbd6c40dafb80] balance_dirty_pages+654 at ffffffffbb2367ce
>> (/home/ec2-user/linux/mm/page-writeback.c:1799)
>>   #5 [ffffbd6c40dafcf0] balance_dirty_pages_ratelimited+763 at 
>> ffffffffbb23752b  (/home/ec2-user/linux/mm/page-writeback.c:1926)
>>   #6 [ffffbd6c40dafd18] generic_perform_write+308 at ffffffffbb22af44
>> (/home/ec2-user/linux/mm/filemap.c:3370)
>>   #7 [ffffbd6c40dafd88] ext4_buffered_write_iter+161 at 
>> ffffffffc02fcba1 [ext4] (/home/ec2-user/linux/fs/ext4/file.c:273)
>>   #8 [ffffbd6c40dafdb8] ext4_file_write_iter+96 at ffffffffc02fccf0 [ext4]
>>   #9 [ffffbd6c40dafe40] new_sync_write+287 at ffffffffbb2e0c0f
>> #10 [ffffbd6c40dafec8] vfs_write+481 at ffffffffbb2e3161
>> #11 [ffffbd6c40daff00] ksys_write+165 at ffffffffbb2e3385
>> #12 [ffffbd6c40daff40] do_syscall_64+51 at ffffffffbb906213
>> #13 [ffffbd6c40daff50] entry_SYSCALL_64_after_hwframe+103 at 
>> ffffffffbba000df
> Yes, this is waiting for page writeback to reduce amount of dirty 
> pages in the pagecache. We are not holding transaction handle during 
> this wait so this is also not the task jbd2 is waiting for.
>
>> There are other dd threads that are trying to read and are handling 
>> page fault. These are in runnable state and not uninterruptible sleep.
>>
>> PID: 14581  TASK: ffff97c3cfdbc000  CPU: 29  COMMAND: "dd"
>>   #0 [ffffbd6c4a1d3598] __schedule+617 at ffffffffbb912df9
>>   #1 [ffffbd6c4a1d3630] _cond_resched+38 at ffffffffbb9133e6
>>   #2 [ffffbd6c4a1d3638] shrink_page_list+126 at ffffffffbb2412fe
>>   #3 [ffffbd6c4a1d36c8] shrink_inactive_list+478 at ffffffffbb24441e
>>   #4 [ffffbd6c4a1d3768] shrink_lruvec+957 at ffffffffbb244e3d
>>   #5 [ffffbd6c4a1d3870] shrink_node+552 at ffffffffbb2452a8
>>   #6 [ffffbd6c4a1d38f0] do_try_to_free_pages+201 at ffffffffbb245829
>>   #7 [ffffbd6c4a1d3940] try_to_free_pages+239 at ffffffffbb246c0f
>>   #8 [ffffbd6c4a1d39d8] __alloc_pages_slowpath.constprop.114+913 at
>> ffffffffbb28d741
>>   #9 [ffffbd6c4a1d3ab8] __alloc_pages_nodemask+679 at 
>> ffffffffbb28e2e7
>> #10 [ffffbd6c4a1d3b28] alloc_pages_vma+124 at ffffffffbb2a734c
>> #11 [ffffbd6c4a1d3b68] handle_mm_fault+3999 at ffffffffbb26de2f
>> #12 [ffffbd6c4a1d3c28] exc_page_fault+708 at ffffffffbb909c84
>> #13 [ffffbd6c4a1d3c80] asm_exc_page_fault+30 at ffffffffbba00b4e
>>   #14 [ffffbd6c4a1d3d30] copyout+28 at ffffffffbb5160bc
>> #15 [ffffbd6c4a1d3d38] _copy_to_iter+158 at ffffffffbb5188de
>> #16 [ffffbd6c4a1d3d98] get_random_bytes_user+136 at ffffffffbb644608
>> #17 [ffffbd6c4a1d3e48] new_sync_read+284 at ffffffffbb2e0a5c
>> #18 [ffffbd6c4a1d3ed0] vfs_read+353 at ffffffffbb2e2f51
>> #19 [ffffbd6c4a1d3f00] ksys_read+165 at ffffffffbb2e3265
>> #20 [ffffbd6c4a1d3f40] do_syscall_64+51 at ffffffffbb906213
>> #21 [ffffbd6c4a1d3f50] entry_SYSCALL_64_after_hwframe+103 at 
>> ffffffffbba000df
> This process is in direct reclaim trying to free more memory. It 
> doesn't have transaction handle started so jbd2 also isn't waiting for 
> this process.
>
>>>> Can low available memory be a reason for a thread to not be able to 
>>>> close the transaction handle for a long time?
>>>> Maybe some writeback thread starts the handle but is not able to 
>>>> complete writeback?
>>> Well, even that would be a bug but low memory conditions are 
>>> certainly some of less tested paths so it is possible there's a bug lurking there.
>> Amongst the things we have tested 2 things seem to give good improvements.
>>
>> One is disabling journalling. We don't see any stuck tasks. System 
>> becomes slow but eventually recovers. But its not something we want to disable.
>>
>> Other is enabling swap memory. Adding some swap memory also avoids 
>> system going into low memory state and system doesn't freeze.
> OK, these are just workarounds. The question really is which process 
> holds the transaction handle jbd2 thread is waiting for. It is none of 
> the processes you have shown above. Since you have the crashdump, you 
> can also search all the processes and find those which have non-zero
> task->journal_info. And from these processes you can select those 
> task->where journal_info points to an object from jbd2_handle_cache 
> task->and then you
> can verify whether the handles indeed point (through 
> handle->h_transaction) to the transaction jbd2 thread is trying to 
> commit. After you've identified such task it is interesting to see what is it doing...
>
Hi Jan

I think I found the thread that is holding the transaction handle. It seems to be in a runnable state, though.
Its journal_info points to a journal handle whose transaction matches the journal's running transaction.
Here is the associated stack trace. It is converting unwritten extents to extents.

PID: 287    TASK: ffff976801890000  CPU: 20  COMMAND: "kworker/u96:35"
  #0 [ffffbd6c40b3f498] __schedule+617 at ffffffffbb912df9
  #1 [ffffbd6c40b3f530] _cond_resched+38 at ffffffffbb9133e6
  #2 [ffffbd6c40b3f538] shrink_lruvec+670 at ffffffffbb244d1e
  #3 [ffffbd6c40b3f640] _cond_resched+21 at ffffffffbb9133d5
  #4 [ffffbd6c40b3f648] shrink_node+552 at ffffffffbb2452a8
  #5 [ffffbd6c40b3f6c8] do_try_to_free_pages+201 at ffffffffbb245829
  #6 [ffffbd6c40b3f718] try_to_free_pages+239 at ffffffffbb246c0f
  #7 [ffffbd6c40b3f7b0] __alloc_pages_slowpath.constprop.114+913 at
ffffffffbb28d741
  #8 [ffffbd6c40b3f890] __alloc_pages_nodemask+679 at ffffffffbb28e2e7
  #9 [ffffbd6c40b3f900] allocate_slab+726 at ffffffffbb2b0886
#10 [ffffbd6c40b3f958] ___slab_alloc+1173 at ffffffffbb2b3ff5
#11 [ffffbd6c40b3f988] insert_revoke_hash+37 at ffffffffc016f435 [jbd2]
(/home/ec2-user/linux/fs/jbd2/revoke.c:146)
#12 [ffffbd6c40b3f9b8] kmem_cache_free+924 at ffffffffbb2b712c )(inlined
by) slab_alloc at /home/ec2-user/linux/mm/slub.c:2904)
#13 [ffffbd6c40b3fa18] insert_revoke_hash+37 at ffffffffc016f435 [jbd2]
(/home/ec2-user/linux/fs/jbd2/revoke.c:146)
#14 [ffffbd6c40b3fa40] kmem_cache_alloc+928 at ffffffffbb2b4590
(/home/ec2-user/linux/mm/slub.c:290)
#15 [ffffbd6c40b3fa78] insert_revoke_hash+37 at ffffffffc016f435 [jbd2]
(/home/ec2-user/linux/fs/jbd2/revoke.c:146)
#16 [ffffbd6c40b3faa0] __ext4_forget+338 at ffffffffc02efb32 [ext4]
(/home/ec2-user/linux/fs/ext4/ext4_jbd2.c:298)
#17 [ffffbd6c40b3fae0] ext4_free_blocks+2437 at ffffffffc031fd55 [ext4]
(/home/ec2-user/linux/fs/ext4/mballoc.c:5709 (discriminator 2))
#18 [ffffbd6c40b3fbb0] ext4_ext_handle_unwritten_extents+596 at
ffffffffc02f56a4 [ext4] ((inlined by) ext4_ext_handle_unwritten_extents at /home/ec2-user/linux/fs/ext4/extents.c:3892)
#19 [ffffbd6c40b3fc98] ext4_ext_map_blocks+1325 at ffffffffc02f710d [ext4] (/home/ec2-user/linux/fs/ext4/extents.c:4165)
#20 [ffffbd6c40b3fd60] ext4_map_blocks+813 at ffffffffc030bd0d [ext4]
(/home/ec2-user/linux/fs/ext4/inode.c:659)
#21 [ffffbd6c40b3fdd0] ext4_convert_unwritten_extents+303 at ffffffffc02f8adf [ext4] (/home/ec2-user/linux/fs/ext4/extents.c:4810)
#22 [ffffbd6c40b3fe28] ext4_convert_unwritten_io_end_vec+95 at ffffffffc02f8c5f [ext4] (/home/ec2-user/linux/fs/ext4/extents.c:4850)
#23 [ffffbd6c40b3fe58] ext4_end_io_rsv_work+269 at ffffffffc032c3fd [ext4] ((inlined by) ext4_do_flush_completed_IO at
/home/ec2-user/linux/fs/ext4/page-io.c:262)
#24 [ffffbd6c40b3fe98] process_one_work+405 at ffffffffbb0b2725
#25 [ffffbd6c40b3fed8] worker_thread+48 at ffffffffbb0b2920
#26 [ffffbd6c40b3ff10] kthread+279 at ffffffffbb0b9167
#27 [ffffbd6c40b3ff50] ret_from_fork+34 at ffffffffbb003802
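
For anyone repeating Jan's search on a crashdump, it can be driven from crash(8) roughly like this (a sketch only; option syntax and structure layouts may vary across crash and kernel versions):

```
crash> foreach task -R journal_info        # journal_info member for every task; non-NULL may be a held handle
crash> kmem <journal_info value>           # confirm the pointer belongs to the jbd2_handle_cache slab
crash> struct jbd2_journal_handle.h_transaction <journal_info value>
crash> struct journal_s.j_running_transaction <journal address>   # compare against h_transaction
```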

Thanks
Rishabh
