linux-kernel - Re: [PATCH v3] sched/core: Always flush pending blk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220709161956.GA20046@windriver.com>
Date:   Sat, 9 Jul 2022 12:19:56 -0400
From:   Paul Gortmaker <paul.gortmaker@...driver.com>
To:     John Keeping <john@...anate.com>
Cc:     linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH v3] sched/core: Always flush pending blk_plug

[[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote:

> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
> 
> 	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> 	      Not tainted 5.15.49-rt46 #1
> 	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
> 	Workqueue: writeback wb_workfn (flush-7:0)

This is great.  I was on-and-off looking at an ext4 deadlock triggered
by LTP's "ftest01" - and was just potting my notes together this morning
to clarify for myself what I thought was going on before I forgot.

Which is good, because I then could recognize the very similar patterns
in what you saw to what I'd observed/deduced.  Otherise I'd probably not
have had your backtraces grab my attention at all.

Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue.  So...

Tested-by: Paul Gortmaker <paul.gortmaker@...driver.com>

As I'd already put together my ext4 debug info, I'll include it below
in case it helps future people searching for LTP and ftest01.

Thanks,
Paul.

	-----------------


There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
"ftest01" 

Reproducer details are at the bottom.  What seems to happen is as follows:

The kworker doing writeback wb_workfn (pid 213 below) stalls on the 
one and only down_write in ext4_map_blocks() and never moves again.

      /*
       * New blocks allocate and/or writing to unwritten extent
       * will possibly result in updating i_data, so we take
       * the write lock of i_data_sem, and call get_block()
       * with create == 1 flag.
       */
      down_write(&EXT4_I(inode)->i_data_sem);

So who is blocking the kworker?  There are two of the ftest01 processes
(pid 818 and pid 821) doing ext4 truncate which look like this:

  ext4_truncate+0x46a/0x660
   --> ext4_ext_truncate+0xd6/0xf0
       --> ext4_ext_remove_space+0x59f/0x1830
           --> jbd2_journal_get_write_access+0x51/0x80
               --> do_get_write_access+0x2ad/0x550
                   --> complete_all+0x60/0x60
                       --> bit_wait_io+0x11/0x60

They too are stalled in D state waiting there forever - presumably
waiting on the kwowrker to make the progress they are waiting for.

Here is where I think lockdep shows us the problem:

  5 locks held by ftest01/818:
   [...]
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

  5 locks held by ftest01/821:
   [...]
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Yes - that is the one the kworker was trying the down_write() on.

Also note that the bit_wait is a big telltale sign here - there is only
one in the jbd2's do_get_write_access() function:

      /*
       * There is one case we have to be very careful about.  If the
       * committing transaction is currently writing this buffer out to disk
       * and has NOT made a copy-out, then we cannot modify the buffer
       * contents at all right now.  The essence of copy-out is that it is
       * the extra copy, not the primary copy, which gets journaled.  If the
       * primary copy is already going to disk then we cannot do copy-out
       * here.
       */
      if (buffer_shadow(bh)) {
              JBUFFER_TRACE(jh, "on shadow: sleep");
              spin_unlock(&jh->b_state_lock);
              wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
              goto repeat;
      }

Presumaby this is a corner case that preempt-rt makes easier to trigger?

The full trace with interspersed lockdep info follows.

Various detailed reproducer information follows that.

	-----------------------

sysrq-w backtracce information interspersed with sysrq-d lockdep info
on 5.19-rc2-rt3 kernel:

 task:kworker/u129:34 state:D stack:    0 pid:  213 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-7:0)
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __this_cpu_preempt_check+0x13/0x20
  ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
  schedule+0x48/0xc0
  __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
  rwbase_write_lock+0x98/0x6e0
  down_write+0x46/0x50
  ? ext4_map_blocks+0x156/0x5c0
  ext4_map_blocks+0x156/0x5c0
  ext4_writepages+0xa1a/0x1290
  ? __this_cpu_preempt_check+0x13/0x20
  ? lockdep_hardirqs_on+0xcd/0x150
  do_writepages+0xd2/0x1a0
  __writeback_single_inode+0x64/0x850
  writeback_sb_inodes+0x22d/0x5b0
  __writeback_inodes_wb+0x67/0xe0
  wb_writeback+0x269/0x5f0
  ? get_nr_inodes+0x49/0x70
  wb_workfn+0x43d/0x780
  ? __this_cpu_preempt_check+0x13/0x20
  process_one_work+0x286/0x710
  worker_thread+0x3c/0x3d0
  ? process_one_work+0x710/0x710
  kthread+0x13b/0x150
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

  6 locks held by kworker/u129:34/213:
   #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
   #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
   #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
   #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
   #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0


 task:jbd2/loop0-8    state:D stack:    0 pid:  679 ppid:     2 flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  schedule+0x48/0xc0
  jbd2_journal_wait_updates+0x7c/0xf0
  ? schedstat_stop+0x10/0x10
  jbd2_journal_commit_transaction+0x284/0x20e0
  ? __this_cpu_preempt_check+0x13/0x20
  ? find_held_lock+0x35/0xa0
  ? _raw_spin_unlock_irqrestore+0x86/0x8f
  ? _raw_spin_unlock_irqrestore+0x86/0x8f
  ? try_to_del_timer_sync+0x53/0x80
  ? trace_hardirqs_on+0x3b/0x120
  ? try_to_del_timer_sync+0x53/0x80
  kjournald2+0xd5/0x2b0
  ? schedstat_stop+0x10/0x10
  ? commit_timeout+0x20/0x20
  kthread+0x13b/0x150
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>


 task:ftest01         state:D stack:    0 pid:  817 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/817:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  818 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? bit_wait+0x60/0x60
  schedule+0x48/0xc0
  io_schedule+0x16/0x40
  bit_wait_io+0x11/0x60
  __wait_on_bit_lock+0x5e/0xb0
  out_of_line_wait_on_bit_lock+0x93/0xb0
  ? complete_all+0x60/0x60
  __lock_buffer+0x3f/0x50
  do_get_write_access+0x2ad/0x550
  jbd2_journal_get_write_access+0x51/0x80
  __ext4_journal_get_write_access+0xf9/0x1a0
  ext4_ext_get_access.isra.32+0x34/0x50
  ext4_ext_remove_space+0x59f/0x1830
  ? ext4_ext_truncate+0xa2/0xf0
  ? ext4_ext_truncate+0x5e/0xf0
  ext4_ext_truncate+0xd6/0xf0
  ext4_truncate+0x46a/0x660
  ext4_setattr+0x6db/0xa40
  notify_change+0x3e5/0x540
  do_truncate+0x7b/0xc0
  ? do_truncate+0x7b/0xc0
  do_sys_ftruncate+0x164/0x2f0
  __x64_sys_ftruncate+0x1b/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae3f7b
 RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
 RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
 RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
 RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
 R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
  </TASK>

  5 locks held by ftest01/818:
   #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
   #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
   #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
   #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660


 task:ftest01         state:D stack:    0 pid:  819 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/819:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  820 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/820:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  821 ppid:   816 flags:0x00000002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? bit_wait+0x60/0x60
  schedule+0x48/0xc0
  io_schedule+0x16/0x40
  bit_wait_io+0x11/0x60
  __wait_on_bit_lock+0x5e/0xb0
  out_of_line_wait_on_bit_lock+0x93/0xb0
  ? complete_all+0x60/0x60
  __lock_buffer+0x3f/0x50
  do_get_write_access+0x2ad/0x550
  jbd2_journal_get_write_access+0x51/0x80
  __ext4_journal_get_write_access+0xf9/0x1a0
  ext4_ext_get_access.isra.32+0x34/0x50
  ext4_ext_remove_space+0x59f/0x1830
  ? ext4_ext_truncate+0xa2/0xf0
  ? ext4_ext_truncate+0x5e/0xf0
  ext4_ext_truncate+0xd6/0xf0
  ext4_truncate+0x46a/0x660
  ext4_setattr+0x6db/0xa40
  notify_change+0x3e5/0x540
  do_truncate+0x7b/0xc0
  ? do_truncate+0x7b/0xc0
  do_sys_ftruncate+0x164/0x2f0
  __x64_sys_ftruncate+0x1b/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae3f7b
 RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
 RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
 RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
 RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
 R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
  </TASK>

  5 locks held by ftest01/821:
   #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
   #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
   #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
   #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

		--------------------------------------

Reproducer:

Get latest preempt-rt kernel:
	git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
	checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
	make defconfig
	manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
	make oldconfig and take defaults for other options hidden behind EXPERT
	can optionally enable LOCKDEP options so sysrq-d works.

Get LTP ; build + install
	https://github.com/linux-test-project/ltp.git

Boot and then run reproducer:
   dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
   mkfs.ext4 -F /root/tmp_mount_file
   mount /root/tmp_mount_file /mnt/
   /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
   umount /mnt

   loop device isn't strictly necessary but contains fs fallout when
   inevitable reboot is required to get rid of D state processes.

   Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
   Test hangs forever on v5.15-rt and latest v5.19-rc-rt.  Test threads
   in D state - any other i/o directed at /mnt also hangs in D state.