lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 21 Apr 2021 22:55:00 +0800
From:   Wen Yang <>
To:     Theodore Ts'o <>
Cc:     Andreas Dilger <>,
        Ritesh Harjani <>,
        Baoyou Xie <>,,
Subject: Re: [PATCH] fs/ext4: prevent the CPU from being 100% occupied in

在 2021/4/19 上午12:06, Theodore Ts'o 写道:
> On Sun, Apr 18, 2021 at 06:28:34PM +0800, Wen Yang wrote:
>> The kworker has occupied 100% of the CPU for several days:
>> 68086 root 20 0  0    0   0   R  100.0 0.0  9718:18 kworker/u64:11
>> ....
>> The thread that references this pa has been waiting for IO to return:
>> PID: 15140  TASK: ffff88004d6dc300  CPU: 16  COMMAND: "kworker/u64:1"
>> [ffffc900273e7518] __schedule at ffffffff8173ca3b
>> [ffffc900273e75a0] schedule at ffffffff8173cfb6
>> [ffffc900273e75b8] io_schedule at ffffffff810bb75a
>> [ffffc900273e75e0] bit_wait_io at ffffffff8173d8d1
>> [ffffc900273e75f8] __wait_on_bit_lock at ffffffff8173d4e9
>> [ffffc900273e7638] out_of_line_wait_on_bit_lock at ffffffff8173d742
>> [ffffc900273e76b0] __lock_buffer at ffffffff81288c32
>> [ffffc900273e76c8] do_get_write_access at ffffffffa00dd177 [jbd2]
>> [ffffc900273e7728] jbd2_journal_get_write_access at ffffffffa00dd3a3 [jbd2]
>> [ffffc900273e7750] __ext4_journal_get_write_access at ffffffffa023b37b [ext4]
>> [ffffc900273e7788] ext4_mb_mark_diskspace_used at ffffffffa0242a0b [ext4]
>> [ffffc900273e77f0] ext4_mb_new_blocks at ffffffffa0244100 [ext4]
>> [ffffc900273e7860] ext4_ext_map_blocks at ffffffffa02389ae [ext4]
>> [ffffc900273e7950] ext4_map_blocks at ffffffffa0204b52 [ext4]
>> [ffffc900273e79d0] ext4_writepages at ffffffffa0208675 [ext4]
>> [ffffc900273e7b30] do_writepages at ffffffff811c487e
>> [ffffc900273e7b40] __writeback_single_inode at ffffffff81280265
>> [ffffc900273e7b90] writeback_sb_inodes at ffffffff81280ab2
>> [ffffc900273e7c90] __writeback_inodes_wb at ffffffff81280ed2
>> [ffffc900273e7cd8] wb_writeback at ffffffff81281238
>> [ffffc900273e7d80] wb_workfn at ffffffff812819f4
>> [ffffc900273e7e18] process_one_work at ffffffff810a5dc9
>> [ffffc900273e7e60] worker_thread at ffffffff810a60ae
>> [ffffc900273e7ec0] kthread at ffffffff810ac696
>> [ffffc900273e7f50] ret_from_fork at ffffffff81741dd9
>> On the bare metal server, we will use multiple hard disks, the Linux
>> kernel will run on the system disk, and business programs will run on
>> several hard disks virtualized by the BM hypervisor. The reason why IO
>> has not returned here is that the process handling IO in the BM hypervisor
>> has failed.
> So if the I/O not returning for every days, such that this thread had
> been hanging for that long, it also follows that since it was calling
> do_get_write_access(), that a handle was open.  And if a handle is
> open, then the current jbd2 transaction would never close --- which
> means none of the file system operations executed over the past few
> days would never commit, and would be undone on the next reboot.
> Furthermore, sooner or later the journal would run out of space, at
> which point the *entire* system would be locked up waiting for the
> transaction to close.
> I'm guessing that if the server hadn't come to a full livelock
> earlier, it's because there aren't that many metadata operations that
> are happening in the server's stable state operation.  But in any
> case, this particular server was/is(?) doomed, and all of the patches
> that you proposed are not going to help in the long run.  The correct
> fix is to fix the hypervisor, which is the root cause of the problem.

Yes, in the end, the whole system was affected, as follows

crash> ps | grep UN
     281      2  16  ffff881fb011c300  UN   0.0       0      0  [kswapd_0]
     398    358   9  ffff880084094300  UN   0.0   30892   2592 
    2093    358  28  ffff880012d2c300  UN   0.0  241676  15108  syslog-ng
    2119    358   0  ffff88005a252180  UN   0.0  124340   3148  crond

PID: 281    TASK: ffff881fb011c300  CPU: 16  COMMAND: "kswapd_0"
  #0 [ffffc9000d7af7e0] __schedule at ffffffff8173ca3b
  #1 [ffffc9000d7af868] schedule at ffffffff8173cfb6
  #2 [ffffc9000d7af880] wait_transaction_locked at ffffffffa00db08a [jbd2]
  #3 [ffffc9000d7af8d8] add_transaction_credits at ffffffffa00db2c0 [jbd2]
  #4 [ffffc9000d7af938] start_this_handle at ffffffffa00db64f [jbd2]
  #5 [ffffc9000d7af9c8] jbd2__journal_start at ffffffffa00dbe3e [jbd2]
  #6 [ffffc9000d7afa18] __ext4_journal_start_sb at ffffffffa023b0dd [ext4]
  #7 [ffffc9000d7afa58] ext4_release_dquot at ffffffffa02202f2 [ext4]
  #8 [ffffc9000d7afa78] dqput at ffffffff812b9bef
  #9 [ffffc9000d7afaa0] __dquot_drop at ffffffff812b9eaf
#10 [ffffc9000d7afad8] dquot_drop at ffffffff812b9f22
#11 [ffffc9000d7afaf0] ext4_clear_inode at ffffffffa02291f2 [ext4]
#12 [ffffc9000d7afb08] ext4_evict_inode at ffffffffa020a939 [ext4]
#13 [ffffc9000d7afb28] evict at ffffffff8126d05a
#14 [ffffc9000d7afb50] dispose_list at ffffffff8126d16b
#15 [ffffc9000d7afb78] prune_icache_sb at ffffffff8126e2ba
#16 [ffffc9000d7afbb0] super_cache_scan at ffffffff8125320e
#17 [ffffc9000d7afc08] shrink_slab at ffffffff811cab55
#18 [ffffc9000d7afce8] shrink_node at ffffffff811d000e
#19 [ffffc9000d7afd88] balance_pgdat at ffffffff811d0f42
#20 [ffffc9000d7afe58] kswapd at ffffffff811d14f1
#21 [ffffc9000d7afec0] kthread at ffffffff810ac696
#22 [ffffc9000d7aff50] ret_from_fork at ffffffff81741dd9

> I could imagine some kind of retry counter, where we start sleeping
> after some number of retries, and give up after some larger number of
> retries (at which point the allocation would fail with ENOSPC).  We'd
> need to do some testing against our current tests which test how we
> handle running close to ENOSPC, and I'm not at all convinced it's
> worth the effort in the end.  We're trying to (slightly) improve the
> case where (a) the file system is running close to full, (b) the
> hypervisor is critically flawed and is the real problem, and (c) the
> VM is eventually doomed to fail anyway due to a transaction never
> closing due to an I/O never getting acknowledged for days(!).

Great. If you have any progress, we'll be happy to test it in our 
production environment. We are also looking forward to working together 
to optimize it.

> If you really want to fix things in the guest OS, I perhaps the
> virtio_scsi driver (or whatever I/O driver you are using), should
> notice when an I/O request hasn't gotten acknowledged after minutes or
> hours, and do something such as force a SCSI reset (which will result
> in the file system needing to be unmounted and recovered, but due to
> the hypervisor bug, that was an inevitable end result anyway).

Yes, but unfortunately, it may not be finished in a short time.

We may refer to the documentation of the qemo community as follows:

Add a cancel command to the virtio-blk device so that running requests 
can be aborted. This requires changing the VIRTIO spec, extending QEMU's 
device emulation, and implementing blk_mq_ops->timeout() in Linux 
virtio_blk.ko. This task depends on first implementing real request 
cancellation in QEMU.

Best wishes,

Powered by blists - more mailing lists