Message-ID: <296029e3-79c3-f603-7c3b-3429aac0e0c3@linux.alibaba.com>
Date: Wed, 21 Apr 2021 22:55:00 +0800
From: Wen Yang <wenyang@...ux.alibaba.com>
To: Theodore Ts'o <tytso@....edu>
Cc: Andreas Dilger <adilger.kernel@...ger.ca>,
Ritesh Harjani <riteshh@...ux.ibm.com>,
Baoyou Xie <baoyou.xie@...baba-inc.com>,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] fs/ext4: prevent the CPU from being 100% occupied in
ext4_mb_discard_group_preallocations
On 2021/4/19 00:06, Theodore Ts'o wrote:
> On Sun, Apr 18, 2021 at 06:28:34PM +0800, Wen Yang wrote:
>> The kworker has occupied 100% of the CPU for several days:
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 68086 root 20 0 0 0 0 R 100.0 0.0 9718:18 kworker/u64:11
>>
>> ....
>>
>> The thread that references this pa has been waiting for IO to return:
>> PID: 15140 TASK: ffff88004d6dc300 CPU: 16 COMMAND: "kworker/u64:1"
>> [ffffc900273e7518] __schedule at ffffffff8173ca3b
>> [ffffc900273e75a0] schedule at ffffffff8173cfb6
>> [ffffc900273e75b8] io_schedule at ffffffff810bb75a
>> [ffffc900273e75e0] bit_wait_io at ffffffff8173d8d1
>> [ffffc900273e75f8] __wait_on_bit_lock at ffffffff8173d4e9
>> [ffffc900273e7638] out_of_line_wait_on_bit_lock at ffffffff8173d742
>> [ffffc900273e76b0] __lock_buffer at ffffffff81288c32
>> [ffffc900273e76c8] do_get_write_access at ffffffffa00dd177 [jbd2]
>> [ffffc900273e7728] jbd2_journal_get_write_access at ffffffffa00dd3a3 [jbd2]
>> [ffffc900273e7750] __ext4_journal_get_write_access at ffffffffa023b37b [ext4]
>> [ffffc900273e7788] ext4_mb_mark_diskspace_used at ffffffffa0242a0b [ext4]
>> [ffffc900273e77f0] ext4_mb_new_blocks at ffffffffa0244100 [ext4]
>> [ffffc900273e7860] ext4_ext_map_blocks at ffffffffa02389ae [ext4]
>> [ffffc900273e7950] ext4_map_blocks at ffffffffa0204b52 [ext4]
>> [ffffc900273e79d0] ext4_writepages at ffffffffa0208675 [ext4]
>> [ffffc900273e7b30] do_writepages at ffffffff811c487e
>> [ffffc900273e7b40] __writeback_single_inode at ffffffff81280265
>> [ffffc900273e7b90] writeback_sb_inodes at ffffffff81280ab2
>> [ffffc900273e7c90] __writeback_inodes_wb at ffffffff81280ed2
>> [ffffc900273e7cd8] wb_writeback at ffffffff81281238
>> [ffffc900273e7d80] wb_workfn at ffffffff812819f4
>> [ffffc900273e7e18] process_one_work at ffffffff810a5dc9
>> [ffffc900273e7e60] worker_thread at ffffffff810a60ae
>> [ffffc900273e7ec0] kthread at ffffffff810ac696
>> [ffffc900273e7f50] ret_from_fork at ffffffff81741dd9
>>
>> On the bare metal server, we will use multiple hard disks, the Linux
>> kernel will run on the system disk, and business programs will run on
>> several hard disks virtualized by the BM hypervisor. The reason why IO
>> has not returned here is that the process handling IO in the BM hypervisor
>> has failed.
>
> So if the I/O was not returning for several days, such that this
> thread had been hanging for that long, it also follows that since it
> was calling do_get_write_access(), a handle was open. And if a handle
> is open, then the current jbd2 transaction could never close --- which
> means none of the file system operations executed over the past few
> days would ever commit, and they would be undone on the next reboot.
> Furthermore, sooner or later the journal would run out of space, at
> which point the *entire* system would be locked up waiting for the
> transaction to close.
>
> I'm guessing that if the server hadn't come to a full livelock
> earlier, it's because there aren't that many metadata operations that
> are happening in the server's steady-state operation. But in any
> case, this particular server was/is(?) doomed, and all of the patches
> that you proposed are not going to help in the long run. The correct
> fix is to fix the hypervisor, which is the root cause of the problem.
>
Yes, in the end the whole system was affected, as follows:
crash> ps | grep UN
281 2 16 ffff881fb011c300 UN 0.0 0 0 [kswapd_0]
398 358 9 ffff880084094300 UN 0.0 30892 2592 systemd-journal
......
2093 358 28 ffff880012d2c300 UN 0.0 241676 15108 syslog-ng
2119 358 0 ffff88005a252180 UN 0.0 124340 3148 crond
......
PID: 281 TASK: ffff881fb011c300 CPU: 16 COMMAND: "kswapd_0"
#0 [ffffc9000d7af7e0] __schedule at ffffffff8173ca3b
#1 [ffffc9000d7af868] schedule at ffffffff8173cfb6
#2 [ffffc9000d7af880] wait_transaction_locked at ffffffffa00db08a [jbd2]
#3 [ffffc9000d7af8d8] add_transaction_credits at ffffffffa00db2c0 [jbd2]
#4 [ffffc9000d7af938] start_this_handle at ffffffffa00db64f [jbd2]
#5 [ffffc9000d7af9c8] jbd2__journal_start at ffffffffa00dbe3e [jbd2]
#6 [ffffc9000d7afa18] __ext4_journal_start_sb at ffffffffa023b0dd [ext4]
#7 [ffffc9000d7afa58] ext4_release_dquot at ffffffffa02202f2 [ext4]
#8 [ffffc9000d7afa78] dqput at ffffffff812b9bef
#9 [ffffc9000d7afaa0] __dquot_drop at ffffffff812b9eaf
#10 [ffffc9000d7afad8] dquot_drop at ffffffff812b9f22
#11 [ffffc9000d7afaf0] ext4_clear_inode at ffffffffa02291f2 [ext4]
#12 [ffffc9000d7afb08] ext4_evict_inode at ffffffffa020a939 [ext4]
#13 [ffffc9000d7afb28] evict at ffffffff8126d05a
#14 [ffffc9000d7afb50] dispose_list at ffffffff8126d16b
#15 [ffffc9000d7afb78] prune_icache_sb at ffffffff8126e2ba
#16 [ffffc9000d7afbb0] super_cache_scan at ffffffff8125320e
#17 [ffffc9000d7afc08] shrink_slab at ffffffff811cab55
#18 [ffffc9000d7afce8] shrink_node at ffffffff811d000e
#19 [ffffc9000d7afd88] balance_pgdat at ffffffff811d0f42
#20 [ffffc9000d7afe58] kswapd at ffffffff811d14f1
#21 [ffffc9000d7afec0] kthread at ffffffff810ac696
#22 [ffffc9000d7aff50] ret_from_fork at ffffffff81741dd9
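
This backtrace matches the jbd2 handle lifecycle described above: every
metadata update runs inside a handle, and the running transaction cannot
commit until all open handles are stopped, so new starters block in
start_this_handle(). A simplified sketch of that lifecycle (illustration
only, not the actual ext4 code paths):

    handle_t *handle;

    /* Join (or open) the running transaction. */
    handle = jbd2_journal_start(journal, nblocks);

    /* May block on a locked buffer -- where the writeback worker is stuck. */
    err = jbd2_journal_get_write_access(handle, bh);

    /* ... modify the metadata buffer ... */
    jbd2_journal_dirty_metadata(handle, bh);

    /*
     * Until this runs, the transaction stays open, and every later
     * jbd2__journal_start() caller (like kswapd_0 above) waits.
     */
    jbd2_journal_stop(handle);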
> I could imagine some kind of retry counter, where we start sleeping
> after some number of retries, and give up after some larger number of
> retries (at which point the allocation would fail with ENOSPC). We'd
> need to do some testing against our current tests which test how we
> handle running close to ENOSPC, and I'm not at all convinced it's
> worth the effort in the end. We're trying to (slightly) improve the
> case where (a) the file system is running close to full, (b) the
> hypervisor is critically flawed and is the real problem, and (c) the
> VM is eventually doomed to fail anyway due to a transaction never
> closing due to an I/O never getting acknowledged for days(!).
>
Great. If you make any progress, we'll be happy to test it in our
production environment. We also look forward to working together to
optimize it.
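
To make the idea concrete, here is a rough, untested sketch of such a
bounded retry loop around the existing repeat: logic in
ext4_mb_new_blocks(). The constant names, the msleep() interval, and the
allocation_failed placeholder are ours, for illustration only:

    #define EXT4_MB_SLEEP_AFTER_RETRIES 4   /* start sleeping here */
    #define EXT4_MB_MAX_RETRIES         16  /* give up here */
    #define EXT4_MB_RETRY_DELAY_MS      100

    int retries = 0;

repeat:
    /* ... normal allocation attempt ... */
    if (allocation_failed) {
        freed = ext4_mb_discard_preallocations(sb, ar->len);
        if (freed && ++retries <= EXT4_MB_MAX_RETRIES) {
            if (retries > EXT4_MB_SLEEP_AFTER_RETRIES)
                msleep(EXT4_MB_RETRY_DELAY_MS); /* yield instead of spinning */
            goto repeat;
        }
        *errp = -ENOSPC; /* bounded: fail instead of looping forever */
    }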
> If you really want to fix things in the guest OS, then perhaps the
> virtio_scsi driver (or whatever I/O driver you are using) should
> notice when an I/O request hasn't gotten acknowledged after minutes or
> hours, and do something such as force a SCSI reset (which will result
> in the file system needing to be unmounted and recovered, but due to
> the hypervisor bug, that was an inevitable end result anyway).
>
Yes, but unfortunately it may not be finished in a short time.
We may refer to the QEMU community's documentation as follows:
https://wiki.qemu.org/ToDo/Block
Add a cancel command to the virtio-blk device so that running requests
can be aborted. This requires changing the VIRTIO spec, extending QEMU's
device emulation, and implementing blk_mq_ops->timeout() in Linux
virtio_blk.ko. This task depends on first implementing real request
cancellation in QEMU.
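
As a sketch of what the guest side could eventually look like, here is a
hypothetical blk_mq_ops->timeout() hook for virtio_blk (no such handler
exists upstream today; without a VIRTIO cancel command it can only report
the stall and re-arm the timer):

    static enum blk_eh_timer_return virtblk_timeout(struct request *req,
                                                    bool reserved)
    {
        struct virtio_blk *vblk = req->q->queuedata;

        dev_warn(&vblk->vdev->dev,
                 "request timed out; host may have stopped servicing I/O\n");

        /*
         * Once the VIRTIO spec grows a cancel command, we could abort
         * the request (or reset the device) and return BLK_EH_DONE; for
         * now, re-arm the timer so the stall is at least visible.
         */
        return BLK_EH_RESET_TIMER;
    }

wired into the existing ops table:

    static const struct blk_mq_ops virtio_mq_ops = {
        .queue_rq     = virtio_queue_rq,
        .complete     = virtblk_request_done,
        .init_request = virtblk_init_request,
        .map_queues   = virtblk_map_queues,
        .timeout      = virtblk_timeout, /* new */
    };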
--
Best wishes,
Wen