Date: Wed, 21 Apr 2021 22:55:00 +0800
From: Wen Yang <wenyang@...ux.alibaba.com>
To: Theodore Ts'o <tytso@....edu>
Cc: Andreas Dilger <adilger.kernel@...ger.ca>,
	Ritesh Harjani <riteshh@...ux.ibm.com>,
	Baoyou Xie <baoyou.xie@...baba-inc.com>,
	linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] fs/ext4: prevent the CPU from being 100% occupied in ext4_mb_discard_group_preallocations

On 2021/4/19 12:06 AM, Theodore Ts'o wrote:
> On Sun, Apr 18, 2021 at 06:28:34PM +0800, Wen Yang wrote:
>> The kworker has occupied 100% of the CPU for several days:
>>
>>    PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
>>  68086 root  20   0     0    0    0 R 100.0  0.0  9718:18 kworker/u64:11
>>
>> ....
>>
>> The thread that references this pa has been waiting for IO to return:
>>
>> PID: 15140  TASK: ffff88004d6dc300  CPU: 16  COMMAND: "kworker/u64:1"
>>  [ffffc900273e7518] __schedule at ffffffff8173ca3b
>>  [ffffc900273e75a0] schedule at ffffffff8173cfb6
>>  [ffffc900273e75b8] io_schedule at ffffffff810bb75a
>>  [ffffc900273e75e0] bit_wait_io at ffffffff8173d8d1
>>  [ffffc900273e75f8] __wait_on_bit_lock at ffffffff8173d4e9
>>  [ffffc900273e7638] out_of_line_wait_on_bit_lock at ffffffff8173d742
>>  [ffffc900273e76b0] __lock_buffer at ffffffff81288c32
>>  [ffffc900273e76c8] do_get_write_access at ffffffffa00dd177 [jbd2]
>>  [ffffc900273e7728] jbd2_journal_get_write_access at ffffffffa00dd3a3 [jbd2]
>>  [ffffc900273e7750] __ext4_journal_get_write_access at ffffffffa023b37b [ext4]
>>  [ffffc900273e7788] ext4_mb_mark_diskspace_used at ffffffffa0242a0b [ext4]
>>  [ffffc900273e77f0] ext4_mb_new_blocks at ffffffffa0244100 [ext4]
>>  [ffffc900273e7860] ext4_ext_map_blocks at ffffffffa02389ae [ext4]
>>  [ffffc900273e7950] ext4_map_blocks at ffffffffa0204b52 [ext4]
>>  [ffffc900273e79d0] ext4_writepages at ffffffffa0208675 [ext4]
>>  [ffffc900273e7b30] do_writepages at ffffffff811c487e
>>  [ffffc900273e7b40] __writeback_single_inode at ffffffff81280265
>>  [ffffc900273e7b90] writeback_sb_inodes at ffffffff81280ab2
>>  [ffffc900273e7c90] __writeback_inodes_wb at ffffffff81280ed2
>>  [ffffc900273e7cd8] wb_writeback at ffffffff81281238
>>  [ffffc900273e7d80] wb_workfn at ffffffff812819f4
>>  [ffffc900273e7e18] process_one_work at ffffffff810a5dc9
>>  [ffffc900273e7e60] worker_thread at ffffffff810a60ae
>>  [ffffc900273e7ec0] kthread at ffffffff810ac696
>>  [ffffc900273e7f50] ret_from_fork at ffffffff81741dd9
>>
>> On the bare metal server we use multiple hard disks: the Linux
>> kernel runs on the system disk, and business programs run on several
>> hard disks virtualized by the BM hypervisor. The reason the IO has
>> not returned here is that the process handling IO in the BM
>> hypervisor has failed.
>
> So if the I/O has not returned for several days, such that this
> thread has been hanging for that long, it also follows that, since it
> was calling do_get_write_access(), a handle was open. And if a handle
> is open, then the current jbd2 transaction can never close --- which
> means none of the file system operations executed over the past few
> days would ever commit, and they would be undone on the next reboot.
> Furthermore, sooner or later the journal would run out of space, at
> which point the *entire* system would be locked up waiting for the
> transaction to close.
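To make the handle lifecycle concrete, it looks roughly like this; a
minimal sketch only, with the real ext4 wrappers and most error
handling omitted:

#include <linux/jbd2.h>
#include <linux/buffer_head.h>
#include <linux/err.h>

/*
 * Minimal sketch of the jbd2 handle lifecycle.  While the handle is
 * open, the running transaction cannot commit.  If the buffer lock
 * taken inside jbd2_journal_get_write_access() never becomes
 * available because the underlying I/O never completes,
 * jbd2_journal_stop() is never reached and the transaction stays
 * pinned indefinitely.
 */
static int modify_metadata(journal_t *journal, struct buffer_head *bh)
{
	handle_t *handle;
	int err;

	/* Join (or start) the running transaction. */
	handle = jbd2_journal_start(journal, 1);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* May block forever in __lock_buffer() if the I/O never returns. */
	err = jbd2_journal_get_write_access(handle, bh);
	if (!err) {
		/* ... modify the buffer contents here ... */
		err = jbd2_journal_dirty_metadata(handle, bh);
	}

	/* Only after this can the transaction close and commit. */
	jbd2_journal_stop(handle);
	return err;
}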
> I'm guessing that if the server hadn't come to a full livelock
> earlier, it's because there aren't that many metadata operations
> happening in the server's steady-state operation. But in any case,
> this particular server was/is(?) doomed, and all of the patches that
> you proposed are not going to help in the long run. The correct fix
> is to fix the hypervisor, which is the root cause of the problem.

Yes, in the end the whole system was affected, as follows:

crash> ps | grep UN
    281      2  16  ffff881fb011c300  UN   0.0       0      0  [kswapd_0]
    398    358   9  ffff880084094300  UN   0.0   30892   2592  systemd-journal
......
   2093    358  28  ffff880012d2c300  UN   0.0  241676  15108  syslog-ng
   2119    358   0  ffff88005a252180  UN   0.0  124340   3148  crond
......

PID: 281  TASK: ffff881fb011c300  CPU: 16  COMMAND: "kswapd_0"
 #0 [ffffc9000d7af7e0] __schedule at ffffffff8173ca3b
 #1 [ffffc9000d7af868] schedule at ffffffff8173cfb6
 #2 [ffffc9000d7af880] wait_transaction_locked at ffffffffa00db08a [jbd2]
 #3 [ffffc9000d7af8d8] add_transaction_credits at ffffffffa00db2c0 [jbd2]
 #4 [ffffc9000d7af938] start_this_handle at ffffffffa00db64f [jbd2]
 #5 [ffffc9000d7af9c8] jbd2__journal_start at ffffffffa00dbe3e [jbd2]
 #6 [ffffc9000d7afa18] __ext4_journal_start_sb at ffffffffa023b0dd [ext4]
 #7 [ffffc9000d7afa58] ext4_release_dquot at ffffffffa02202f2 [ext4]
 #8 [ffffc9000d7afa78] dqput at ffffffff812b9bef
 #9 [ffffc9000d7afaa0] __dquot_drop at ffffffff812b9eaf
#10 [ffffc9000d7afad8] dquot_drop at ffffffff812b9f22
#11 [ffffc9000d7afaf0] ext4_clear_inode at ffffffffa02291f2 [ext4]
#12 [ffffc9000d7afb08] ext4_evict_inode at ffffffffa020a939 [ext4]
#13 [ffffc9000d7afb28] evict at ffffffff8126d05a
#14 [ffffc9000d7afb50] dispose_list at ffffffff8126d16b
#15 [ffffc9000d7afb78] prune_icache_sb at ffffffff8126e2ba
#16 [ffffc9000d7afbb0] super_cache_scan at ffffffff8125320e
#17 [ffffc9000d7afc08] shrink_slab at ffffffff811cab55
#18 [ffffc9000d7afce8] shrink_node at ffffffff811d000e
#19 [ffffc9000d7afd88] balance_pgdat at ffffffff811d0f42
#20 [ffffc9000d7afe58] kswapd at ffffffff811d14f1
#21 [ffffc9000d7afec0] kthread at ffffffff810ac696
#22 [ffffc9000d7aff50] ret_from_fork at ffffffff81741dd9

> I could imagine some kind of retry counter, where we start sleeping
> after some number of retries, and give up after some larger number of
> retries (at which point the allocation would fail with ENOSPC). We'd
> need to do some testing against our current tests which test how we
> handle running close to ENOSPC, and I'm not at all convinced it's
> worth the effort in the end. We're trying to (slightly) improve the
> case where (a) the file system is running close to full, (b) the
> hypervisor is critically flawed and is the real problem, and (c) the
> VM is eventually doomed to fail anyway due to a transaction never
> closing due to an I/O never getting acknowledged for days(!).

Great. If you make any progress, we will be happy to test it in our
production environment. We also look forward to working together to
optimize it.
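For discussion purposes, the retry policy you describe might look
roughly like the sketch below. The constants and the wrapper are made
up for illustration; this is not an actual patch:

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/sched.h>

/*
 * Hypothetical sketch of the suggested policy: just yield for the
 * first few attempts, sleep between later ones, and finally give up
 * so the allocation fails with ENOSPC instead of livelocking a CPU.
 */
#define MB_DISCARD_BUSY_RETRIES		16	/* just reschedule at first */
#define MB_DISCARD_MAX_RETRIES		64	/* then give up entirely */

static int mb_discard_with_backoff(struct super_block *sb, int needed)
{
	int retries = 0;

	/* ext4_mb_discard_preallocations() returns the number of blocks freed. */
	while (ext4_mb_discard_preallocations(sb, needed) == 0) {
		if (++retries > MB_DISCARD_MAX_RETRIES)
			return -ENOSPC;		/* let the allocation fail */
		if (retries > MB_DISCARD_BUSY_RETRIES)
			msleep(10);		/* back off instead of spinning */
		else
			cond_resched();		/* at least yield the CPU */
	}
	return 0;
}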
> If you really want to fix things in the guest OS, perhaps the
> virtio_scsi driver (or whatever I/O driver you are using) should
> notice when an I/O request hasn't been acknowledged after minutes or
> hours, and do something such as force a SCSI reset (which will result
> in the file system needing to be unmounted and recovered, but due to
> the hypervisor bug, that was an inevitable end result anyway).

Yes, but unfortunately it may not be finished in a short time. We may
refer to the QEMU community's documentation:
https://wiki.qemu.org/ToDo/Block

    Add a cancel command to the virtio-blk device so that running
    requests can be aborted. This requires changing the VIRTIO spec,
    extending QEMU's device emulation, and implementing
    blk_mq_ops->timeout() in Linux virtio_blk.ko. This task depends
    on first implementing real request cancellation in QEMU.
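The blk_mq_ops->timeout() piece might look roughly like the sketch
below. To be clear, this is purely illustrative: upstream virtio_blk
implements no such hook today, and virtblk_cancel_request() is an
imaginary helper standing in for the VIRTIO cancel command that does
not exist in the spec yet.

#include <linux/blk-mq.h>

/* Imaginary helper: would issue the not-yet-specified cancel command
 * and report whether the device actually dropped the request. */
static bool virtblk_cancel_request(struct request *req);

/* Sketch of a timeout hook (signature as of the v5.x kernels). */
static enum blk_eh_timer_return virtblk_timeout(struct request *req,
						bool reserved)
{
	if (virtblk_cancel_request(req)) {
		/* The device dropped the request; complete it with an error. */
		blk_mq_end_request(req, BLK_STS_TIMEOUT);
		return BLK_EH_DONE;
	}

	/* The request may still complete normally; re-arm the timer. */
	return BLK_EH_RESET_TIMER;
}

static const struct blk_mq_ops virtio_mq_ops_sketch = {
	/* .queue_rq, .complete, etc. as in drivers/block/virtio_blk.c */
	.timeout	= virtblk_timeout,
};

--
Best wishes,
Wen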