linux-ext4 - Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group

Message-Id: <F2742231-ADEB-4FDF-8A92-DD800AE2EDF1@dilger.ca>
Date:   Wed, 7 Apr 2021 16:36:47 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     Wen Yang <simon.wy@...baba-inc.com>
Cc:     Wen Yang <wenyang@...ux.alibaba.com>,
        riteshh <riteshh@...ux.ibm.com>,
        "Theodore Y. Ts'o" <tytso@....edu>,
        Baoyou Xie <baoyou.xie@...baba-inc.com>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Jan Kara <jack@...e.cz>
Subject: Re: [PATCH] ext4: add a configurable parameter to prevent endless
 loop in ext4_mb_discard_group_preallocations

On Apr 7, 2021, at 5:16 AM, riteshh <riteshh@...ux.ibm.com> wrote:
> 
> On 21/04/07 03:01PM, Wen Yang wrote:
>> From: Wen Yang <simon.wy@...baba-inc.com>
>> 
>> The kworker has occupied 100% of the CPU for several days:
>> PID USER  PR  NI VIRT RES SHR S  %CPU  %MEM TIME+  COMMAND
>> 68086 root 20 0  0    0   0   R  100.0 0.0  9718:18 kworker/u64:11
>> 
>> And the stack obtained through sysrq is as follows:
>> [20613144.850426] task: ffff8800b5e08000 task.stack: ffffc9001342c000
>> [20613144.850438] Call Trace:
>> [20613144.850439]  [<ffffffffa0244209>] ext4_mb_new_blocks+0x429/0x550 [ext4]
>> [20613144.850439]  [<ffffffffa02389ae>] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
>> [20613144.850441]  [<ffffffffa0204b52>] ext4_map_blocks+0x172/0x620 [ext4]
>> [20613144.850442]  [<ffffffffa0208675>] ext4_writepages+0x7e5/0xf00 [ext4]
>> [20613144.850443]  [<ffffffff811c487e>] do_writepages+0x1e/0x30
>> [20613144.850444]  [<ffffffff81280265>] __writeback_single_inode+0x45/0x320
>> [20613144.850444]  [<ffffffff81280ab2>] writeback_sb_inodes+0x272/0x600
>> [20613144.850445]  [<ffffffff81280ed2>] __writeback_inodes_wb+0x92/0xc0
>> [20613144.850445]  [<ffffffff81281238>] wb_writeback+0x268/0x300
>> [20613144.850446]  [<ffffffff812819f4>] wb_workfn+0xb4/0x380
>> [20613144.850447]  [<ffffffff810a5dc9>] process_one_work+0x189/0x420
>> [20613144.850447]  [<ffffffff810a60ae>] worker_thread+0x4e/0x4b0
>> 
>> The cpu resources of the cloud server are precious, and the server
>> cannot be restarted after running for a long time, so a configuration
>> parameter is added to prevent this endless loop.
> 
> Strange, if there is an endless loop here. In that case I would definitely
> check whether there is an accounting problem in pa->pa_count. Otherwise busy=1
> should not be set every time. The ext4_mb_show_pa() function may help debug this.
> 
> If yes, then that means there always exists either a file preallocation
> or a group preallocation. Maybe that is possible in some use case.
> Others may know of such a use case, if any.

If this code is broken, then it doesn't make sense to me to leave it in
the "run forever" state after the patch and require a sysfs tunable to be
set in order to have a properly working system.
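
To make that concrete, here is a minimal userspace sketch (not the actual
ext4 code) of a retry loop that is bounded by default, so that no tunable is
needed to avoid spinning. Everything in it is hypothetical: struct pa_sketch,
discard_pass(), discard_bounded() and the limit of 3 are illustrative names
and values only, and the retry condition (nothing freed yet, and a busy entry
was skipped) is only an assumed approximation of the real logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a preallocation; NOT the kernel's structure. */
struct pa_sketch {
	int refcount;	/* > 0 means the entry is still in use ("busy") */
	int blocks;	/* blocks that discarding this entry would free */
};

#define SKETCH_MAX_RETRIES 3	/* hypothetical built-in bound */

/* One pass: discard every unused entry; report whether a busy one was skipped. */
static int discard_pass(struct pa_sketch *pas, size_t n, bool *busy)
{
	int freed = 0;

	*busy = false;
	for (size_t i = 0; i < n; i++) {
		if (pas[i].refcount > 0) {
			*busy = true;
			continue;
		}
		freed += pas[i].blocks;
		pas[i].blocks = 0;
	}
	return freed;
}

/* Retry only while nothing was freed and something was busy, up to the bound. */
static int discard_bounded(struct pa_sketch *pas, size_t n, int *retries_out)
{
	bool busy = false;
	int freed = 0;
	int retries = 0;

	for (;;) {
		freed += discard_pass(pas, n, &busy);
		if (freed > 0 || !busy || retries >= SKETCH_MAX_RETRIES)
			break;
		retries++;
	}
	if (retries_out)
		*retries_out = retries;
	return freed;
}
```

With entries that stay busy forever, discard_bounded() gives up after
SKETCH_MAX_RETRIES retries and returns 0 instead of looping, which leaves the
caller free to fail the allocation or try elsewhere.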

Is there anything particularly strange about the workload/system that
might cause this?  Is the filesystem very full, is memory very low, etc.?


Cheers, Andreas