linux-ext4 - Re: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 1 Sep 2023 10:33:02 +0800
From:   Zhang Yi <yi.zhang@...weicloud.com>
To:     Jan Kara <jack@...e.cz>
Cc:     linux-ext4@...r.kernel.org, tytso@....edu,
        adilger.kernel@...ger.ca, yi.zhang@...wei.com,
        chengzhihao1@...wei.com, yukuai3@...wei.com
Subject: Re: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for
 delalloc mount option

Hello! Thanks for your reply.

On 2023/8/30 23:30, Jan Kara wrote:
> Hello!
> 
> On Thu 24-08-23 17:26:03, Zhang Yi wrote:
>> The delayed allocation method allocates blocks during page writeback in
>> ext4_writepages(), which cannot handle block allocation failures due to
>> e.g. ENOSPC if acquires more extent blocks. In order to deal with this,
>> commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low
>> on free blocks count.")' introduce ext4_nonda_switch() to convert to no
>> delalloc mode if the space if the free blocks is less than 150% of dirty
>> blocks or the watermark.
> 
> Well, that functionality is there mainly so that we can really allocate all
> blocks available in the filesystem. But you are right that since we are not
> reserving metadata blocks explicitely anymore, it is questionable whether
> we really still need this.
> 
>> In the meantime, '27dd43854227 ("ext4:
>> introduce reserved space")' reserve some of the file system space (2% or
>> 4096 clusters, whichever is smaller). Both of those to solutions can
>> make sure that space is not exhausted when mapping delalloc blocks in
>> most cases, but cannot guarantee work in all cases, which could lead to
>> infinite loop or data loss (please see patch 14 for details).
> 
> OK, I agree that in principle there could be problems due to percpu
> counters inaccuracy etc. but were you able to reproduce the problem under
> some at least somewhat realistic conditions? We were discussing making
> free space percpu counters switch to exact counting in case we are running
> tight on space to avoid these problems but it never proved to be a problem
> in practice so we never bothered to actually implement it.
> 

Yes, we catch this problem in our products firstly and we reproduced it
when doing stress test on low free space disk, but the frequency is
very low. After analysis we found the root cause, and Zhihao helped to
write a 100% reproducer below.

1. Apply the 'infinite_loop.diff' in attachment, add info and delay
   into ext4 code.
2. Run 'enospc.sh' on a virtual machine with 4 CPU (important, because
   the cpu number will affect EXT4_FREECLUSTERS_WATERMARK and also
   affect the reproduce).

After several minutes, the writeback process will loop infinitely, and
other processes which rely on it will hung.

[  304.815575] INFO: task sync:7292 blocked for more than 153 seconds.
[  304.818130]       Not tainted 6.5.0-dirty #578
[  304.819926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  304.822747] task:sync            state:D stack:0     pid:7292  ppid:1      flags:0x00004000
[  304.825677] Call Trace:
[  304.826538]  <TASK>
[  304.827307]  __schedule+0x577/0x12b0
[  304.828300]  ? sync_fs_one_sb+0x50/0x50
[  304.829433]  schedule+0x9d/0x1e0
[  304.830451]  wb_wait_for_completion+0x82/0xd0
[  304.831811]  ? cpuacct_css_alloc+0x100/0x100
[  304.833090]  sync_inodes_sb+0xf1/0x440
[  304.834207]  ? sync_fs_one_sb+0x50/0x50
[  304.835304]  sync_inodes_one_sb+0x21/0x30
[  304.836528]  iterate_supers+0xd2/0x180
[  304.837614]  ksys_sync+0x50/0xf0
[  304.838356]  __do_sys_sync+0x12/0x20
[  304.839207]  do_syscall_64+0x68/0xf0
[  304.839964]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

On the contrary, after doing a little tweaking the delay injection
procedure, we could reproduce the data loss problem easily.

1. Apply the 'data_loss.diff' in the attachment.
2. Run 'enospc.sh' like the previous one, then we got below error message.

[   52.226320] EXT4-fs (sda): Delayed block allocation failed for inode 571 at logical offset 8 with max blocks 1 with error 28
[   52.229126] EXT4-fs (sda): This should not happen!! Data will be lost

>> This patch set wants to reserve metadata space more accurate for
>> delalloc mount option. The metadata blocks reservation is very tricky
>> and is also related to the continuity of physical blocks, an effective
>> way is to reserve as the worst case, which means that every data block
>> is discontinuous and one data block costs an extent entry. Reserve
>> metadata space as the worst case can make sure enough blocks reserved
>> during data writeback, the unused reservaion space can be released after
>> mapping data blocks.
> 
> Well, as you say, there is a problem with the worst case estimates - either
> you *heavily* overestimate the number of needed metadata blocks or the code
> to estimate the number of needed metadata blocks is really complex. We used
> to have estimates of needed metadata and we ditched that code (in favor of
> reserved clusters) exactly because it was complex and suffered from
> cornercases that were hard to fix. I haven't quite digested the other
> patches in this series to judge which case is it but it seems to lean on
> the "complex" side :).
> 
> So I'm somewhat skeptical this complexity is really needed but I can be
> convinced :).

I understand your concern. At first I tried to solve this problem with
other simple solutions, but failed. I suppose reserve blocks for the
worst case is the only way to cover all cases, and I noticed that xfs
also uses this reservation method, so I learned the implementation from
it, but it's not exactly the same.

Although it's a worst-case reservation, it is not that complicated.
Firstly, the estimate formula is simple, just add the 'extent & node'
blocks calculated by the **total** delay allocated data blocks and the
remaining btree heights (no need to concern whether the logical
positions of each extents are continuous or not, the btree heights can
be merged between each discontinuous extent entries of one inode, this
can reduce overestimate to some extent). Secondary, the method of
reserving metadata blocks is similar to that of reserving data blocks,
just the estimate formula is different. Fortunately, there already have
data reservation helpers like ext4_da_update_reserve_space() and
ext4_da_release_reserve_space(), it works only takes some minor
changes. BTW, I don't really know the ditched estimation and
cornercases you mentioned, how it's like?

Finally, maybe this reservation could bring other benefits in the long
run. For example, after we've done this, maybe we could also reserve
metadata for DIO and buffer IO with dioread_nolock in the future, then
we could drop EXT4_GET_BLOCKS_PRE_IO safely, which looks like a
compromise, maybe we could get some improve performance if do this (I
haven't thought deeply, just a whim :) ). But it's a different thing.

> 
>> After doing this, add a worker to submit delayed
>> allocations to prevent excessive reservations. Finally, we could
>> completely drop the policy of switching back to non-delayed allocation.
> 
> BTW the worker there in patch 15 seems really pointless. If you do:
> queue_work(), flush_work() then you could just directly do the work inline
> and get as a bonus more efficiency and proper lockdep tracking of
> dependencies. But that's just a minor issue I have noticed.
> 

Yes, I added this worker because I want to run the work asynchronously
if the s_dirtyclusters_counter is running beyond watermark. In this way,
the I/O flow could be smoother. But I didn't implement it because I
don't know if you like this estimate solution, I can do it if so.

Thanks,
Yi.



View attachment "data_loss.diff" of type "text/plain" (2808 bytes)

View attachment "infinite_loop.diff" of type "text/plain" (2808 bytes)

View attachment "enospc.sh" of type "text/plain" (908 bytes)