linux-ext4 - Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <a2af4039-65e0-6bb5-337e-fc160ae94aee@huawei.com>
Date: Thu, 30 Nov 2023 16:22:32 +0800
From: Zhang Yi <yi.zhang@...wei.com>
To: Christoph Hellwig <hch@...radead.org>, Ritesh Harjani
	<ritesh.list@...il.com>
CC: Jan Kara <jack@...e.cz>, <linux-ext4@...r.kernel.org>,
	<linux-fsdevel@...r.kernel.org>
Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use
 iomap

On 2023/11/30 12:30, Christoph Hellwig wrote:
> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> So I took a look at this. Here is what I think -
>> So this is useful of-course when we have a large folio. Because
>> otherwise it's just one block at a time for each folio. This is not a
>> problem once FS buffered-io handling code moves to iomap (because we
>> can then enable large folio support to it).
> 
> Yes.
> 
>> However, this would still require us to pass a folio to ->map_blocks
>> call to determine the size of the folio (which I am not saying can't be
>> done but just stating my observations here).
> 
> XFS currently maps based on the underlyig reservation (delalloc extent)
> and not the actual map size.   This works because on-disk extents are
> allocated as unwritten extents, and only the actual written part is
> the converted.  But if you only want to allocate blocks for the part

IIUC, I noticed a side effect of this map method, Let's think about a
special case, if we sync a partial range of a delalloc extent, and then
the system crashes before the remaining data write back. We could get a
file which has allocated blocks(unwritten) beyond EOF. I can reproduce
it on xfs.

 # write 10 blocks, but only sync 3 blocks, i_size becomes 12K after IO complete
 xfs_io -f -c "pwrite 0 40k" -c "sync_range 0 12k" /mnt/foo
 # postpone the remaining data writeback
 echo 20000 > /proc/sys/vm/dirty_writeback_centisecs
 echo 20000 > /proc/sys/vm/dirty_expire_centisecs
 # wait 35s to make sure xfs's cil log writeback
 sleep 35
 # triger system crash
 echo c > /proc/sysrq-trigger

After system reboot, the 'foo' file's size is 12K but have 10
allocated blocks.

 xfs_db> inode 131
 core.size = 12288
 core.nblocks = 10
 ...
 0:[0,24,3,0]
 1:[3,27,7,1]

I'm not expert on xfs, but I don't think this is the right result,
is it?

Thanks,
Yi.

> actually written you actually need to pass in the dirty range and not
> just use the whole folio.  This would be the incremental patch to do
> that:
> 
> http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752
> 
> But unless your block allocator is very cheap doing what XFS does is
> probably going to work much better.
> 
>> ...ok so here is the PoC for seq counter check for ext2. Next let me
>> try to see if we can lift this up from the FS side to iomap - 
> 
> This looks good to me from a very superficial view.  Dave is the expert
> on this, though.
> 
> 
> .
>