linux-kernel - Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <644afc72-71bb-4201-8829-ccf3211d68b7@oracle.com>
Date: Mon, 24 Jun 2024 14:58:40 +0100
From: John Garry <john.g.garry@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: axboe@...nel.dk, tytso@....edu, dchinner@...hat.com,
        viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.com,
        chandan.babu@...cle.com, hch@....de, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-btrfs@...r.kernel.org,
        linux-erofs@...ts.ozlabs.org, linux-ext4@...r.kernel.org,
        linux-f2fs-devel@...ts.sourceforge.net, linux-fsdevel@...r.kernel.org,
        gfs2@...ts.linux.dev, linux-xfs@...r.kernel.org,
        catherine.hoang@...cle.com, ritesh.list@...il.com, mcgrof@...nel.org,
        mikulas@...ax.karlin.mff.cuni.cz, agruenba@...hat.com,
        miklos@...redi.hu, martin.petersen@...cle.com
Subject: Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing
 size

On 21/06/2024 22:18, Darrick J. Wong wrote:
> On Thu, Jun 13, 2024 at 11:31:35AM +0100, John Garry wrote:
>> On 12/06/2024 22:32, Darrick J. Wong wrote:
>>>> unsigned int fs_block_size = i_blocksize(inode), pad;
>>>> +	u64 io_block_size = iomap->io_block_size;
>>> I wonder, should iomap be nice and not require filesystems to set
>>> io_block_size themselves unless they really need it?
>>
>> That's what I had in v3, like:
>>
>> if (iomap->io_block_size)
>> 	io_block_size = iomap->io_block_size;
>> else
>> 	io_block_size = i_block_size(inode)
>>
>> but it was suggested to change that (to like what I have here).
> 
> oh, ok.  Ignore that comment, then. :)
> 

To be clear, Dave actually suggested that change.

>>> Anyone working on
>>> an iomap port while this patchset is in progress may or may not remember
>>> to add this bit if they get their port merged after atomicwrites is
>>> merged; and you might not remember to prevent the bitrot if the reverse
>>> order happens.
>>
>> Sure, I get your point.
>>
>> However, OTOH, if we check xfs_bmbt_to_iomap(), it does set all or close to
>> all members of struct iomap, so we are just continuing that trend, i.e. it
>> is the job of the FS callback to set all these members.
>>
>>>
>>> 	u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);
>>>
>>>>    	loff_t length = iomap_length(iter);
>>>>    	loff_t pos = iter->pos;
>>>>    	blk_opf_t bio_opf;
>>>> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>>>    	int nr_pages, ret = 0;
>>>>    	size_t copied = 0;
>>>>    	size_t orig_count;
>>>> +	unsigned int pad;
>>>>    	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>>>    	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>>> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>>>    	if (need_zeroout) {
>>>>    		/* zero out from the start of the block to the write offset */
>>>> -		pad = pos & (fs_block_size - 1);
>>>> +		if (is_power_of_2(io_block_size)) {
>>>> +			pad = pos & (io_block_size - 1);
>>>> +		} else {
>>>> +			loff_t _pos = pos;
>>>> +
>>>> +			pad = do_div(_pos, io_block_size);
>>>> +		}
>>> Please don't opencode this twice.
>>>
>>> static unsigned int offset_in_block(loff_t pos, u64 blocksize)
>>> {
>>> 	if (likely(is_power_of_2(blocksize)))
>>> 		return pos & (blocksize - 1);
>>> 	return do_div(pos, blocksize);
>>> }
>>
>> ok, fine
>>
>>>
>>> 		pad = offset_in_block(pos, io_block_size);
>>> 		if (pad)
>>> 			...
>>>
>>> Also, what happens if pos-pad points to a byte before the mapping?
>>
>> It's the job of the FS to map in something aligned to io_block_size. Having
>> said that, I don't think we are doing that for XFS (which sets io_block_size
>>> i_block_size(inode)), so I need to check that.
> 
> <nod>  You can only play with the mapping that the fs gave you.
> If xfs doesn't give you a big enough mapping, then that's a programming
> bug to WARN_ON_ONCE about and return EIO.


I think that this is pretty easy to solve by just ensuring that for an 
atomic write inode, the call xfs_direct_write_iomap_being() -> 
xfs_bmapi_read(bno, len) is such that bno and len are extent size aligned.

>>
>> Naming is hard. Essentally we are trying to reuse the sub-fs block zeroing
>> code for sub-extent granule writes. More below.
> 
> Yeah.  For sub-fsblock zeroing we issue (chained) bios to write zeroes
> to the sectors surrounding the part we're actually writing, then we're
> issuing the write itself, and finally the ioend converts the mapping to
> unwritten.
> 
> For untorn writes we're doing the same thing, but now on the level of
> multiple fsblocks.  I guess this is all going to support a
> 
> 
> <nod> "IO granularity" ?  For untorn writes I guess you want mappings
> that are aligned to a supported untorn write granularity; for bs > ps
> filesystems I guess the IO granularity is

For LBS, it's still going to be inode block size.


>>>
>>> <still confused about why we need to do this, maybe i'll figure it out
>>> as I go along>
>>
>> This zeroing is just really required for atomic writes. The purpose is to
>> zero the extent granule for any write within a newly allocated granule.
>>
>> Consider we have uuWu, above. If the user then attempts to write the full
>> 16K as an atomic write, the iomap iter code would generate writes for sizes
>> 8k, 4k, and 4k, i.e. not a single 16K write. This is not acceptable. So the
>> idea is to zero the full extent granule when allocated, so we have ZZWZ
>> after the 4k write at offset 8192, above. As such, if we then attempt this
>> 16K atomic write, we get a single 16K BIO, i.e. there is no unwritten extent
>> conversion.
> 
> Wait, are we issuing zeroing writes for 0-8191 and 12288-16383, then
> issuing a single atomic write for 0-16383? 

When we have uuuu and attempt the first 4k write @ offset 4k, we also 
issue zeroes for 0-8191 and 12288-16383.

But this is done synchronously.  We are leveraging the existing code to 
issue the write with the exclusive IOLOCK in 
xfs_file_dio_write_unaligned(), so no one else can access that data 
while we do that initial write+zeroing to the extent.

> That won't work, because all
> the bios attached to an iomap_dio are submitted and execute
> asynchronously.  I think you need ->iomap_begin to do XFS_BMAPI_ZERO
> allocations if the writes aren't aligned to the minimum untorn write
> granularity.
> 
>> I am not sure if we should be doing this only for atomic writes inodes, or
>> also forcealign only or RT.
> 
> I think it only applies to untorn writes because the default behavior
> everywhere is is that writes can tear.
> 

ok, fine.

Thanks,
John