[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <644afc72-71bb-4201-8829-ccf3211d68b7@oracle.com>
Date: Mon, 24 Jun 2024 14:58:40 +0100
From: John Garry <john.g.garry@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: axboe@...nel.dk, tytso@....edu, dchinner@...hat.com,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.com,
chandan.babu@...cle.com, hch@....de, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-btrfs@...r.kernel.org,
linux-erofs@...ts.ozlabs.org, linux-ext4@...r.kernel.org,
linux-f2fs-devel@...ts.sourceforge.net, linux-fsdevel@...r.kernel.org,
gfs2@...ts.linux.dev, linux-xfs@...r.kernel.org,
catherine.hoang@...cle.com, ritesh.list@...il.com, mcgrof@...nel.org,
mikulas@...ax.karlin.mff.cuni.cz, agruenba@...hat.com,
miklos@...redi.hu, martin.petersen@...cle.com
Subject: Re: [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing
size
On 21/06/2024 22:18, Darrick J. Wong wrote:
> On Thu, Jun 13, 2024 at 11:31:35AM +0100, John Garry wrote:
>> On 12/06/2024 22:32, Darrick J. Wong wrote:
>>>> unsigned int fs_block_size = i_blocksize(inode), pad;
>>>> + u64 io_block_size = iomap->io_block_size;
>>> I wonder, should iomap be nice and not require filesystems to set
>>> io_block_size themselves unless they really need it?
>>
>> That's what I had in v3, like:
>>
>> if (iomap->io_block_size)
>> io_block_size = iomap->io_block_size;
>> else
>> io_block_size = i_block_size(inode)
>>
>> but it was suggested to change that (to like what I have here).
>
> oh, ok. Ignore that comment, then. :)
>
To be clear, Dave actually suggested that change.
>>> Anyone working on
>>> an iomap port while this patchset is in progress may or may not remember
>>> to add this bit if they get their port merged after atomicwrites is
>>> merged; and you might not remember to prevent the bitrot if the reverse
>>> order happens.
>>
>> Sure, I get your point.
>>
>> However, OTOH, if we check xfs_bmbt_to_iomap(), it does set all or close to
>> all members of struct iomap, so we are just continuing that trend, i.e. it
>> is the job of the FS callback to set all these members.
>>
>>>
>>> u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);
>>>
>>>> loff_t length = iomap_length(iter);
>>>> loff_t pos = iter->pos;
>>>> blk_opf_t bio_opf;
>>>> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>>> int nr_pages, ret = 0;
>>>> size_t copied = 0;
>>>> size_t orig_count;
>>>> + unsigned int pad;
>>>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>>> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>>> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>>> if (need_zeroout) {
>>>> /* zero out from the start of the block to the write offset */
>>>> - pad = pos & (fs_block_size - 1);
>>>> + if (is_power_of_2(io_block_size)) {
>>>> + pad = pos & (io_block_size - 1);
>>>> + } else {
>>>> + loff_t _pos = pos;
>>>> +
>>>> + pad = do_div(_pos, io_block_size);
>>>> + }
>>> Please don't opencode this twice.
>>>
>>> static unsigned int offset_in_block(loff_t pos, u64 blocksize)
>>> {
>>> if (likely(is_power_of_2(blocksize)))
>>> return pos & (blocksize - 1);
>>> return do_div(pos, blocksize);
>>> }
>>
>> ok, fine
>>
>>>
>>> pad = offset_in_block(pos, io_block_size);
>>> if (pad)
>>> ...
>>>
>>> Also, what happens if pos-pad points to a byte before the mapping?
>>
>> It's the job of the FS to map in something aligned to io_block_size. Having
>> said that, I don't think we are doing that for XFS (which sets io_block_size
>>> i_block_size(inode)), so I need to check that.
>
> <nod> You can only play with the mapping that the fs gave you.
> If xfs doesn't give you a big enough mapping, then that's a programming
> bug to WARN_ON_ONCE about and return EIO.
I think that this is pretty easy to solve by just ensuring that for an
atomic write inode, the call xfs_direct_write_iomap_being() ->
xfs_bmapi_read(bno, len) is such that bno and len are extent size aligned.
>>
>> Naming is hard. Essentally we are trying to reuse the sub-fs block zeroing
>> code for sub-extent granule writes. More below.
>
> Yeah. For sub-fsblock zeroing we issue (chained) bios to write zeroes
> to the sectors surrounding the part we're actually writing, then we're
> issuing the write itself, and finally the ioend converts the mapping to
> unwritten.
>
> For untorn writes we're doing the same thing, but now on the level of
> multiple fsblocks. I guess this is all going to support a
>
>
> <nod> "IO granularity" ? For untorn writes I guess you want mappings
> that are aligned to a supported untorn write granularity; for bs > ps
> filesystems I guess the IO granularity is
For LBS, it's still going to be inode block size.
>>>
>>> <still confused about why we need to do this, maybe i'll figure it out
>>> as I go along>
>>
>> This zeroing is just really required for atomic writes. The purpose is to
>> zero the extent granule for any write within a newly allocated granule.
>>
>> Consider we have uuWu, above. If the user then attempts to write the full
>> 16K as an atomic write, the iomap iter code would generate writes for sizes
>> 8k, 4k, and 4k, i.e. not a single 16K write. This is not acceptable. So the
>> idea is to zero the full extent granule when allocated, so we have ZZWZ
>> after the 4k write at offset 8192, above. As such, if we then attempt this
>> 16K atomic write, we get a single 16K BIO, i.e. there is no unwritten extent
>> conversion.
>
> Wait, are we issuing zeroing writes for 0-8191 and 12288-16383, then
> issuing a single atomic write for 0-16383?
When we have uuuu and attempt the first 4k write @ offset 4k, we also
issue zeroes for 0-8191 and 12288-16383.
But this is done synchronously. We are leveraging the existing code to
issue the write with the exclusive IOLOCK in
xfs_file_dio_write_unaligned(), so no one else can access that data
while we do that initial write+zeroing to the extent.
> That won't work, because all
> the bios attached to an iomap_dio are submitted and execute
> asynchronously. I think you need ->iomap_begin to do XFS_BMAPI_ZERO
> allocations if the writes aren't aligned to the minimum untorn write
> granularity.
>
>> I am not sure if we should be doing this only for atomic writes inodes, or
>> also forcealign only or RT.
>
> I think it only applies to untorn writes because the default behavior
> everywhere is is that writes can tear.
>
ok, fine.
Thanks,
John
Powered by blists - more mailing lists