Message-ID: <a7caf7f2-837d-4cfd-afd0-123a99f6fee5@oracle.com>
Date: Thu, 13 Jun 2024 12:13:45 +0100
From: John Garry <john.g.garry@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: axboe@...nel.dk, tytso@....edu, dchinner@...hat.com,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.com,
chandan.babu@...cle.com, hch@....de, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-btrfs@...r.kernel.org,
linux-erofs@...ts.ozlabs.org, linux-ext4@...r.kernel.org,
linux-f2fs-devel@...ts.sourceforge.net, linux-fsdevel@...r.kernel.org,
gfs2@...ts.linux.dev, linux-xfs@...r.kernel.org,
catherine.hoang@...cle.com, ritesh.list@...il.com, mcgrof@...nel.org,
mikulas@...ax.karlin.mff.cuni.cz, agruenba@...hat.com,
miklos@...redi.hu, martin.petersen@...cle.com
Subject: Re: [PATCH v4 03/22] xfs: Use extent size granularity for
iomap->io_block_size
On 12/06/2024 22:47, Darrick J. Wong wrote:
> On Fri, Jun 07, 2024 at 02:39:00PM +0000, John Garry wrote:
>> Currently iomap->io_block_size is set to the i_blocksize() value for the
>> inode.
>>
>> Expand the sub-fs block size zeroing to now cover RT extents, by
>> setting iomap->io_block_size to xfs_inode_alloc_unitsize().
>>
>> In xfs_iomap_write_unwritten(), update the unwritten range fsb to cover
>> this extent granularity.
>>
>> In xfs_file_dio_write(), handle a write which is not aligned to extent
>> size granularity as unaligned. Since the extent size granularity need not
>> be a power-of-2, handle this also.
>>
>> Signed-off-by: John Garry <john.g.garry@...cle.com>
>> ---
>> fs/xfs/xfs_file.c | 24 +++++++++++++++++++-----
>> fs/xfs/xfs_inode.c | 17 +++++++++++------
>> fs/xfs/xfs_inode.h | 1 +
>> fs/xfs/xfs_iomap.c | 8 +++++++-
>> 4 files changed, 38 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index b240ea5241dc..24fe3c2e03da 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -601,7 +601,7 @@ xfs_file_dio_write_aligned(
>> }
>>
>> /*
>> - * Handle block unaligned direct I/O writes
>> + * Handle unaligned direct IO writes.
>> *
>> * In most cases direct I/O writes will be done holding IOLOCK_SHARED, allowing
>> * them to be done in parallel with reads and other direct I/O writes. However,
>> @@ -630,9 +630,9 @@ xfs_file_dio_write_unaligned(
>> ssize_t ret;
>>
>> /*
>> - * Extending writes need exclusivity because of the sub-block zeroing
>> - * that the DIO code always does for partial tail blocks beyond EOF, so
>> - * don't even bother trying the fast path in this case.
>> + * Extending writes need exclusivity because of the sub-block/extent
>> + * zeroing that the DIO code always does for partial tail blocks
>> + * beyond EOF, so don't even bother trying the fast path in this case.
>
> Hummm. So let's say the fsblock size is 4k, the rt extent size is 16k,
> and you want to write bytes 8192-12287 of a file. Currently we'd use
> xfs_file_dio_write_aligned for that, but now we'd use
> xfs_file_dio_write_unaligned? Even though we don't need zeroing or any
> of that stuff?
Right, this is something which I mentioned in my response to the previous
patch.
I am unsure whether we should do this only for atomic-write inodes, or also
for RT and forcealign-only inodes.
I got the impression from Dave's review of the previous version of this
series that it should include RT and forcealign-only inodes.
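To make the example above concrete, here is a minimal standalone sketch
(userspace C, not the kernel code) of the alignment test under discussion.
The sketch_* names are hypothetical, and it assumes the allocation unit size
is passed in bytes. The mask-or-modulo split mirrors the is_power_of_2()
branch in the quoted hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's is_power_of_2(). */
static bool sketch_is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Return true if both pos and count are multiples of unitsize (bytes).
 * A mask is used for power-of-2 unit sizes, a modulo otherwise.
 * A unit size of 0 is treated as "never aligned" in this sketch.
 */
static bool sketch_dio_aligned(uint64_t pos, uint64_t count, uint64_t unitsize)
{
	if (unitsize == 0)
		return false;
	if (sketch_is_pow2(unitsize))
		return ((pos | count) & (unitsize - 1)) == 0;
	return pos % unitsize == 0 && count % unitsize == 0;
}
```

With a 4k fsblock and a 16k rt extent, the write of bytes 8192-12287
(pos 8192, count 4096) is aligned when the unit is 4096 but unaligned when
the unit is 16384, which is the change in behaviour being asked about.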
>
>> */
>> if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>> if (iocb->ki_flags & IOCB_NOWAIT)
>> @@ -698,11 +698,25 @@ xfs_file_dio_write(
>> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
>> struct xfs_buftarg *target = xfs_inode_buftarg(ip);
>> size_t count = iov_iter_count(from);
>> + bool unaligned;
>> + u64 unitsize;
>>
>> /* direct I/O must be aligned to device logical sector size */
>> if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>> return -EINVAL;
>> - if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>> +
>> + unitsize = xfs_inode_alloc_unitsize(ip);
>> + if (!is_power_of_2(unitsize)) {
>> + if (isaligned_64(iocb->ki_pos, unitsize) &&
>> + isaligned_64(count, unitsize))
>> + unaligned = false;
>> + else
>> + unaligned = true;
>> + } else {
>> + unaligned = (iocb->ki_pos | count) & (unitsize - 1);
>> + }
>
> Didn't I already write this?
It's from xfs_is_falloc_aligned(). Let's reuse that fully here. I did
look at doing that before, though...
>
>> + if (unaligned)
>
> if (!xfs_is_falloc_aligned(ip, iocb->ki_pos, count))
>
>> return xfs_file_dio_write_unaligned(ip, iocb, from);
>> return xfs_file_dio_write_aligned(ip, iocb, from);
>> }
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 58fb7a5062e1..93ad442f399b 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -4264,15 +4264,20 @@ xfs_break_layouts(
>> return error;
>> }
>>
>> -/* Returns the size of fundamental allocation unit for a file, in bytes. */
>
> Don't delete the comment, it has useful return type information.
It wasn't deleted; it is still below.
>
> /*
> * Returns the size of fundamental allocation unit for a file, in
> * fsblocks.
> */
>
>> unsigned int
>> -xfs_inode_alloc_unitsize(
>> +xfs_inode_alloc_unitsize_fsb(
>> struct xfs_inode *ip)
>> {
>> - unsigned int blocks = 1;
>> -
>> if (XFS_IS_REALTIME_INODE(ip))
>> - blocks = ip->i_mount->m_sb.sb_rextsize;
>> + return ip->i_mount->m_sb.sb_rextsize;
>> +
>> + return 1;
>> +}
>>
>> - return XFS_FSB_TO_B(ip->i_mount, blocks);
>> +/* Returns the size of fundamental allocation unit for a file, in bytes. */
>> +unsigned int
>> +xfs_inode_alloc_unitsize(
>> + struct xfs_inode *ip)
>> +{
>> + return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_unitsize_fsb(ip));
>> }
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index 292b90b5f2ac..90d2fa837117 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -643,6 +643,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
>> bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
>> void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
>> xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
>> +unsigned int xfs_inode_alloc_unitsize_fsb(struct xfs_inode *ip);
>> unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
>>
>> struct xfs_dir_update_params {
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index ecb4cae88248..fbe69f747e30 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -127,7 +127,7 @@ xfs_bmbt_to_iomap(
>> }
>> iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
>> iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
>> - iomap->io_block_size = i_blocksize(VFS_I(ip));
>> + iomap->io_block_size = xfs_inode_alloc_unitsize(ip);
>
> Oh, I see. So io_block_size causes iomap to write zeroes to the storage
> backing surrounding areas of the file range.
Yes
> In this case, for direct
> writes to the unwritten middle 4k of an otherwise written 16k extent,
> we'll write zeroes to 0-4k and 8k-16k even though that wasn't what the
> caller asked for?
We would only do that for a newly allocated extent. We should not
overwrite existing data.
>
> IOWs, if you start with:
>
> WWuW
>
> write to the "U", then it'll write zeroes to the "W" areas? That
> doesn't sound good...
No, that definitely should not happen.
We would zero only once, when we do a sub-extent-granule write to an
unallocated extent.
In iomap_dio_bio_iter(), we only zero for IOMAP_UNWRITTEN or IOMAP_F_NEW.
>
>> if (mapping_flags & IOMAP_DAX)
>> iomap->dax_dev = target->bt_daxdev;
>> else
>> @@ -577,11 +577,17 @@ xfs_iomap_write_unwritten(
>> xfs_fsize_t i_size;
>> uint resblks;
>> int error;
>> + unsigned int rounding;
>>
>> trace_xfs_unwritten_convert(ip, offset, count);
>>
>> offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + rounding = xfs_inode_alloc_unitsize_fsb(ip);
>> + if (rounding > 1) {
>> + offset_fsb = rounddown_64(offset_fsb, rounding);
>> + count_fsb = roundup_64(count_fsb, rounding);
>> + }
>
> ...and then the ioend handler is supposed to be smart enough to know
> that iomap quietly wrote to other parts of the disk.
iomap_io_complete() only knows about the written data, not the zeroing. I
am not really changing that.
>
> Um, does this cause unwritten extent conversion for entire rtextents
> after writeback to a rtextsize > 1fsb file?
Yes.
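As a rough illustration of that rounding, here is a standalone sketch
(userspace C, not the kernel code). The sketch_* names are hypothetical. It
assumes that count_fsb at that point in the hunk holds the end block of the
range (offset + count, rounded up to a block), which is how the quoted
expression reads, and that blocksize is non-zero.

```c
#include <stdint.h>

/* Half-open range of filesystem blocks: [start_fsb, end_fsb). */
struct sketch_fsb_range {
	uint64_t start_fsb;
	uint64_t end_fsb;
};

/*
 * Convert a byte range to blocks, then widen it to whole allocation
 * units of unit_fsb blocks. unit_fsb need not be a power of 2.
 */
static struct sketch_fsb_range
sketch_unwritten_range(uint64_t offset, uint64_t count,
		       uint32_t blocksize, uint32_t unit_fsb)
{
	struct sketch_fsb_range r;

	/* Truncate the start, round the end up to a block boundary. */
	r.start_fsb = offset / blocksize;
	r.end_fsb = (offset + count + blocksize - 1) / blocksize;

	if (unit_fsb > 1) {
		/* Round the start down and the end up to the unit size. */
		r.start_fsb -= r.start_fsb % unit_fsb;
		r.end_fsb = (r.end_fsb + unit_fsb - 1) / unit_fsb * unit_fsb;
	}
	return r;
}
```

For a 4096-byte write at offset 8192 with 4k blocks and an rt extent of 4
blocks, the block range [2, 3) widens to [0, 4), so the whole rt extent is
covered by the conversion.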
>
> Or am I really misunderstanding what's going on here with the io paths?
Thanks,
John