[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <66204b26-7a8a-4c21-8728-aa3c5af5822c@oracle.com>
Date: Thu, 7 Mar 2024 12:57:06 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: djwong@...nel.org, hch@....de, viro@...iv.linux.org.uk, brauner@...nel.org,
jack@...e.cz, chandan.babu@...cle.com, axboe@...nel.dk,
martin.petersen@...cle.com, linux-kernel@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
tytso@....edu, jbongio@...gle.com, ojaswin@...ux.ibm.com,
ritesh.list@...il.com, linux-block@...r.kernel.org
Subject: Re: [PATCH v2 08/14] fs: xfs: iomap: Sub-extent zeroing
>>
>> if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
>> return xfs_alert_fsblock_zero(ip, imap);
>> @@ -134,6 +135,8 @@ xfs_bmbt_to_iomap(
>>
>> iomap->validity_cookie = sequence_cookie;
>> iomap->folio_ops = &xfs_iomap_folio_ops;
>> + if (extsz > 1)
>> + iomap->extent_shift = ffs(extsz) - 1;
>
> iomap->extent_size = mp->m_bsize;
> if (xfs_inode_has_force_align(ip))
> iomap->extent_size *= ip->i_extsize;
ok, fine
>
>> @@ -563,11 +566,19 @@ xfs_iomap_write_unwritten(
>> xfs_fsize_t i_size;
>> uint resblks;
>> int error;
>> + xfs_extlen_t extsz = xfs_get_extsz(ip);
>>
>> trace_xfs_unwritten_convert(ip, offset, count);
>>
>> - offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> - count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + if (extsz > 1) {
>> + xfs_extlen_t extsize_bytes = XFS_FSB_TO_B(mp, extsz);
>> +
>> + offset_fsb = XFS_B_TO_FSBT(mp, round_down(offset, extsize_bytes));
>> + count_fsb = XFS_B_TO_FSB(mp, round_up(offset + count, extsize_bytes));
>> + } else {
>> + offset_fsb = XFS_B_TO_FSBT(mp, offset);
>> + count_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)offset + count);
>> + }
>
> I don't think this is correct. We should only be converting the
> extent when the entire range has had data written to it. If we are
> doing unaligned writes, we end up running 3 separate unwritten
> conversion transactions - the leading zeroing, the data written and
> the trailing zeroing.
Then I missed that in the code.
For sub-FS block conversion, I thought that this was doing the complete
FS blocks conversion, including for the head and tail zeros. And now for
sub-extent writes, we would be similarly doing the full extent
conversion, including head and tail zeros.
>
> This will end up converting the entire range to written when the
> leading zeroing completes, exposing stale data until the data and
> trailing zeroing completes.
That would not be good.
>
> Concurrent reads (both DIO and buffered) can see this stale data
> while the write is in progress, leading to a mechanism where a user
> can issue sub-atomic write range IO and concurrent overlapping reads
> to read arbitrary stale data from the disk just before it is
> overwritten.
>
> I suspect the only way to fix this for sub-force-aligned DIo writes
> if for iomap_dio_bio_iter() to chain the zeroing and data bios so
> the entire range gets a single completion run on it instead of three
> separate sub-aligned extent IO completions. We only need to do this
> in the zeroing case - this is already the DIo slow path, so
> additional submission overhead is not an issue. It would, however,
> reduce completion overhead and latency, as we only need to run a
> single extent conversion instead of 3, so chaining the bios on
> aligned writes may well be a net win...
>
ok, I'll check that idea.
> Thoughts? Christoph probably needs to weigh in on this one...
>
ok
Cheers,
John
Powered by blists - more mailing lists