[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f569d971-222a-4824-b5fe-2e0d8dc400cc@oracle.com>
Date: Tue, 5 Mar 2024 15:22:52 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: djwong@...nel.org, hch@....de, viro@...iv.linux.org.uk, brauner@...nel.org,
jack@...e.cz, chandan.babu@...cle.com, axboe@...nel.dk,
martin.petersen@...cle.com, linux-kernel@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
tytso@....edu, jbongio@...gle.com, ojaswin@...ux.ibm.com,
ritesh.list@...il.com, linux-block@...r.kernel.org
Subject: Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the
'forcealign' flag
On 05/03/2024 00:44, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
>> From: "Darrick J. Wong" <djwong@...nel.org>
>>
>> The existing extsize hint code already did the work of expanding file
>> range mapping requests so that the range is aligned to the hint value.
>> Now add the code we need to guarantee that the space allocations are
>> also always aligned.
>>
>> XXX: still need to check all this with reflink
>>
>> Signed-off-by: "Darrick J. Wong" <djwong@...nel.org>
>> Co-developed-by: John Garry <john.g.garry@...cle.com>
>> Signed-off-by: John Garry <john.g.garry@...cle.com>
>> ---
>> fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>> fs/xfs/xfs_iomap.c | 4 +++-
>> 2 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 60d100134280..8dee60795cf4 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>> align = xfs_get_cowextsz_hint(ap->ip);
>> else if (ap->datatype & XFS_ALLOC_USERDATA)
>> align = xfs_get_extsz_hint(ap->ip);
>> +
>> + /*
>> + * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
>> + * set as forcealign and cowextsz_hint are mutually exclusive
>> + */
>> + if (xfs_inode_forcealign(ap->ip) && align) {
>> + args->alignment = align;
>> + if (stripe_align % align)
>> + stripe_align = align;
>
> This kinda does the right thing, but it strikes me as the wrong
> thing to be doing - it seems to conflates two different physical
> alignment properties in potentially bad ways, and we should never
> get this far into the allocator to discover these issues.
>
> Stripe alignment is alignment to physical disks in a RAID array.
> Inode forced alignment is file offset alignment to specificly
> defined LBA address ranges. The stripe unit/width is set at mkfs
> time, the inode extent size hint at runtime.
>
> These can only work together in specific situations:
>
> 1. max sized RWF_ATOMIC IO must be the same or smaller size
> than the stripe unit. Alternatively: stripe unit must be
> an integer (power of 2?) multiple of max atomic IO size.
Sure, it would not make sense to have max sized RWF_ATOMIC IO > stripe
alignment.
>
> IOWs, stripe unit vs atomic IO constraints must be
> resolved in a compatible manner at mkfs time. If they
> can't be resolved (e.g. non-power of 2 stripe unit) then
> atomic IO cannot be done on the filesystem at all. Stripe
> unit constraints always win over atomic IO.
We can could do that. Indeed, I thought our xfsprogs was doing that, but
we have had a few versions now for forcealign ...
>
> 2. min sized RWF_ATOMIC IO must be an integer divider of
> stripe unit so that small atomic IOs are always correctly
> aligned to the stripe unit.
That's a given from atomic write rules and point 1.
>
> 3. Inode forced alignment constraints set at runtime (i.e.
> extsize hints) must fit within the above stripe unit vs
> min/max atomic IO requirements.
> > i.e. extent size hint will typically need to an integer
> multiple of stripe unit size which will always be
> compatible with stripe unit and atomic IO size and
> alignments...
I think that any extsize hint for forcealign needs to comply with stripe
unit, same as point 1.
>
> IOWs, these are static geometry constraints and so should be checked
> and rejected at the point where alignments are specified (i.e.
> mkfs, mount and ioctls). Then the allocator can simply assume that
> forced inode alignments are always stripe alignment compatible and
> we don't need separate handling of two possibly incompatible
> alignments.
ok, makes sense.
Please note in case missed, I am mandating extsize hint for forcealign
needs to be a power-of-2. It just makes life easier for all the
sub-extent zero'ing later on.
Also we need to enforce that the AG count to be compliant with the
extsize hint for forcealign; but since the extsize hint for forcealign
needs to be compliant with stripe unit, above, and stripe unit would be
compliant wth AG count (right?), then this would be a given.
>
> Regardless, I think the code as it stands won't work correctly when
> a stripe unit is not set.
>
> The correct functioning of this code appears to be dependent on
> stripe_align being set >= the extent size hint. If stripe align is
> not set, then what does (0 % align) return? If my checks are
> correct, this will return 0,
Correct
> and so the above code will end up with
> stripe_align = 0.
>
> Now, consider that args->alignment is used to signal to the
> allocator what the -start alignment- of the allocation is supposed
> to be, whilst args->prod/args->mod are used to tell the allocator
> what the tail adjustment is supposed to be.
>
> Stripe alignment wants start alignment, extent size hints want tail
> adjustment and force aligned allocation wants both start alignment
> and tail adjustment.
Right
>
> However, the allocator overrides these depending on what it is
> doing. e.g. exact block EOF allocation turns off start alignment but
> has to ensure space is reserved for start alignment if it fails.
> This reserves space for stripe_align in args->minalignslop, but if
> force alignment has a start alignment requirement larger than stripe
> unit alignment, this code fails to reserve the necessary alignment
> slop for the force alignment code.
>
> If we aren't doing exact block EOF allocation, then we do stripe
> unit alignment. Again, if this fails and the forced alignment is
> larger than stripe alignment, this code does not reserve space for
> the larger alignment in args->minalignslop, so it will also do the
> wrong when we fall back to an args->alignment > 1 allocation.
>
> Failing to correctly account for minalignslop is known to cause
> accounting problems when the AG is very near empty, and the usual
> result of that is cancelling of a dirty allocation transaction and a
> forced shutdown. So this is something we need to get right.
For sure
>
> More below....
>
>> + } else {
>> + args->alignment = 1;
>> + }
>
> Just initialise the allocation args structure with a value of 1 like
> we already do?
It was being done in this way to have just a single place where the
value is initialised. It can easily be kept as is.
>
>> +
>> if (align) {
>> if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>> ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>> args.minlen = args.maxlen = ap->minlen;
>> args.total = ap->total;
>>
>> - args.alignment = 1;
>> args.minalignslop = 0;
>
> This likely breaks the debug code. This isn't intended for testing
> atomic write capability, it's for exercising low space allocation
> algorithms with worst case fragmentation patterns. This is only
> called when error injection is turned on to trigger this path, so we
> shouldn't need to support forced IO alignment here.
ok, it can be left untouched.
>
>>
>> args.minleft = ap->minleft;
>> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>> {
>> struct xfs_mount *mp = args->mp;
>> struct xfs_perag *caller_pag = args->pag;
>> + int orig_alignment = args->alignment;
>> int error;
>>
>> /*
>> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>>
>> /*
>> * Allocation failed, so turn return the allocation args to their
>> - * original non-aligned state so the caller can proceed on allocation
>> - * failure as if this function was never called.
>> + * original state so the caller can proceed on allocation failure as
>> + * if this function was never called.
>> */
>> - args->alignment = 1;
>> + args->alignment = orig_alignment;
>> return 0;
>> }
>
> As I said above, we can't set an alignment of > 1 here if we haven't
> accounted for that alignment in args->minalignslop above. This leads
> to unexpected ENOSPC conditions and filesystem shutdowns.
>
> I suspect what we need to do is get rid of the separate stripe_align
> variable altogether and always just set args->alignment to what we
> need the extent start alignment to be, regardless of whether it is
> from stripe alignment or forced alignment.
ok, it sounds a bit simpler at least
>
> Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> 'stripe_align' is - the exact EOF block allocation can simply save
> and restore the args->alignment value and use it for minalignslop
> calculations for the initial exact block allocation.
>
> Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> is true, we can abort allocation immediately, and not bother to fall
> back on further aligned/unaligned attempts that will also fail or do
> the wrong them.
ok
>
> Similarly, if we aren't doing EOF allocation, having args->alignment
> set means it will do the right thing for the first allocation
> attempt. Again, if that fails, we can check if
> xfs_inode_forcealign() is true and fail the aligned allocation
> instead of running the low space algorithm. This now makes it clear
> that we're failing the allocation because of the forced alignment
> requirement, and now the low space allocation code can explicitly
> turn off start alignment as it isn't required...
are you saying that low-space allocator can set args->alignment = 1 to
be explicit?
>
>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..70fe873951f3 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>> * If mounted with the "-o swalloc" option the alignment is
>> * increased from the strip unit size to the stripe width.
>> */
>> - if (mp->m_swidth && xfs_has_swalloc(mp))
>> + if (xfs_inode_forcealign(ip))
I actually thought that I dropped this chunk as it was causing alignment
issues. I need to check that again.
>> + align = xfs_get_extsz_hint(ip);
>> + else if (mp->m_swidth && xfs_has_swalloc(mp))
>> align = mp->m_swidth;
>> else if (mp->m_dalign)
>> align = mp->m_dalign;
>
> This doesn't work for files with a current size of less than
> "align". The next line of code does:
>
> if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
> align = 0;
>
> IOWs, the first aligned write allocation will occur with a file size
> of zero, and that will result in this function returning 0. i.e.
> speculative post EOF delalloc doesn't turn on and align the EOF
> until writes up to offset == align have already been completed.
Maybe it was working as I have the change to keep EOF aligned for
forcealign. I'll check that.
>
> Essentially, this code wants to be:
>
> if (xfs_inode_forcealign(ip))
> return xfs_get_extsz_hint(ip);
>
> So that any write to the a force aligned inode will always trigger
> extent size hint EOF alignment.
>
ok, sounds reasonable.
Thanks,
John
Powered by blists - more mailing lists