linux-kernel - Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f569d971-222a-4824-b5fe-2e0d8dc400cc@oracle.com>
Date: Tue, 5 Mar 2024 15:22:52 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: djwong@...nel.org, hch@....de, viro@...iv.linux.org.uk, brauner@...nel.org,
        jack@...e.cz, chandan.babu@...cle.com, axboe@...nel.dk,
        martin.petersen@...cle.com, linux-kernel@...r.kernel.org,
        linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        tytso@....edu, jbongio@...gle.com, ojaswin@...ux.ibm.com,
        ritesh.list@...il.com, linux-block@...r.kernel.org
Subject: Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the
 'forcealign' flag

On 05/03/2024 00:44, Dave Chinner wrote:
> On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
>> From: "Darrick J. Wong" <djwong@...nel.org>
>>
>> The existing extsize hint code already did the work of expanding file
>> range mapping requests so that the range is aligned to the hint value.
>> Now add the code we need to guarantee that the space allocations are
>> also always aligned.
>>
>> XXX: still need to check all this with reflink
>>
>> Signed-off-by: "Darrick J. Wong" <djwong@...nel.org>
>> Co-developed-by: John Garry <john.g.garry@...cle.com>
>> Signed-off-by: John Garry <john.g.garry@...cle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>>   fs/xfs/xfs_iomap.c       |  4 +++-
>>   2 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 60d100134280..8dee60795cf4 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
>>   		align = xfs_get_cowextsz_hint(ap->ip);
>>   	else if (ap->datatype & XFS_ALLOC_USERDATA)
>>   		align = xfs_get_extsz_hint(ap->ip);
>> +
>> +	/*
>> +	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
>> +	 * set as forcealign and cowextsz_hint are mutually exclusive
>> +	 */
>> +	if (xfs_inode_forcealign(ap->ip) && align) {
>> +		args->alignment = align;
>> +		if (stripe_align % align)
>> +			stripe_align = align;
> 
> This kinda does the right thing, but it strikes me as the wrong
> thing to be doing - it seems to conflates two different physical
> alignment properties in potentially bad ways, and we should never
> get this far into the allocator to discover these issues.
> 
> Stripe alignment is alignment to physical disks in a RAID array.
> Inode forced alignment is file offset alignment to specificly
> defined LBA address ranges.  The stripe unit/width is set at mkfs
> time, the inode extent size hint at runtime.
> 
> These can only work together in specific situations:
> 
> 	1. max sized RWF_ATOMIC IO must be the same or smaller size
> 	   than the stripe unit. Alternatively: stripe unit must be
> 	   an integer (power of 2?) multiple of max atomic IO size.

Sure, it would not make sense to have max sized RWF_ATOMIC IO > stripe 
alignment.

> 
> 	   IOWs, stripe unit vs atomic IO constraints must be
> 	   resolved in a compatible manner at mkfs time. If they
> 	   can't be resolved (e.g. non-power of 2 stripe unit) then
> 	   atomic IO cannot be done on the filesystem at all. Stripe
> 	   unit constraints always win over atomic IO.

We can could do that. Indeed, I thought our xfsprogs was doing that, but 
we have had a few versions now for forcealign ...

> 
> 	2. min sized RWF_ATOMIC IO must be an integer divider of
> 	   stripe unit so that small atomic IOs are always correctly
> 	   aligned to the stripe unit.

That's a given from atomic write rules and point 1.

> 
> 	3. Inode forced alignment constraints set at runtime (i.e.
> 	   extsize hints) must fit within the above stripe unit vs
> 	   min/max atomic IO requirements.
>  > 	   i.e. extent size hint will typically need to an integer
> 	   multiple of stripe unit size which will always be
> 	   compatible with stripe unit and atomic IO size and
> 	   alignments...

I think that any extsize hint for forcealign needs to comply with stripe 
unit, same as point 1.

> 
> IOWs, these are static geometry constraints and so should be checked
> and rejected at the point where alignments are specified (i.e.
> mkfs, mount and ioctls). Then the allocator can simply assume that
> forced inode alignments are always stripe alignment compatible and
> we don't need separate handling of two possibly incompatible
> alignments.

ok, makes sense.

Please note in case missed, I am mandating extsize hint for forcealign 
needs to be a power-of-2. It just makes life easier for all the 
sub-extent zero'ing later on.

Also we need to enforce that the AG count to be compliant with the 
extsize hint for forcealign; but since the extsize hint for forcealign 
needs to be compliant with stripe unit, above, and stripe unit would be 
compliant wth AG count (right?), then this would be a given.

> 
> Regardless, I think the code as it stands won't work correctly when
> a stripe unit is not set.
> 
> The correct functioning of this code appears to be dependent on
> stripe_align being set >= the extent size hint. If stripe align is
> not set, then what does (0 % align) return? If my checks are
> correct, this will return 0,

Correct

> and so the above code will end up with
> stripe_align = 0.
> 
> Now, consider that args->alignment is used to signal to the
> allocator what the -start alignment- of the allocation is supposed
> to be, whilst args->prod/args->mod are used to tell the allocator
> what the tail adjustment is supposed to be.
> 
> Stripe alignment wants start alignment, extent size hints want tail
> adjustment and force aligned allocation wants both start alignment
> and tail adjustment.

Right

> 
> However, the allocator overrides these depending on what it is
> doing. e.g. exact block EOF allocation turns off start alignment but
> has to ensure space is reserved for start alignment if it fails.
> This reserves space for stripe_align in args->minalignslop, but if
> force alignment has a start alignment requirement larger than stripe
> unit alignment, this code fails to reserve the necessary alignment
> slop for the force alignment code.
> 
> If we aren't doing exact block EOF allocation, then we do stripe
> unit alignment. Again, if this fails and the forced alignment is
> larger than stripe alignment, this code does not reserve space for
> the larger alignment in args->minalignslop, so it will also do the
> wrong when we fall back to an args->alignment > 1 allocation.
> 
> Failing to correctly account for minalignslop is known to cause
> accounting problems when the AG is very near empty, and the usual
> result of that is cancelling of a dirty allocation transaction and a
> forced shutdown. So this is something we need to get right.

For sure

> 
> More below....
> 
>> +	} else {
>> +		args->alignment = 1;
>> +	}
> 
> Just initialise the allocation args structure with a value of 1 like
> we already do?

It was being done in this way to have just a single place where the 
value is initialised. It can easily be kept as is.

> 
>> +
>>   	if (align) {
>>   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>>   					ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>>   	args.minlen = args.maxlen = ap->minlen;
>>   	args.total = ap->total;
>>   
>> -	args.alignment = 1;
>>   	args.minalignslop = 0;
> 
> This likely breaks the debug code. This isn't intended for testing
> atomic write capability, it's for exercising low space allocation
> algorithms with worst case fragmentation patterns. This is only
> called when error injection is turned on to trigger this path, so we
> shouldn't need to support forced IO alignment here.

ok, it can be left untouched.

> 
>>   
>>   	args.minleft = ap->minleft;
>> @@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
>>   {
>>   	struct xfs_mount	*mp = args->mp;
>>   	struct xfs_perag	*caller_pag = args->pag;
>> +	int			orig_alignment = args->alignment;
>>   	int			error;
>>   
>>   	/*
>> @@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
>>   
>>   	/*
>>   	 * Allocation failed, so turn return the allocation args to their
>> -	 * original non-aligned state so the caller can proceed on allocation
>> -	 * failure as if this function was never called.
>> +	 * original state so the caller can proceed on allocation failure as
>> +	 * if this function was never called.
>>   	 */
>> -	args->alignment = 1;
>> +	args->alignment = orig_alignment;
>>   	return 0;
>>   }
> 
> As I said above, we can't set an alignment of > 1 here if we haven't
> accounted for that alignment in args->minalignslop above. This leads
> to unexpected ENOSPC conditions and filesystem shutdowns.
> 
> I suspect what we need to do is get rid of the separate stripe_align
> variable altogether and always just set args->alignment to what we
> need the extent start alignment to be, regardless of whether it is
> from stripe alignment or forced alignment.

ok, it sounds a bit simpler at least

> 
> Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
> 'stripe_align' is - the exact EOF block allocation can simply save
> and restore the args->alignment value and use it for minalignslop
> calculations for the initial exact block allocation.
> 
> Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
> is true, we can abort allocation immediately, and not bother to fall
> back on further aligned/unaligned attempts that will also fail or do
> the wrong them.

ok

> 
> Similarly, if we aren't doing EOF allocation, having args->alignment
> set means it will do the right thing for the first allocation
> attempt. Again, if that fails, we can check if
> xfs_inode_forcealign() is true and fail the aligned allocation
> instead of running the low space algorithm. This now makes it clear
> that we're failing the allocation because of the forced alignment
> requirement, and now the low space allocation code can explicitly
> turn off start alignment as it isn't required...

are you saying that low-space allocator can set args->alignment = 1 to 
be explicit?

> 
> 
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..70fe873951f3 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>>   		 * If mounted with the "-o swalloc" option the alignment is
>>   		 * increased from the strip unit size to the stripe width.
>>   		 */
>> -		if (mp->m_swidth && xfs_has_swalloc(mp))
>> +		if (xfs_inode_forcealign(ip))

I actually thought that I dropped this chunk as it was causing alignment 
issues. I need to check that again.

>> +			align = xfs_get_extsz_hint(ip);
>> +		else if (mp->m_swidth && xfs_has_swalloc(mp))
>>   			align = mp->m_swidth;
>>   		else if (mp->m_dalign)
>>   			align = mp->m_dalign;
> 
> This doesn't work for files with a current size of less than
> "align". The next line of code does:
> 
> 	if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
> 		align = 0;
> 
> IOWs, the first aligned write allocation will occur with a file size
> of zero, and that will result in this function returning 0. i.e.
> speculative post EOF delalloc doesn't turn on and align the EOF
> until writes up to offset == align have already been completed.

Maybe it was working as I have the change to keep EOF aligned for 
forcealign. I'll check that.

> 
> Essentially, this code wants to be:
> 
> 	if (xfs_inode_forcealign(ip))
> 		return xfs_get_extsz_hint(ip);
> 
> So that any write to the a force aligned inode will always trigger
> extent size hint EOF alignment.
> 

ok, sounds reasonable.

Thanks,
John