linux-kernel - Re: [PATCH v4 00/14] forcealign for xfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZtlQt/7VHbOtQ+gY@dread.disaster.area>
Date: Thu, 5 Sep 2024 16:33:27 +1000
From: Dave Chinner <david@...morbit.com>
To: Ritesh Harjani <ritesh.list@...il.com>
Cc: John Garry <john.g.garry@...cle.com>, chandan.babu@...cle.com,
	djwong@...nel.org, dchinner@...hat.com, hch@....de,
	viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
	linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, catherine.hoang@...cle.com,
	martin.petersen@...cle.com
Subject: Re: [PATCH v4 00/14] forcealign for xfs

On Thu, Sep 05, 2024 at 09:26:25AM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@...morbit.com> writes:
> > On Wed, Sep 04, 2024 at 11:44:29PM +0530, Ritesh Harjani wrote:
> >> 3. It is the FORCEALIGN feature which _mandates_ both allocation
> >> (by using extsize hint) and de-allocation to happen _only_ in
> >> extsize chunks.
> >>
> >>    i.e. forcealign mandates -
> >>    - the logical and physical start offset should be aligned as
> >>    per args->alignment
> >>    - extent length be aligned as per args->prod/mod.
> >>      If above two cannot be satisfied then return -ENOSPC.
> >
> > Yes.
> >
> >> 
> >>    - Does the unmapping of extents also only happens in extsize
> >>    chunks (with forcealign)?
> >
> > Yes, via use of xfs_inode_alloc_unitsize() in the high level code
> > aligning the fsbno ranges to be unmapped.
> >
> > Remember, force align requires both logical file offset and
> > physical block number to be correctly aligned,
> 
> This is where I would like to double confirm it again. Even the
> extsize hint feature (w/o FORCEALIGN) will try to allocate aligned
> physical start and logical start file offset and length right?

No.

> (Or does extsize hint only restricts alignment to logical start file
> offset + length and not the physical start?)

Neither.

extsize hint by itself (i.e. existing behaviour) has no alignment
effect at all. All it affects is -size- of the extent. i.e. once
the extent start is chosen, extent size hints will trim the length
of the extent to a multiple of the extent size hint. Alignment is
not considered at all.

> Also it looks like there is no difference with ATOMIC_WRITE AND
> FORCEALIGN feature with XFS, correct? (except that ATOMIC_WRITE is
> adding additional natural alignment restrictions on pos and len). 

Atomic write requires additional hardware support, and it restricts
the valid sizes of extent size hints that can be set. Only atomic
writes can be done on files marked as configured for atomic writes;
force alignment can be done on any file...

> So why maintain 2 separate on disk inode flags for FORCEALIGN AND
> ATOMIC_WRITE?

the atomic write flag indicates that a file has been set up
correctly for atomic writes to be able to issues reliably. force
alignment doesn't guarantee that - it's just a mechanism that tells
the allocator to behave a specific way.

> - Do you foresee FORCEALIGN to be also used at other places w/o
> ATOMIC_WRITE where feature differentiation between the two on an
> inode is required?

The already exist. For example, reliably allocating huge page
mappings on DAX filesystems requires 2MB forced alignment. 

> - Does the same reasoning will hold for XFS_SB_FEAT_RO_COMPAT_FORCEALIGN
> & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES too?

Same as above.

> - But why ro_compact for ATOMICWRITES? There aren't any on disk metadata
> changes within XFS filesystem to support atomic writes, right? 

Because if you downgrade the kernel to something that doesn't
support atomic writes, then non-atomic sized/aligned data can be
written to the file and/or torn writes can occur.

Worse, extent size hints that don't match the underlying hardware
support could be set up for inodes, and when the kernel is upgraded
again then atomic writes will fail on inodes that have atomic write
flags set on them....

> Is it something to just prevent users from destroying their own data
> by not allowing a rw mount from an older kernel where users could do
> unaligned writes to files marked for atomic writes?
> Or is there any other reasoning to prevent XFS filesystem from becoming
> inconsistent if an older kernel does a rw mount here.

The older kernel does not know what the unknown inode flag means
(i.e. atomic writes) and so, by definition, we cannot allow it to
modify metadata or file data because it may not modify it in the
correct way for that flag being set on the inode.

Kernels that don't understand feature flags need to treat the
filesystem as read-only, no matter how trivial the feature addition
might seem.

> > so unmap alignment
> > has to be set up correctly at file offset level before we even know
> > what extents underly the file range we need to unmap....
> >
> >>      If the start or end of the extent which needs unmapping is
> >>      unaligned then we convert that extent to unwritten and skip,
> >>      is it? (__xfs_bunmapi())
> >
> > The high level code should be aligning the start and end of the
> > file range to be removed via xfs_inode_alloc_unitsize(). Hence 
> > the low level __xfs_bunmapi() code shouldn't ever be encountering
> > unaligned unmaps on force-aligned inodes.
> >
> 
> Yes, but isn't this code snippet trying to handle a case when it finds an
> unaligned extent during unmap?

It does exactly that.

> And what we are essentially trying to 
> do here is leave the unwritten extent as is and if the found extent is
> written then convert to unwritten and skip it (goto nodelete). This
> means with forcealign if we encounter an unaligned extent then the file
> will have this space reserved as is with extent marked unwritten. 
> 
> Is this understanding correct?

Yes, except for the fact that this code is not triggered by force
alignment.

This code is preexisting realtime file functionality. It is used
when the rtextent size is larger than a single filesytem block.

In these configurations, we do allocation in rtextsize units, but we
track written/unwritten extents in the BMBT on filesystem block
granularity. Hence we can have unaligned written/unwritten extent
boundaries, and if we aren't unmapping a whole rtextent then we
simply mark the unused portion of it as unwritten in the BMBT to
indicate it contains zeroes.

IOWs, existing RT files have different IO alignment and
written/unwritten extent conversion behaviour to the new forced
alignment feature. The implementation code is shared in many places,
but the higher level forced alignment functionality ensures there
are never unaligned unwritten extents created or unaligned
unmappings asked for. Hence this code does not trigger for the
forced alignment cases.

i.e. we have multiple seperate allocation alignment behaviours that
share an implementation, but they do not all behave exactly the same
way or provide the same alignment guarantees....

-Dave.
-- 
Dave Chinner
david@...morbit.com