[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZjLjYsjTJGSdWZ9q@dread.disaster.area>
Date: Thu, 2 May 2024 10:50:42 +1000
From: Dave Chinner <david@...morbit.com>
To: John Garry <john.g.garry@...cle.com>
Cc: djwong@...nel.org, hch@....de, viro@...iv.linux.org.uk,
brauner@...nel.org, jack@...e.cz, chandan.babu@...cle.com,
willy@...radead.org, axboe@...nel.dk, martin.petersen@...cle.com,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
tytso@....edu, jbongio@...gle.com, ojaswin@...ux.ibm.com,
ritesh.list@...il.com, mcgrof@...nel.org, p.raghav@...sung.com,
linux-xfs@...r.kernel.org, catherine.hoang@...cle.com
Subject: Re: [PATCH v3 08/21] xfs: Introduce FORCEALIGN inode flag
On Wed, May 01, 2024 at 11:03:06AM +0100, John Garry wrote:
>
> > > +/* Validate the forcealign inode flag */
> > > +xfs_failaddr_t
> > > +xfs_inode_validate_forcealign(
> > > + struct xfs_mount *mp,
> > > + uint16_t mode,
> >
> > umode_t mode,
>
> ok. BTW, other functions like xfs_inode_validate_extsize() use uint16_t
>
> >
> > > + uint16_t flags,
> > > + uint32_t extsize,
> > > + uint32_t cowextsize)
> >
> > extent sizes are xfs_extlen_t types.
>
> ok
>
> >
> > > +{
> > > + /* superblock rocompat feature flag */
> > > + if (!xfs_has_forcealign(mp))
> > > + return __this_address;
> > > +
> > > + /* Only regular files and directories */
> > > + if (!S_ISDIR(mode) && !S_ISREG(mode))
> > > + return __this_address;
> > > +
> > > + /* Doesn't apply to realtime files */
> > > + if (flags & XFS_DIFLAG_REALTIME)
> > > + return __this_address;
> >
> > Why not? A rt device with an extsize of 1 fsb could make use of
> > forced alignment just like the data device to allow larger atomic
> > writes to be done. I mean, just because we haven't written the code
> > to do this yet doesn't mean it is an illegal on-disk format state.
>
> ok, so where is a better place to disallow forcealign for RT now (since we
> have not written the code to support it nor verified it)?
Just don't allow it to be set in the setattr ioctl if the inode is
RT. ANd don't let an inode be marked RT if forcealign is already
set.
>
> >
> > > + /* Requires a non-zero power-of-2 extent size hint */
> > > + if (extsize == 0 || !is_power_of_2(extsize) ||
> > > + (mp->m_sb.sb_agblocks % extsize))
> > > + return __this_address;
> >
> > Please do these as indiviual checks with their own fail address.
>
> ok
>
> > That way we can tell which check failed from the console output.
> > Also, the agblocks check is already split out below, so it's being
> > checked twice...
> >
> > Also, why does force-align require a power-of-2 extent size? Why
> > does it require the extent size to be an exact divisor of the AG
> > size? Aren't these atomic write alignment restrictions? i.e.
> > shouldn't these only be enforced when the atomic writes inode flag
> > is set?
>
> With regards the power-of-2 restriction, I think that the code changes are
> going to become a lot more complex if we don't enforce this for forcealign.
>
> For example, consider xfs_file_dio_write(), where we check for an unaligned
> write based on forcealign extent mask. It's much simpler to rely on a
> power-of-2 size. And same for iomap extent zeroing.
But it's not more complex - we already do this non-power-of-2
alignment stuff for all the realtime code, so it's just a matter
of not blindly using bit masking in alignment checks.
> So then it can be asked, for what reason do we want to support unorthodox,
> non-power-of-2 sizes? Who would want this?
I'm constantly surprised by the way people use stuff like this
filesystem and storage alignment constraints are not arbitrarily
limited to power-of-2 sizes.
For example, code implementation is simple in RAID setups when you
use power-of-2 chunk sizes and stripe widths. But not all storage
hardware fits power-of-2 configs like 4+1, 4+2, 8+1, 8+2, etc. THis
is pretty common - 2.5" 2U drive trays have 24 drive bays. If you
want to give up 33% of the storage capacity just to use power-of-2
stripe widths then you would use 4x4+2 RAID6 luns. However, most
people don't want to waste that much money on redundancy. They are
much more likely to use 2x10+2 RAID6 luns or 1x21+2 with a hot spare
to maximise the data storage capacity.
If someone wants to force-align allocation to stripe widths on such
a RAID array config rather than trying to rely on the best effort
swalloc mount option, then they need non-power-of-2
alignments to be supported.
It's pretty much a no-brainer - the alignment code already handles
non-power-of-2 alignments, and it's not very much additional code to
ensure we can handle any alignment the user specified.
> As for AG size, again I think that it is required to be aligned to the
> forcealign extsize. As I remember, when converting from an FSB to a DB, if
> the AG itself is not aligned to the forcealign extsize, then the DB will not
> be aligned to the forcealign extsize. More below...
>
> >
> > > + /* Requires agsize be a multiple of extsize */
> > > + if (mp->m_sb.sb_agblocks % extsize)
> > > + return __this_address;
> > > +
> > > + /* Requires stripe unit+width (if set) be a multiple of extsize */
> > > + if ((mp->m_dalign && (mp->m_dalign % extsize)) ||
> > > + (mp->m_swidth && (mp->m_swidth % extsize)))
> > > + return __this_address;
> >
> > Again, this is an atomic write constraint, isn't it?
>
> So why do we want forcealign? It is to only align extent FSBs?
Yes. forced alignment is essentially just extent size guarantees.
This is part of what is needed for atomic writes, but atomic writes
also require specific physical storage alignment between the
filesystem and the device. The filesystem setup has to correctly
align AGs to the physical storage, and stuff like RAID
configurations need to be specifically compatible with the atomic
write capabilities of the underlying hardware.
None of these hardware iand storage stack alignment constraints have
any relevance to the filesystem forced alignment functionality. They
are completely indepedent. All the forced alignment does is
guarantees that allocation is aligned according the extent size hint
on the inode or it fails with ENOSPC.
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index d0e2cec6210d..d1126509ceb9 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
> > > di_flags2 |= XFS_DIFLAG2_DAX;
> > > if (xflags & FS_XFLAG_COWEXTSIZE)
> > > di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> > > + if (xflags & FS_XFLAG_FORCEALIGN)
> > > + di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
> > > return di_flags2;
> > > }
> > > @@ -1146,6 +1148,22 @@ xfs_ioctl_setattr_xflags(
> > > if (i_flags2 && !xfs_has_v3inodes(mp))
> > > return -EINVAL;
> > > + /*
> > > + * Force-align requires a nonzero extent size hint and a zero cow
> > > + * extent size hint. It doesn't apply to realtime files.
> > > + */
> > > + if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
> > > + if (!xfs_has_forcealign(mp))
> > > + return -EINVAL;
> > > + if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
> > > + return -EINVAL;
> > > + if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
> > > + FS_XFLAG_EXTSZINHERIT)))
> > > + return -EINVAL;
> > > + if (fa->fsx_xflags & FS_XFLAG_REALTIME)
> > > + return -EINVAL;
> > > + }
> >
> > What about if the file already has shared extents on it (i.e.
> > reflinked or deduped?)
>
> At the top of the function we have this check for RT:
>
> if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> /* Can't change realtime flag if any extents are allocated. */
> if (ip->i_df.if_nextents || ip->i_delayed_blks)
> return -EINVAL;
> }
>
> Would expanding that check for forcealign also suffice? Indeed, later in
> this series I expanded this check to cover atomicwrites (when I really
> intended it for forcealign).
For the moment, yes.
> > > @@ -1263,7 +1283,19 @@ xfs_ioctl_setattr_check_extsize(
> > > failaddr = xfs_inode_validate_extsize(ip->i_mount,
> > > XFS_B_TO_FSB(mp, fa->fsx_extsize),
> > > VFS_I(ip)->i_mode, new_diflags);
> > > - return failaddr != NULL ? -EINVAL : 0;
> > > + if (failaddr)
> > > + return -EINVAL;
> > > +
> > > + if (new_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
> > > + failaddr = xfs_inode_validate_forcealign(ip->i_mount,
> > > + VFS_I(ip)->i_mode, new_diflags,
> > > + XFS_B_TO_FSB(mp, fa->fsx_extsize),
> > > + XFS_B_TO_FSB(mp, fa->fsx_cowextsize));
> > > + if (failaddr)
> > > + return -EINVAL;
> > > + }
> >
> > Oh, it's because you're trying to use on-disk format validation
> > routines for user API validation. That, IMO, is a bad idea because
> > the on-disk format and kernel/user APIs should not be tied
> > together as they have different constraints and error conditions.
> >
> > That also explains why xfs_inode_validate_forcealign() doesn't just
> > get passed the inode to validate - it's because you want to pass
> > information from the user API to it. This results in sub-optimal
> > code for both on-disk format validation and user API validation.
> >
> > Can you please separate these and put all the force align user API
> > validation checks in the one function?
> >
>
> ok, fine. But it would be good to have clarification on function of
> forcealign, above, i.e. does it always align extents to disk blocks?
No, it doesn't. XFS has never done this - physical extent alignment
is always done relative to the start of the AG, not the underlying
disk geometry.
IOWs, forced alignement is not aligning to disk blocks at all - it
is aligning extents logically to file offset and physically to the
offset from the start of the allocation group. Hence there are no
real constraints on forced alignment - we can do any sort of
alignment as long it is smaller than half the max size of a physical
extent.
For allocation to then be aligned to physical storage, we need mkfs
to physically align the start of each AG to the geometry of the
underlying storage. We already do this for filesystems with a stripe
unit defined, hence stripe aligned allocation is physically aligned
to the underlying storage.
However, if mkfs doesn't get the physical layout of AGs right, there
is nothing the mounted filesystem can do to guarantee extent
allocation is aligned to physical disk blocks regardless of whether
forced alignment is enabled or not...
-Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists