linux-ext4 - Re: [RFC 0/5] ext4: Implement support for extsize hints

Message-ID: <ZuqjU0KcCptQKrFs@dread.disaster.area>
Date: Wed, 18 Sep 2024 19:54:27 +1000
From: Dave Chinner <david@...morbit.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>,
	Ritesh Harjani <ritesh.list@...il.com>,
	linux-kernel@...r.kernel.org,
	"Darrick J . Wong" <djwong@...nel.org>,
	linux-fsdevel@...r.kernel.org, John Garry <john.g.garry@...cle.com>,
	dchinner@...hat.com
Subject: Re: [RFC 0/5] ext4: Implement support for extsize hints

On Wed, Sep 11, 2024 at 02:31:04PM +0530, Ojaswin Mujoo wrote:
> This patchset implements extsize hint feature for ext4. Posting this RFC to get
> some early review comments on the design and implementation bits. This feature
> is similar to what we have in XFS too with some differences.
> 
> extsize on ext4 is a hint to mballoc (multi-block allocator) and extent
> handling layer to do aligned allocations. We use allocation criteria 0
> (CR_POWER2_ALIGNED) for doing aligned power-of-2 allocations. With extsize hint
> we try to align the logical start (m_lblk) and length(m_len) of the allocation
> to be extsize aligned. CR_POWER2_ALIGNED criteria in mballoc automatically make
> sure that we get the aligned physical start (m_pblk) as well. So in this way
> extsize can make sure that lblk, len and pblk all are aligned for the allocated
> extent w.r.t extsize.
> 
> Note that extsize feature is just a hinting mechanism to ext4 multi-block
> allocator. That means that if we are unable to get an aligned allocation for
> some reason, then we drop this flag and continue with unaligned allocation to
> serve the request. However when we will add atomic/untorn writes support, then
> we will enforce the aligned allocation and can return -ENOSPC if aligned
> allocation was not successful.
> 
> Comparison with XFS extsize feature -
> =====================================
> 1. extsize in XFS is a hint for aligning only the logical start and the length
>    of the allocation v/s extsize on ext4 make sure the physical start of the
>    extent gets aligned as well.

What happens when you can't align the physical start of the extent?
It fails the allocation with ENOSPC?

For XFS, the existing extent size behaviour is a hint, and so we
ignore the hint if we cannot perform the allocation with the
suggested alignment. i.e. We should not fail an allocation with an
extent size hint until we are actually very near ENOSPC.

With the new force-align feature, the physical alignment within an
AG gets aligned to the extent size. In this case, if we can't find
an aligned free extent to allocate, we fail the allocation (ENOSPC).
Hence with forced alignment, we can have ENOSPC occur when there are
large amounts of free space available in the filesystem.

This is almost certainly what most people -don't want-, but it is a
requirement for atomic writes. To make matters worse, this behaviour
will almost certainly get worse as the filesystem ages and free space
slowly fragments over time.

IOWs, by making the ext4 extsize have forced alignment semantics by
default, it means users will see ENOSPC a lot more frequently and
in situations where it is most definitely not expected.

We also have to keep in mind that there are applications out there
that set and use extent size hints, and so enabling extsize in ext4
will result in those applications silently starting to use them. If
ext4 supporting extsize hints drastically changes the behaviour of
the filesystem then that is going to cause significant unexpected
regressions for users as they upgrade kernels and filesystems.

Hence I strongly suggest that ext4 implements extent size hints in
the same way that XFS does. i.e. unless forced alignment has been
enabled for the inode, extsize is just a hint that gets discarded if
aligned allocation does not succeed.
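
To make the difference between the two semantics concrete, here is a
minimal sketch over a toy free-space list. It is not XFS or ext4 code:
the names find_free_run() and allocate(), the (start, length) run
representation and the first-fit search are all assumptions made for
illustration only.

```python
import errno


def find_free_run(free_runs, length, align=1):
    """First-fit search: return a start block that is a multiple of
    `align` and has `length` free blocks after it, or None."""
    for start, run_len in free_runs:
        aligned = -(-start // align) * align  # round start up to alignment
        if aligned + length <= start + run_len:
            return aligned
    return None


def allocate(free_runs, length, extsize, force_align=False):
    """Return (start_block, was_aligned).

    Hint semantics: try an extsize-aligned allocation first and silently
    fall back to an unaligned one if that fails.
    Forced alignment: no fallback, fail with ENOSPC instead.
    """
    start = find_free_run(free_runs, length, align=extsize)
    if start is not None:
        return start, True
    if force_align:
        raise OSError(errno.ENOSPC, "no aligned free extent")
    start = find_free_run(free_runs, length)
    if start is None:
        raise OSError(errno.ENOSPC, "no free space")
    return start, False


# Fragmented toy filesystem: 9 free blocks, but no aligned run of 4.
fragmented = [(1, 4), (6, 5)]
print(allocate(fragmented, 4, 4))  # hint: falls back to block 1, unaligned
```

With force_align=True the same request raises ENOSPC even though nine
blocks are free, which is the surprise described above.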

Behaviour such as extent size hinting *should* be the same across
all filesystems that provide this functionality.  This makes using
extent size hints much easier for users, admins and application
developers. The last thing I want to hear is application devs tell
me at conferences that "we don't use extent size hints anymore
because ext4..."

> 2. eof allocation on XFS trims the blocks allocated beyond eof with extsize
>    hint. That means on XFS for eof allocations (with extsize hint) only logical
>    start gets aligned.

I'm not sure I understand what you are saying here. XFS does extsize
alignment of both the start and end of post-eof extents the same as
it does for extents within EOF. For example:

# xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "bmap -vvp" foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0308 sec (129.815 KiB/sec and 32.4538 ops/sec)
foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          256504..256511    0 (256504..256511)     8 000000
   1: [8..31]:         256512..256535    0 (256512..256535)    24 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent

There's a 4k written extent at 0, and a 12k unwritten extent
beyond EOF at 4k. i.e. we have a 16kB extent, as the hint
requires, that is correctly aligned beyond EOF.

If I then write another 4k at 20k (beyond both EOF and the unwritten
extent beyond EOF):

# xfs_io -fdc "truncate 0" -c "extsize 16k" -c "pwrite 0 4k" -c "pwrite 20k 4k" -c "bmap -vvp" foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0210 sec (190.195 KiB/sec and 47.5489 ops/sec)
wrote 4096/4096 bytes at offset 20480
4 KiB, 1 ops; 0.0001 sec (21.701 MiB/sec and 5555.5556 ops/sec)
foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          180000..180007    0 (180000..180007)     8 000000
   1: [8..39]:         180008..180039    0 (180008..180039)    32 010000
   2: [40..47]:        180040..180047    0 (180040..180047)     8 000000
   3: [48..63]:        180048..180063    0 (180048..180063)    16 010000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent

You can see we did contiguous allocation of another 16kB at offset
16kB, and then wrote to 20k for 4kB. i.e. the new extent was
correctly aligned at both ends, as the extsize hint says it should
be....
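
The rounding behind those two bmap listings can be sketched as follows.
This is only the arithmetic, assuming the 512-byte units that bmap
reports; the helper names are made up for the example and are not XFS
functions.

```python
SECTOR = 512  # bmap output is in 512-byte units


def extsize_aligned_range(offset, length, extsize):
    """Round a write [offset, offset + length) outwards to extsize
    boundaries. Returns (start, end) in bytes, end exclusive."""
    start = (offset // extsize) * extsize
    end = -(-(offset + length) // extsize) * extsize  # round up
    return start, end


def to_sectors(start, end):
    """Convert a byte range to an inclusive 512-byte sector range."""
    return start // SECTOR, end // SECTOR - 1


K = 1024
# pwrite 0 4k with a 16k extsize: allocation covers sectors 0..31
print(to_sectors(*extsize_aligned_range(0, 4 * K, 16 * K)))       # (0, 31)
# pwrite 20k 4k: allocation covers sectors 32..63
print(to_sectors(*extsize_aligned_range(20 * K, 4 * K, 16 * K)))  # (32, 63)
```

The first range is the written [0..7] extent plus the unwritten [8..31]
extent; the second accounts for [32..63], where [40..47] is the written
part.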

>    However extsize hint in ext4 for eof allocation is not
>    supported in this version of the series.

If you can't do extsize aligned allocations for EOF extension, then
how do applications use atomic writes to atomically extend the file?

> 3. XFS allows extsize to be set on file with no extents but delayed data.

It does?

<looks>

Yep, it doesn't check ip->i_delayed_blks is zero when changing
extsize.

I think that's simply a bug, not intended behaviour, because
delalloc will not have reserved space for the extsize hint rounding
needed when writeback occurs. Can you send a patch to add this
check?

>    However, ext4 doesn't allow that for simplicity. The user is expected to set
>    it on a file before changing its i_size.

We don't actually care about i_size in XFS - the determining factor
is whether there are extents allocated on disk. i.e. we can truncate
up and then set the extent size hint because there are no extents
allocated even though the size is non-zero. 

There are almost certainly applications out there that change extent
size after truncating to a non-zero size, so this needs to work on
ext4 the same way it does on XFS. Otherwise people are going to
complain that their applications suddenly stop working properly on
ext4....
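
As a rough sketch of the rule being described (allocated extents decide,
i_size does not, and the delalloc check is the one suggested above as a
fix), assuming a made-up InodeState record rather than any real kernel
structure:

```python
from dataclasses import dataclass


@dataclass
class InodeState:
    """Hypothetical stand-in for the inode fields that matter here."""
    size: int          # i_size in bytes; deliberately not consulted below
    nextents: int      # extents allocated on disk
    delayed_blks: int  # blocks reserved by delayed allocation


def may_set_extsize(inode, check_delalloc=True):
    """Allow changing the hint only while no extents are allocated.

    check_delalloc=True models the additional check proposed in this
    thread; False models the behaviour described as a bug.
    """
    if inode.nextents > 0:
        return False
    if check_delalloc and inode.delayed_blks > 0:
        return False
    return True


# Truncated up to 1MiB but nothing allocated yet: setting the hint is fine.
print(may_set_extsize(InodeState(size=1 << 20, nextents=0, delayed_blks=0)))
```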

> 4. XFS allows non-power-of-2 values for extsize but ext4 does not, since we
>    primarily would like to support atomic writes with extsize.

Yes, ext4 can make that restriction if desired.

Keep in mind that the XFS atomic write support is still evolving,
and I think the way we are using extent size hints isn't fully
solidified yet.

Indeed, I think that we can allow non-power-of-2 extent sizes for
atomic writes, because integer multiples of the atomic write unit
will still ensure that physical extents are properly aligned for
atomic writes to succeed.  e.g. 24kB extent size is compatible with
8kB atomic write sizes.

To make that work efficiently unwritten extent boundaries need to be
maintained at atomic write alignments (8kB), not extent size
alignment (24kB), but other than that I don't think anything else is
needed....
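
The arithmetic behind the 24kB/8kB example can be checked with a small
sketch. The function names are invented for illustration, and this only
tests divisibility; it says nothing about how either filesystem would
actually validate the values.

```python
def extsize_ok_for_atomic(extsize, awu):
    """True if extsize is a whole multiple of the atomic write unit."""
    return extsize > 0 and awu > 0 and extsize % awu == 0


def atomic_write_slots(extent_start, extsize, awu):
    """Byte offsets of each awu-sized slot inside one extent that
    starts on an extsize boundary."""
    return list(range(extent_start, extent_start + extsize, awu))


K = 1024
print(extsize_ok_for_atomic(24 * K, 8 * K))  # True: 24k = 3 * 8k
print(extsize_ok_for_atomic(20 * K, 8 * K))  # False: 20k is not a multiple

# Every slot in a 24k-aligned extent is also 8k-aligned.
slots = atomic_write_slots(24 * K, 24 * K, 8 * K)
print(all(s % (8 * K) == 0 for s in slots))  # True
```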

This is desirable because it will allow extent size hints to remain
usable for their original purposes even with atomic writes on XFS.
i.e. fragmentation minimisation for small random DIO write workloads
(exactly the sort of IO you'd consider using atomic writes for!),
alignment of extents to [non-power-of-2] RAID stripe geometry, etc.

> 5. In ext4 we chose to store the extsize value in SYSTEM_XATTR rather than an
>    inode field as it was simple and most flexible, since there might be more
>    features like atomic/untorn writes coming in future.

Does that mean you can query and set it through the user xattr
interfaces? If so, how do you enforce the values users set are
correct?

> 6. In buffered-io path XFS switches to non-delalloc allocations for extsize hint.
>    The same has been kept for EXT4 as well.

That's an internal XFS implementation detail that you don't need to
replicate. Historically speaking, we didn't use unwritten extents
for delayed allocation and so we couldn't do within-EOF extsize
unaligned writes without adding special additional zero-around code to
ensure that we never exposed stale data to userspace from the extra
allocation that the data write did not cover.

We now use unwritten extents for delalloc conversion, so this stale
data exposure issue no longer exists. We should really switch this
code back to using delalloc because it is much faster and less
fragmentation prone than direct extsize allocation....

-Dave.
-- 
Dave Chinner
david@...morbit.com