linux-kernel - Re: [PATCH v4 00/14] forcealign for xfs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZuoSDVJPOX44Fww/@dread.disaster.area>
Date: Wed, 18 Sep 2024 09:34:37 +1000
From: Dave Chinner <david@...morbit.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: John Garry <john.g.garry@...cle.com>,
	Ritesh Harjani <ritesh.list@...il.com>, chandan.babu@...cle.com,
	dchinner@...hat.com, hch@....de, viro@...iv.linux.org.uk,
	brauner@...nel.org, jack@...e.cz, linux-xfs@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	catherine.hoang@...cle.com, martin.petersen@...cle.com
Subject: Re: [PATCH v4 00/14] forcealign for xfs

On Tue, Sep 17, 2024 at 01:54:20PM -0700, Darrick J. Wong wrote:
> On Mon, Sep 16, 2024 at 11:24:56AM +0100, John Garry wrote:
> > On 16/09/2024 08:03, Dave Chinner wrote:
> > > OTOH, we can't do this with atomic writes. Atomic writes require
> > > some mkfs help because they require explicit physical alignment of
> > > the filesystem to the underlying storage.
> 
> Forcealign requires agsize%extsize==0,

No it doesn't.

AG size is irrelevant when aligning extents - all
that matters is that we can find a free extent that can be trimmed
to the alignment defined by extsize. 

> it's atomicwrites that adds the
> requirement that extsize be a power of 2...

Only by explicit implementation constraint.

Atomic writes do not require power of two extsize - they only
require correctly aligned physical extents.

e.g an 8kB atomic write is always guaranteed to succeed if
we have an extsize of 24kB for laying out the physical extents
because a 24kB physical extent is always 8kB aligned and is an exact
multiple of 8kB. This meets the requirements for 8kB atomic writes to
always succeed, and hence there is no fundamental requirement for
extsize to be a power of 2.

We have *chosen* to simplify the implementation by only allowing
a single aligned atomic write to be issued at a time. This means
the alignment and size of atomic writes is always the minimum size
the hardware advertises, and that is (at present) always a power of
2.

Hence the "extsize needs to be a power of 2" comes from the
constraints exposed from the block device configuration (i.e.
minimum atomic write unit), not from a filesystem design or
implementation constraint.

At the filesystem level, we have further simplified things by only
allowing extsize = atomic write size. Hence the initial
implementation ends up only support power of 2 extsize values. This
is not a hard design or implementation constraint, however.

.....

hmmmmm.

.....

In writing this I've had further thoughts on force-align and the
sub-alloc-unit unwritten extent issue we've been discussing here.
i.e. I've stopped and considered the existing design constraints
given what I wrote above and considered what is needed for
supporting large extsize for small atomic writes.

I think we need to support large extsize with small atomic write
size for two reasons:

1. extsize is still going to be needed for preventing excessive
fragmentation with atomic writes. It's the small DIO write workloads
that see lots of fragmentation, and applications using atomic writes
are essentially being forced down the path of being small DIO write
workloads.

2. we can allow force-align w/o atomic writes behaviour to match the
existing rtvol sb_rextsize > 1 fsb behaviour without impacting
atomic write behaviour. (i.e. less behavioural differences, more
common code, better performance, etc).

To do this (and I think we do want to do this), then we have to do
two things:

1. force-align needs to add a "unwritten align" inode parameter to
allow sub-extsize unwritten extent boundaries to exist in the BMBT.
(i.e.  similar to how rt files w/ sb_rextsize > 1 fsb currently
behave.)

This is purely an in-memory value - for pure "force-align" we can
set it 1 fsb and then the behaviour will match existing RT
behaviour.  We can abstract this behaviour by replacing the hard
coded 1 block alignment for unwritten conversion with an "unwritten
align" value which would initially be set to 1.

We can also abstract this code away from being "rt specific" and
make it entirely dependent on "alloc-unit" configuration. This means
rt, force-align and atomic write will all be running the same code,
which makes testing a lot easier..

2. inodes with atomic write enabled must set the unwritten align
value to the atomic write size exposed by the hardware, and the
extsize must be an exact integer multiple of the unwritten align
size.

The initial settings of unwritten align == extsize gives the current
behaviour of all allocation and extent conversion being aligned to
atomic write constraints.

The separation of unwritten conversion from the extsize then allows
allows the situation I described above with 8kB atomic writes
and 24kB extsize. Because unwritten conversion is aligned to atomic
wriet boundaries, we can use sub-alloc-unit unwritten extents
without violating atomic write boundaries.

This would allow us to use extsize for atomic writes in the same
manner we use it for now - enable large contiguous allocations to
prevent fragmentation when doing lots of concurrent small
"immediate write" operations across many files.

I think this can all be added on top of the existing patchset - it's
not really a fundamental change to any of it. It's a little bit more
abstraction and unification, but it enables a lot more flexibility
for optimising atomic write functionality in the future.

Thoughts?

-Dave.
-- 
Dave Chinner
david@...morbit.com