linux-kernel - Re: [PATCH v2 07/13] xfs: Introduce FORCEALIGN inode flag


Message-ID: <20240723222603.GS612460@frogsfrogsfrogs>
Date: Tue, 23 Jul 2024 15:26:03 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: John Garry <john.g.garry@...cle.com>
Cc: Christoph Hellwig <hch@....de>, Dave Chinner <david@...morbit.com>,
	chandan.babu@...cle.com, dchinner@...hat.com,
	viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
	linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, catherine.hoang@...cle.com,
	martin.petersen@...cle.com, Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v2 07/13] xfs: Introduce FORCEALIGN inode flag

On Tue, Jul 23, 2024 at 04:01:41PM +0100, John Garry wrote:
> On 23/07/2024 15:42, Christoph Hellwig wrote:
> > On Tue, Jul 23, 2024 at 11:11:28AM +0100, John Garry wrote:
> > > I am looking at something like this to implement read-only for those inodes:
> > 
> > Yikes.  Treating individual inodes in a file system as read-only
> > is about the most confusing and harmful behavior we could do.
> 
> That was the suggestion which I was given earlier in this thread.

Well, Christoph and I suggested failing the mount /earlier/ in this
thread. ;)

> > 
> > Just treat it as any other rocompat feature please and mount the entire
> > file system read-only if not supported.
> > 
> > Or even better let this wait a little, and work with Darrick to work
> > on the rextsize > 1 reflink patches and just make the thing work.
> 
> I'll let Darrick comment on this.

COW with alloc_unit > fsblock is not currently possible, whether it's
forcealign or rtreflink because COW must happen at allocation unit
granularity.  Pure overwrites don't need all these twists and turns.

1. For COW to work, each write/page_mkwrite must mark dirty every
fsblock in the entire alloc unit.  Those fsblocks could be cached by
multiple folios, which means (in iomap terms) dirtying each block in
potentially multiple iomap_folio_state structures, as well as their
folios.

2. Similarly, writeback must then be able to issue IO in quantities that
are aligned to allocation units.  IOWs, for every dirty region in the
file, we'd have to find the folios for a given allocation unit, mark
them all for writeback, and issue bios for however much we managed to
do.  If it's not possible to grab a folio, then the entire allocation
unit can't be written out, which implies that writeback can fail to
fully clean folios.

3. Alternately I suppose we could track the number of folios undergoing
writeback for each allocation unit, issue the writeback ios whenever
we're ready, and only remap the allocation unit when the number of
folios undergoing writeback for that allocation unit reaches zero.
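
To make the arithmetic behind points 1 and 2 concrete, here is a minimal
userspace sketch of expanding a write range out to allocation unit
boundaries and counting the fsblocks that would have to be dirtied.  It is
not kernel code: struct byte_range, round_to_alloc_unit() and
blocks_to_dirty() are hypothetical names made up for illustration, and the
sketch uses division instead of masking so that non power-of-two
allocation units also work.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not a kernel API: half-open range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Expand [pos, pos + len) outward to allocation unit boundaries.
 * Division is used rather than bit masking so that non power-of-two
 * allocation units are handled as well.
 */
static struct byte_range
round_to_alloc_unit(uint64_t pos, uint64_t len, uint64_t alloc_unit_bytes)
{
	struct byte_range r;

	assert(alloc_unit_bytes > 0 && len > 0);
	r.start = (pos / alloc_unit_bytes) * alloc_unit_bytes;
	r.end = ((pos + len + alloc_unit_bytes - 1) / alloc_unit_bytes) *
		alloc_unit_bytes;
	return r;
}

/* Number of fsblocks covered by the expanded range. */
static uint64_t
blocks_to_dirty(struct byte_range r, uint64_t fsblock_bytes)
{
	assert(fsblock_bytes > 0);
	return (r.end - r.start) / fsblock_bytes;
}
```

With 4096-byte fsblocks and a 12288-byte allocation unit, a 100-byte write
at offset 5000 expands to [0, 12288), i.e. three fsblocks, any of which
may live in a different folio.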

If we could get the mapping_set_folio_order patch merged, then we could
at least get partial support for power-of-two alloc_unit > fsblock
configurations by setting the minimum folio order to log2(alloc_unit).
For atomic writes this is probably a hard requirement because we must be
able to submit one bio with one memory region.

For everyone else this sucks because cranking up the min folio order
reduces the flexibility that the page cache can have in finding cache
memory... but until someone figures out how to make the batching work,
there's not much progress to be made.

For non power-of-two alloc_unit we can't just crank up the min folio
order because there will always be misalignments somewhere; we need a
full writeback batching implementation that can handle multiple folios
per alloc unit and partial folio writeback.
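
A rough sketch of the order calculation, again in plain userspace C and
not taken from any patch: min_folio_order() is a hypothetical helper, and
I'm assuming the allocation unit and page size are both given in bytes so
that the order comes out in units of pages.  It returns -1 for the non
power-of-two case, where no single folio order can line up with the
allocation unit.

```c
#include <assert.h>
#include <stdint.h>

static int is_power_of_two(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper: smallest folio order such that one folio covers a
 * whole allocation unit.  Returns -1 when the allocation unit is not a
 * power of two, since no folio order can stay aligned to it.
 */
static int min_folio_order(uint64_t alloc_unit_bytes, uint64_t page_bytes)
{
	int order = 0;

	assert(is_power_of_two(page_bytes));
	if (!is_power_of_two(alloc_unit_bytes))
		return -1;
	while ((page_bytes << order) < alloc_unit_bytes)
		order++;
	return order;
}
```

For example, a 16k allocation unit on 4k pages needs order 2, whereas a
12k allocation unit has no usable order at all.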

djwong-dev implements 1.  It partially handles 2 by enlarging the wbc
range to be aligned to allocation units, but it doesn't guarantee that
all the folios actually got tagged for the batch.  It can't do 3, which
means that it's probably broken if you press it hard enough.

Alternately we could disallow non power-of-two everywhere, which would
make the accounting simpler but that's a regression against ye olde xfs
which supports non power-of-two allocation units.

rtreflink is nowhere near ready to go -- it's still in djwong-wtf behind
metadata directories, rtgroups, realtime rmap, and (probably) hch's
zns patches.

> > > > So what about forcealign and RT?
> > > 
> > > Any opinion on this?
> > 
> > What about forcealign and RT?
> 
> In this series version I was mounting the whole FS as RO if
> XFS_FEAT_FORCEALIGN and XFS_FEAT_REFLINK were found in the SB. That is very
> different from how I was going to treat individual inodes which happen to be
> forcealign and reflink, above.
> 
> So I was asking for guidance on whether that approach (for RT and forcealign)
> is sound.

I reiterate: don't allow mounting of (forcealign && reflink) or
(forcealign && rtextsize > 1) filesystems, and then you and I can work
on figuring out the rest.
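
Spelled out as a predicate, the rule would look something like the sketch
below.  This is illustrative userspace C only: struct sb_features and
mount_allowed() are made-up names, not the actual XFS superblock
structures or mount path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the relevant superblock state. */
struct sb_features {
	bool		forcealign;
	bool		reflink;
	uint32_t	rtextsize;	/* realtime extent size, in fsblocks */
};

/*
 * Refuse the mount outright for the combinations that cannot work yet:
 * (forcealign && reflink) or (forcealign && rtextsize > 1).
 */
static bool mount_allowed(const struct sb_features *f)
{
	assert(f != NULL);
	if (f->forcealign && f->reflink)
		return false;
	if (f->forcealign && f->rtextsize > 1)
		return false;
	return true;
}
```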

--D