Message-ID: <68adae58-459e-488a-951c-127cc472f123@oracle.com>
Date: Thu, 13 Mar 2025 06:11:12 +0000
From: John Garry <john.g.garry@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>, Dave Chinner <david@...morbit.com>
Cc: Christoph Hellwig <hch@...radead.org>, brauner@...nel.org, cem@...nel.org,
        linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, ojaswin@...ux.ibm.com,
        ritesh.list@...il.com, martin.petersen@...cle.com
Subject: Re: [PATCH v5 03/10] xfs: Refactor xfs_reflink_end_cow_extent()

On 13/03/2025 04:51, Darrick J. Wong wrote:
>> Hence if we are walking a range of extents in the BMBT to unmap
>> them, then we should only be generating 2 intents per loop - a BUI
>> for the BMBT removal and a CUI for the shared refcount decrease.
>> That means we should be able to run at least a thousand iterations
>> of that loop per transaction without getting anywhere near the
>> transaction reservation limits.
>>
>> *However!*
>>
>> We have to relog every intent we haven't processed in the deferred
>> batch every-so-often to prevent the outstanding intents from pinning
>> the tail of the log. Hence the larger the number of intents in the
>> initial batch, the more work we have to do later on (and the more
>> overall log space and bandwidth they will consume) to relog them
>> over and over again until they pop to the head of the
>> processing queue.
>>
>> Hence there is no real performance advantage to creating massive intent
>> batches because we end up doing more work later on to relog those
>> intents to prevent journal space deadlocks. It also doesn't speed up
>> processing, because we still process the intent chains one at a time
>> from start to completion before moving on to the next high level
>> intent chain that needs to be processed.
>>
>> Further, after the first couple of intent chains have been
>> processed, the initial log space reservation will have run out, and
>> we are now asking for a new reservation on every transaction roll we
>> do. i.e. we are now doing a log space reservation on every
>> transaction roll in the processing chain instead of only doing it
>> once per high level intent chain.
>>
>> Hence from a log space accounting perspective (the hottest code path
>> in the journal), it is far more efficient to perform a single high
>> level transaction per extent unmap operation than it is to batch
>> intents into a single high level transaction.
>>
>> My advice is this: we should never batch high level iterative
>> intent-based operations into a single transaction because it's a
>> false optimisation.  It might look like it is an efficiency
>> improvement from the high level, but it ends up hammering the hot,
>> performance critical paths in the transaction subsystem much, much
>> harder and so will end up being slower than the single transaction
>> per intent-based operation algorithm when it matters most....
> How specifically do you propose remapping all the extents in a file
> range after an untorn write?  The regular cow ioend does a single
> transaction per extent across the entire ioend range and cannot deliver
> untorn writes.  This latest proposal does, but now you've torn that idea
> down too.
> 
> At this point I have run out of ideas and conclude that I can only
> submit to your superior intellect.
> 
> --D

I'm hearing that we can fit thousands of intents without getting 
anywhere near the limits - this is good.
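As a toy model (not XFS code - the function names and the relog cadence 
here are illustrative assumptions), the relog-traffic argument above can 
be sketched as: every intent still pending in a deferred batch must be 
relogged periodically so it does not pin the tail of the log, so the log 
traffic for a large up-front batch grows much faster than for one intent 
per high-level transaction:

```python
def log_records_written(batch_size: int, relog_interval: int = 1) -> int:
    """Count log records written while processing `batch_size` intents
    that were all created up front in a single transaction.

    Each step completes one intent; every intent still pending is
    relogged once per `relog_interval` steps (a toy cadence).
    """
    records = batch_size  # initial logging of every intent in the batch
    for step in range(batch_size):
        pending = batch_size - step - 1  # intents not yet processed
        if step % relog_interval == 0:
            records += pending  # relog everything still outstanding
    return records

def per_op(n: int) -> int:
    # One intent per high-level transaction: nothing sits in a deferred
    # backlog, so each intent is logged exactly once.
    return n

print(log_records_written(4))     # small batch: modest relog overhead
print(log_records_written(1000))  # large batch: ~n^2/2 relog records
print(per_op(1000))               # per-op: exactly n records
```

The quadratic blow-up in the batched case is what makes the batch look 
cheap at a high level but expensive in the hot log-accounting path.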

But batching is also not optimal in terms of performance, right? 
Performance is not so important here. This is for a software fallback, 
which we should not frequently hit. And even if we do, we're still 
typically not going to have many extents.

For our specific purpose, we want 16KB atomic writes - that is a max of 
4 extents. So this does not really sound like something to be concerned 
with at these atomic write sizes.
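The extent count is just worst-case arithmetic, assuming a 4KB 
filesystem block size (an assumption - the thread does not state it): a 
remap can produce at most one extent per block, so the atomic write unit 
divided by the block size bounds the extent count:

```python
BLOCK_SIZE = 4 * 1024  # assumed filesystem block size, in bytes

def max_extents(awu_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    # Worst case: every block of the atomic write unit lands in its
    # own extent, so the bound is simply awu / block size.
    return awu_bytes // block_size

print(max_extents(16 * 1024))  # 16KB atomic write unit -> 4 extents
print(max_extents(64 * 1024))  # hypothetical 64KB FS awu max -> 16
```

Even the arbitrary 64KB cap mentioned below stays well under the 
thousands of intents discussed above.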

We can add some arbitrary FS awu max, like 64KB, if that makes people 
feel more comfortable.

Thanks,
John
