Message-ID: <eldlhdvhc4sdlmfed5omg6huv5rl6m7ummstlygh2bownaejqn@bykrybkyywzp>
Date: Wed, 4 Feb 2026 15:23:46 +0100
From: Jan Kara <jack@...e.cz>
To: Baokun Li <libaokun1@...wei.com>
Cc: Theodore Tso <tytso@....edu>, Zhang Yi <yi.zhang@...weicloud.com>,
Christoph Hellwig <hch@...radead.org>, linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, adilger.kernel@...ger.ca, jack@...e.cz, ojaswin@...ux.ibm.com,
ritesh.list@...il.com, djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>,
yizhang089@...il.com, yangerkun@...wei.com,
yukuai@...-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com, libaokun9@...il.com
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path

On Wed 04-02-26 09:59:36, Baokun Li wrote:
> On 2026-02-03 21:14, Theodore Tso wrote:
> > On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> >> This means that the ordered journal mode is no longer used in ext4
> >> under the iomap infrastructure. The main reason is that iomap
> >> processes each folio one by one during writeback. It first holds the
> >> folio lock and then starts a transaction to create the block mapping.
> >> If we still use the ordered mode, we need to perform writeback in
> >> the logging process, which may require initiating a new transaction,
> >> potentially leading to deadlock issues. In addition, ordered journal
> >> mode indeed has many synchronization dependencies, which increase
> >> the risk of deadlocks, and I believe this is one of the reasons why
> >> ext4_do_writepages() is implemented in such a complicated manner.
> >> Therefore, I think we need to give up using the ordered data mode.
> >>
> >> Currently, there are three scenarios where the ordered mode is used:
> >> 1) append write,
> >> 2) partial block truncate down, and
> >> 3) online defragmentation.
> >>
> >> For append write, we can always allocate unwritten blocks to avoid
> >> using the ordered journal mode.
> > This is going to be a pretty severe performance regression, since it
> > means that we will be doubling the journal load for append writes.
> > What we really need to do here is to first write out the data blocks,
> > and then only start the transaction handle to modify the data blocks
> > *after* the data blocks have been written (to heretofore unused
> > blocks that were just allocated). It means inverting the order in
> > which we write data blocks for the append write case, and in fact it
> > will improve fsync() performance since we won't be gating writing the
> > commit block on the data blocks getting written out in the append
> > write case.
>
> I have some local demo patches doing something similar, and I think this
> work could be decoupled from Yi's patch set.
>
> Since inode preallocation (PA) maintains physical block occupancy with a
> logical-to-physical mapping, and ensures on-disk data consistency after
> power failure, it is an excellent location for recording temporary
> occupancy. Furthermore, since inode PA often allocates more blocks than
> requested, it can also help reduce file fragmentation.
>
> The specific approach is as follows:
>
> 1. Allocate only the PA during block allocation without inserting it into
> the extent status tree. Return the PA to the caller and increment its
> refcount to prevent it from being discarded.
>
> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
> refcount and return -EIO. If successful, proceed to the next step.
>
> 3. Start a handle upon successful IO completion to convert the inode PA to
> extents. Release the refcount and update the extent tree.
>
> 4. If a corresponding extent already exists, we’ll need to punch holes to
> release the old extent before inserting the new one.
Sounds good. Just to check that I understand correctly: case 4 would happen
only if you really try to do something like COW with this? Normally you'd
just use the already present blocks and write the contents into them?
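
For reference, the four steps quoted above can be modelled in a few lines of
userspace C. This is only a toy sketch under stated assumptions: toy_inode,
toy_pa and the toy_* helpers are hypothetical names, not ext4 or jbd2
interfaces; the extent tree is a flat array; and IO success is passed in as a
flag. It is meant to show the ordering only: data IO first, handle afterwards,
with step 4 triggering only when the target logical blocks are already mapped.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NBLOCKS  16     /* logical blocks in the toy file */
#define UNMAPPED (-1)

/* Hypothetical toy inode: logical block -> physical block, or UNMAPPED. */
struct toy_inode {
	int extent[NBLOCKS];
	int handles_started;    /* number of journal handles opened so far */
};

/* Hypothetical toy preallocation: reserved physical range for a logical range. */
struct toy_pa {
	int lstart, pstart, len;
	int refcount;
};

static void toy_inode_init(struct toy_inode *ino)
{
	for (int i = 0; i < NBLOCKS; i++)
		ino->extent[i] = UNMAPPED;
	ino->handles_started = 0;
}

/* Step 1: reserve blocks in a PA only; the extent map is not touched. */
static void toy_pa_alloc(struct toy_pa *pa, int lstart, int pstart, int len)
{
	assert(lstart >= 0 && len > 0 && lstart + len <= NBLOCKS);
	pa->lstart = lstart;
	pa->pstart = pstart;
	pa->len = len;
	pa->refcount = 1;       /* pinned so it cannot be discarded during IO */
}

static void toy_pa_put(struct toy_pa *pa)
{
	assert(pa->refcount > 0);
	pa->refcount--;
}

/*
 * Steps 2-4. Returns -EIO on IO failure, otherwise the number of old
 * mappings that had to be replaced (0 for a plain append or hole fill).
 */
static int toy_write(struct toy_inode *ino, struct toy_pa *pa, bool io_ok)
{
	int replaced = 0;

	/* Step 2: data IO targets the PA blocks before any handle exists. */
	if (!io_ok) {
		toy_pa_put(pa);
		return -EIO;
	}

	/* Step 3: IO succeeded, so only now start a handle. */
	ino->handles_started++;

	for (int i = 0; i < pa->len; i++) {
		int lblk = pa->lstart + i;

		/* Step 4: an old mapping exists, punch it before inserting. */
		if (ino->extent[lblk] != UNMAPPED) {
			ino->extent[lblk] = UNMAPPED;
			replaced++;
		}
		ino->extent[lblk] = pa->pstart + i;
	}

	toy_pa_put(pa);
	return replaced;
}
```

In this model, toy_write() on an empty file with a PA covering logical blocks
0-3 returns 0, since nothing needs punching. A later write over blocks that
are already mapped returns the number of old mappings replaced, which is the
COW-like case asked about above. A failed IO returns -EIO without ever
starting a handle.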
> This ensures data atomicity, while jbd2—being a COW-like implementation
> itself—ensures metadata atomicity. By leveraging this "delay map"
> mechanism, we can achieve several benefits:
>
> * Lightweight, high-performance COW.
> * High-performance software atomic writes (hardware-independent).
> * Replacing dioread_nolock, which might otherwise read unexpected zeros.
> * Replacing ordered data and data journal modes.
> * Reduced handle hold time, as it's only held during extent tree updates.
> * Paving the way for snapshot support.
>
> Of course, COW itself can lead to severe file fragmentation, especially
> in small-scale overwrite scenarios.
I agree the feature can provide very interesting benefits. We have been
pondering something like that for a long time, we just never got around to
implementing it. I'd say the immediate benefits are that you can completely
get rid of dioread_nolock as well as the legacy dioread_lock modes, so
overall code complexity should not increase much. We could also mostly get
rid of data=ordered mode use (although not completely - see my discussion
with Zhang over patch 3), which would also be a welcome simplification.
These benefits alone are IMO a good enough reason to have the functionality
:), even without COW, atomic writes and other fancy stuff.

I don't see how you want to get rid of data=journal mode - perhaps that's
related to the COW functionality?

Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR