linux-kernel - Re: [PATCH v3 09/12] ext4: introduce mext_move

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <021b2564-4f1d-4dd7-b98c-569668c8359a@huaweicloud.com>
Date: Sat, 11 Oct 2025 09:20:33 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 yi.zhang@...wei.com, libaokun1@...wei.com, yukuai3@...wei.com,
 yangerkun@...wei.com
Subject: Re: [PATCH v3 09/12] ext4: introduce mext_move_extent()

On 10/10/2025 9:38 PM, Jan Kara wrote:
> On Fri 10-10-25 18:33:23, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> When moving extents, the current move_extent_per_page() process can only
>> move extents of length PAGE_SIZE at a time, which is highly inefficient,
>> especially when the fragmentation of the file is not particularly
>> severe, this will result in a large number of unnecessary extent split
>> and merge operations. Moreover, since the ext4 file system now supports
>> large folios, using PAGE_SIZE as the processing unit is no longer
>> practical.
>>
>> Therefore, introduce a new move extents method, mext_move_extent(). It
>> moves one extent of the origin inode at a time, but not exceeding the
>> size of a folio. The parameters for the move are passed through the new
>> mext_data data structure, which includes the origin inode, donor inode,
>> the mapping extent of the origin inode to be moved, and the starting
>> offset of the donor inode.
>>
>> The move process is similar to move_extent_per_page() and can be
>> categorized into three types: MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, and
>> MEXT_COPY_DATA. MEXT_SKIP_EXTENT indicates that the corresponding area
>> of the donor file is a hole, meaning no actual space is allocated, so
>> the move is skipped. MEXT_MOVE_EXTENT indicates that the corresponding
>> areas of both the origin and donor files are unwritten, so no data needs
>> to be copied; only the extents are swapped. MEXT_COPY_DATA indicates
>> that the corresponding areas of both the origin and donor files contain
>> data, so data must be copied. The data copying is performed in three
>> steps: first, the data from the original location is read into the page
>> cache; then, the extents are swapped, and the page cache is rebuilt to
>> reflect the index of the physical blocks; finally, the dirty page cache
>> is marked and written back to ensure that the data is written to disk
>> before the metadata is persisted.
>>
>> One important point to note is that the folio lock and i_data_sem are
>> held only during the moving process. Therefore, before moving an extent,
>> it is necessary to check whether the sequence cookie of the area to be
>> moved has changed while holding the folio lock. If a change is detected,
>> it indicates that concurrent write-back operations may have occurred
>> during this period, and the type of the extent to be moved can no longer
>> be considered reliable. For example, it may have changed from unwritten
>> to written. In such cases, return -ESTALE, and the calling function
>> should reacquire the move extent of the original file and retry the
>> movement.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> 
> ...
> 
>> +static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
>> +{
>> +	struct inode *orig_inode = mext->orig_inode;
>> +	struct inode *donor_inode = mext->donor_inode;
>> +	struct ext4_map_blocks *orig_map = &mext->orig_map;
>> +	unsigned int blkbits = orig_inode->i_blkbits;
>> +	struct folio *folio[2] = {NULL, NULL};
>> +	loff_t from, length;
>> +	enum mext_move_type move_type = 0;
>> +	handle_t *handle;
>> +	u64 r_len = 0;
>> +	unsigned int credits;
>> +	int ret, ret2;
>> +
>> +	*m_len = 0;
>> +	credits = ext4_chunk_trans_extent(orig_inode, 0) * 2;
>> +	handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, credits);
>> +	if (IS_ERR(handle))
>> +		return PTR_ERR(handle);
>> +
>> +	ret = mext_move_begin(mext, folio, &move_type);
>> +	if (ret)
>> +		goto stop_handle;
>> +
>> +	if (move_type == MEXT_SKIP_EXTENT)
>> +		goto unlock;
>> +
>> +	/*
>> +	 * Copy the data. First, read the original inode data into the page
>> +	 * cache. Then, release the existing mapping relationships and swap
>> +	 * the extent. Finally, re-establish the new mapping relationships
>> +	 * and dirty the page cache.
>> +	 */
>> +	if (move_type == MEXT_COPY_DATA) {
>> +		from = offset_in_folio(folio[0],
>> +				((loff_t)orig_map->m_lblk) << blkbits);
>> +		length = ((loff_t)orig_map->m_len) << blkbits;
>> +
>> +		ret = mext_folio_mkuptodate(folio[0], from, from + length);
>> +		if (ret)
>> +			goto unlock;
>> +	}
>> +
>> +	if (!filemap_release_folio(folio[0], 0) ||
>> +	    !filemap_release_folio(folio[1], 0)) {
>> +		ret = -EBUSY;
>> +		goto unlock;
>> +	}
>> +
>> +	/* Move extent */
>> +	ext4_double_down_write_data_sem(orig_inode, donor_inode);
>> +	*m_len = ext4_swap_extents(handle, orig_inode, donor_inode,
>> +				   orig_map->m_lblk, mext->donor_lblk,
>> +				   orig_map->m_len, 1, &ret);
>> +	ext4_double_up_write_data_sem(orig_inode, donor_inode);
>> +
>> +	/* A short-length swap cannot occur after a successful swap extent. */
>> +	if (WARN_ON_ONCE(!ret && (*m_len != orig_map->m_len)))
>> +		ret = -EIO;
>> +
>> +	if (!(*m_len) || (move_type == MEXT_MOVE_EXTENT))
>> +		goto unlock;
>> +
>> +	/* Copy data */
>> +	length = (*m_len) << blkbits;
>> +	ret = mext_folio_mkwrite(orig_inode, folio[0], from, from + length);
>> +	if (ret)
>> +		goto repair_branches;
> 
> I think you need to be careful here and below to not overwrite 'ret' if it
> is != 0. So something like:
> 
> 	ret2 = mext_folio_mkwrite(..)
> 	if (ret2) {
> 		if (!ret)
> 			ret = ret2;
> 		goto repair_branches;
> 	}
> 
> and something similar below. Otherwise the patch looks good to me.
> 
> 								Honza

OK, although overwrite 'ret' seems fine, it's better to keep it.

Thanks,
Yi.