Message-ID: <20250515191210.GR25655@frogsfrogsfrogs>
Date: Thu, 15 May 2025 12:12:10 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Ritesh Harjani <ritesh.list@...il.com>
Cc: linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>,
	Jan Kara <jack@...e.cz>, John Garry <john.g.garry@...cle.com>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v4 5/7] ext4: Add multi-fsblock atomic write support with bigalloc

On Thu, May 15, 2025 at 11:45:51PM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@...nel.org> writes:
> 
> > On Thu, May 15, 2025 at 08:15:37PM +0530, Ritesh Harjani (IBM) wrote:
> >> EXT4 supports the bigalloc feature, which allows the FS to work in
> >> units of clusters (groups of blocks) rather than individual blocks.
> >> This patch adds atomic write support for bigalloc so that systems
> >> with bs = ps can also create an FS using -
> >>   mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>
> >>
> >> With bigalloc, ext4 can support multi-fsblock atomic writes. We will
> >> have to adjust ext4's atomic write unit max value to the cluster
> >> size. This can then support atomic writes of any size between
> >> [blocksize, clustersize]. This patch adds the required changes to
> >> enable multi-fsblock atomic write support using bigalloc in the
> >> next patch.
> >>
> >> In this patch, for block allocation, we first query the underlying
> >> region of the requested range via ext4_map_blocks(). Here are the
> >> various cases which we then handle depending upon the underlying
> >> mapping type:
> >> 1. If the underlying region for the entire requested range is a
> >>    mapped extent, then we don't call ext4_map_blocks() to allocate
> >>    anything. We don't even need to start the jbd2 txn in this case.
> >> 2. For an append write case, we create a mapped extent.
> >> 3. If the underlying region is entirely a hole, then we create an
> >>    unwritten extent for the requested range.
> >> 4. If the underlying region is a large unwritten extent, then we
> >>    split the extent into 2 unwritten extents of the required size.
> >> 5. If the underlying region has any type of mixed mapping, then we
> >>    call ext4_map_blocks() in a loop to zero out the unwritten and
> >>    hole regions within the requested range. This then provides a
> >>    single mapped extent type mapping for the requested range.
> >>
> >> Note: We invoke ext4_map_blocks() in a loop with the
> >> EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping
> >> of the requested range is not entirely a hole, an unwritten extent,
> >> or a fully mapped extent. That is, if the underlying region contains
> >> a mix of hole(s), unwritten extent(s), and mapped extent(s), we use
> >> this loop to ensure that all the short mappings are zeroed out. This
> >> guarantees that the entire requested range becomes a single,
> >> uniformly mapped extent. It is ok to do so because we know this is
> >> being done on a bigalloc-enabled filesystem where the block bitmap
> >> represents the entire cluster unit.
> >>
> >> Note that having a single contiguous underlying region of type
> >> mapped, unwritten, or hole is not a problem. The reason to avoid
> >> writing on top of a mixed mapping region is that atomic writes
> >> require all or nothing to be written for the userspace pwritev2
> >> request. So if a crash or a sudden poweroff occurs at any point
> >> during the write, the region undergoing the atomic write should
> >> read either complete old data or complete new data, but never a
> >> mix of both.
> >>
> >> So, we first convert any mixed mapping region to a single
> >> contiguous mapped extent before any data gets written to it. This
> >> is because normally the FS will only convert unwritten extents to
> >> written at the end of the write, in the ->end_io() call. If we
> >> allowed writes over a mixed mapping and a sudden power off happened
> >> in between, we would end up reading a mix of new data (over mapped
> >> extents) and old data (over unwritten extents), because the
> >> unwritten-to-written conversion never went through. So to avoid
> >> this, and to avoid writes getting torn due to a mixed mapping, we
> >> first allocate a single contiguous block mapping and then do the
> >> write.
> >>
> >> Co-developed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> >> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
> >> ---
> >>  fs/ext4/ext4.h    |   2 +
> >>  fs/ext4/extents.c |  87 ++++++++++++++++++++
> >>  fs/ext4/file.c    |   7 +-
> >>  fs/ext4/inode.c   | 200 +++++++++++++++++++++++++++++++++++++++++++++-
> >>  4 files changed, 291 insertions(+), 5 deletions(-)
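(An aside for anyone following along from userspace: the all-or-nothing
semantics described above are what pwritev2() with RWF_ATOMIC requests.
Here's a minimal sketch, assuming a 6.11+ kernel whose uapi headers
expose RWF_ATOMIC; the path and the 16k size -- chosen to match the
-C 16384 mkfs example above -- are made up, and real code should query
the advertised limits rather than hardcode them.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* from include/uapi/linux/fs.h, for older libc headers */
#endif

int main(void)
{
	/* ext4 atomic writes go through direct I/O, hence O_DIRECT. */
	int fd = open("/mnt/testfile", O_RDWR | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	/* The buffer, offset, and length must be suitably aligned. */
	void *buf;
	if (posix_memalign(&buf, 16384, 16384)) return 1;
	memset(buf, 0xab, 16384);

	struct iovec iov = { .iov_base = buf, .iov_len = 16384 };
	/* RWF_ATOMIC: the 16k lands on disk in full, or not at all. */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret < 0)
		perror("pwritev2(RWF_ATOMIC)");
	close(fd);
	return ret == 16384 ? 0 : 1;
}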
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index ef6cac6b4b4c..8eb1f332ee7d 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3728,6 +3728,8 @@ extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
> >>  			   loff_t len);
> >>  extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
> >>  					  loff_t offset, ssize_t len);
> >> +extern int ext4_convert_unwritten_extents_atomic(handle_t *handle,
> >> +			struct inode *inode, loff_t offset, ssize_t len);
> >>  extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
> >>  					     ext4_io_end_t *io_end);
> >>  extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
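(Related aside: instead of hardcoding sizes as in the earlier sketch,
userspace can discover the limits this series advertises -- the atomic
write unit max becomes the cluster size -- via statx(). A sketch,
assuming a 6.11+ uapi with STATX_WRITE_ATOMIC and a libc whose
struct statx already carries the atomic-write fields; the path is
hypothetical:)

#define _GNU_SOURCE
#include <fcntl.h>	/* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>	/* statx(), glibc >= 2.28 */

#ifndef STATX_WRITE_ATOMIC
#define STATX_WRITE_ATOMIC 0x00010000U	/* from include/uapi/linux/stat.h */
#endif

int main(void)
{
	struct statx stx;

	if (statx(AT_FDCWD, "/mnt/testfile", 0, STATX_WRITE_ATOMIC, &stx)) {
		perror("statx");
		return 1;
	}
	printf("atomic write unit min/max: %u/%u, max segments: %u\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max,
	       stx.stx_atomic_write_segments_max);
	return 0;
}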
> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> >> index fa850f188d46..2967c74dabaf 100644
> >> --- a/fs/ext4/extents.c
> >> +++ b/fs/ext4/extents.c
> >> @@ -4792,6 +4792,93 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >>  	return ret;
> >>  }
> >>  
> >> +/*
> >> + * This function converts a range of blocks to written extents. The caller
> >> + * of this function will pass the start offset and the size. All unwritten
> >> + * extents within this range will be converted to written extents.
> >> + *
> >> + * This function is called from the direct IO end io callback function for
> >> + * atomic writes, to convert the unwritten extents after IO is completed.
> >> + *
> >> + * Note that the requirement for atomic writes is that all conversion
> >> + * should happen atomically in a single fs journal transaction. We mainly
> >> + * only allocate unwritten extents either on a hole or on a pre-existing
> >> + * unwritten extent range in ext4_map_blocks_atomic_write(). The only case
> >> + * where we can have multiple unwritten extents in a range
> >> + * [offset, offset+len) is when there is a split unwritten extent between
> >> + * two leaf nodes which was cached in the extent status cache during
> >> + * ext4_iomap_alloc() time. That will allow
> >> + * ext4_map_blocks_atomic_write() to return the unwritten extent range w/o
> >> + * going into the slow path. That means we might need a loop for
> >> + * conversion of this unwritten extent split across leaf blocks within a
> >> + * single journal transaction.
> >> + * Extents split across leaf nodes are a rare case, but let's still handle
> >> + * that to meet the requirements of multi-fsblock atomic writes.
> >> + *
> >> + * Returns 0 on success.
> >> + */
> >> +int ext4_convert_unwritten_extents_atomic(handle_t *handle, struct inode *inode,
> >> +					  loff_t offset, ssize_t len)
> >> +{
> >> +	unsigned int max_blocks;
> >> +	int ret = 0, ret2 = 0, ret3 = 0;
> >> +	struct ext4_map_blocks map;
> >> +	unsigned int blkbits = inode->i_blkbits;
> >> +	unsigned int credits = 0;
> >> +	int flags = EXT4_GET_BLOCKS_IO_CONVERT_EXT;
> >> +
> >> +	map.m_lblk = offset >> blkbits;
> >> +	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
> >> +
> >> +	if (!handle) {
> >> +		/*
> >> +		 * TODO: An optimization can be added later by having an extent
> >> +		 * status flag e.g. EXTENT_STATUS_SPLIT_LEAF. If we query that
> >> +		 * it can tell if the extent in the cache is a split extent.
> >> +		 * But for now let's assume pextents as 2 always.
> >> +		 */
> >> +		credits = ext4_meta_trans_blocks(inode, max_blocks, 2);
> >> +	}
> >> +
> >> +	if (credits) {
> >> +		handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
> >> +		if (IS_ERR(handle)) {
> >> +			ret = PTR_ERR(handle);
> >> +			return ret;
> >> +		}
> >> +	}
> >> +
> >> +	while (ret >= 0 && ret < max_blocks) {
> >> +		map.m_lblk += ret;
> >> +		map.m_len = (max_blocks -= ret);
> >> +		ret = ext4_map_blocks(handle, inode, &map, flags);
> >> +		if (ret != max_blocks)
> >> +			ext4_msg(inode->i_sb, KERN_INFO,
> >> +				 "inode #%lu: block %u: len %u: "
> >> +				 "split block mapping found for atomic write, "
> >> +				 "ret = %d",
> >> +				 inode->i_ino, map.m_lblk,
> >> +				 map.m_len, ret);
> >> +		if (ret <= 0)
> >> +			break;
> >> +	}
> >> +
> >> +	ret2 = ext4_mark_inode_dirty(handle, inode);
> >> +
> >> +	if (credits) {
> >> +		ret3 = ext4_journal_stop(handle);
> >> +		if (unlikely(ret3))
> >> +			ret2 = ret3;
> >> +	}
> >> +
> >> +	if (ret <= 0 || ret2)
> >> +		ext4_warning(inode->i_sb,
> >> +			     "inode #%lu: block %u: len %u: "
> >> +			     "returned %d or %d",
> >> +			     inode->i_ino, map.m_lblk,
> >> +			     map.m_len, ret, ret2);
> >> +
> >> +	return ret > 0 ? ret2 : ret;
> >> +}
> >> +
> >>  /*
> >>   * This function convert a range of blocks to written extents
> >>   * The caller of this function will pass the start offset and the size.
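(Context for the file.c hunk below: ext4's direct I/O completion is
driven by iomap, which calls back into ext4_dio_write_end_io() through
a wiring that looks roughly like the following -- quoted from memory of
fs/ext4/file.c, so treat it as approximate:

static const struct iomap_dio_ops ext4_dio_write_ops = {
	.end_io = ext4_dio_write_end_io,
};

iomap sets IOMAP_DIO_UNWRITTEN in @flags when the write targeted
unwritten extents, which is what the new branch keys off together with
IOCB_ATOMIC.)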
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index beb078ee4811..959328072c15 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -377,7 +377,12 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> >>  	loff_t pos = iocb->ki_pos;
> >>  	struct inode *inode = file_inode(iocb->ki_filp);
> >>  
> >> -	if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
> >> +
> >> +	if (!error && size && (flags & IOMAP_DIO_UNWRITTEN) &&
> >> +	    (iocb->ki_flags & IOCB_ATOMIC))
> >> +		error = ext4_convert_unwritten_extents_atomic(NULL, inode, pos,
> >> +							      size);
> >> +	else if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
> >>  		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> >>  	if (error)
> >>  		return error;
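(Before the allocation changes below, a compact standalone model of the
five commit-message cases may help. This is illustration only -- a
userspace toy, not the kernel code, and every name in it is invented:)

#include <stdbool.h>
#include <stdio.h>

enum blk_state { HOLE, UNWRITTEN, MAPPED };

/* Decide how the atomic-write allocation path treats a range. */
static const char *atomic_alloc_action(const enum blk_state *blk, int n)
{
	bool uniform = true;
	int i;

	for (i = 1; i < n; i++)
		if (blk[i] != blk[0])
			uniform = false;

	if (!uniform)		/* case 5: mixed mapping */
		return "mixed: loop ext4_map_blocks() w/ ZERO to build one mapped extent (slow path)";
	if (blk[0] == MAPPED)	/* case 1: fully mapped */
		return "mapped: no allocation, no jbd2 txn needed";
	if (blk[0] == HOLE)	/* case 3 (case 2, append, also allocates) */
		return "hole: allocate one unwritten extent for the range";
	/* case 4: one large unwritten extent */
	return "unwritten: split out an unwritten extent of the required size";
}

int main(void)
{
	enum blk_state mixed[] = { MAPPED, HOLE, UNWRITTEN, MAPPED };
	enum blk_state hole[]  = { HOLE, HOLE, HOLE, HOLE };

	printf("%s\n", atomic_alloc_action(mixed, 4));
	printf("%s\n", atomic_alloc_action(hole, 4));
	return 0;
}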
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 8b86b1a29bdc..13bc9f07ae7f 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3412,12 +3412,149 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >>  	}
> >>  }
> >>  
> >> +static int ext4_map_blocks_atomic_write_slow(handle_t *handle,
> >> +			struct inode *inode, struct ext4_map_blocks *map)
> >> +{
> >> +	ext4_lblk_t m_lblk = map->m_lblk;
> >> +	unsigned int m_len = map->m_len;
> >> +	unsigned int mapped_len = 0, m_flags = 0;
> >> +	ext4_fsblk_t next_pblk;
> >> +	bool check_next_pblk = false;
> >> +	int ret = 0;
> >> +
> >> +	WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb));
> >> +
> >> +	/*
> >> +	 * This is a slow path in case of mixed mapping. We use the
> >> +	 * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single
> >> +	 * contiguous mapped mapping. This will ensure any unwritten or hole
> >> +	 * regions within the requested range are zeroed out and we return
> >> +	 * a single contiguous mapped extent.
> >> +	 */
> >> +	m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> >> +
> >> +	do {
> >> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
> >> +		if (ret < 0 && ret != -ENOSPC)
> >> +			goto out_err;
> >> +		/*
> >> +		 * This should never happen, but let's return an error code to
> >> +		 * avoid an infinite loop in here.
> >> +		 */
> >> +		if (ret == 0) {
> >> +			ret = -EFSCORRUPTED;
> >> +			ext4_warning_inode(inode,
> >> +				"ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d",
> >> +				m_flags, ret);
> >> +			goto out_err;
> >> +		}
> >> +		/*
> >> +		 * With bigalloc we should never get ENOSPC nor discontiguous
> >> +		 * physical extents.
> >> +		 */
> >> +		if ((check_next_pblk && next_pblk != map->m_pblk) ||
> >> +		    ret == -ENOSPC) {
> >> +			ext4_warning_inode(inode,
> >> +				"Non-contiguous allocation detected: expected %llu, got %llu, "
> >> +				"or ext4_map_blocks() returned out of space ret: %d",
> >> +				next_pblk, map->m_pblk, ret);
> >> +			ret = -EFSCORRUPTED;
> >> +			goto out_err;
> >> +		}
> >> +		next_pblk = map->m_pblk + map->m_len;
> >> +		check_next_pblk = true;
> >> +
> >> +		mapped_len += map->m_len;
> >> +		map->m_lblk += map->m_len;
> >> +		map->m_len = m_len - mapped_len;
> >> +	} while (mapped_len < m_len);
> >> +
> >> +	/*
> >> +	 * We might have done some work in the above loop, so we need to query
> >> +	 * the start of the physical extent, based on the original m_lblk and
> >> +	 * m_len. Let's also ensure we were able to allocate the required
> >> +	 * range for the mixed mapping case.
> >> +	 */
> >> +	map->m_lblk = m_lblk;
> >> +	map->m_len = m_len;
> >> +	map->m_flags = 0;
> >> +
> >> +	ret = ext4_map_blocks(handle, inode, map,
> >> +			      EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF);
> >> +	if (ret != m_len) {
> >> +		ext4_warning_inode(inode,
> >> +			"allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n",
> >> +			m_lblk, m_len, ret);
> >> +		ret = -EINVAL;
> > 
> > When does this produce a short mapping?  In theory the cluster's
> > already allocated, right?  So this is (AFAICT) a handler for a "should
> > never happen" corner case, right?
> 
> Yes, you are right. This will ideally never happen.
> 
> >> +	}
> >> +	return ret;
> >> +
> >> +out_err:
> >> +	/* reset map before returning an error */
> >> +	map->m_lblk = m_lblk;
> >> +	map->m_len = m_len;
> >> +	map->m_flags = 0;
> >> +	return ret;
> >> +}
> >> +
> >> +/*
> >> + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested
> >> + * range in @map [lblk, lblk + len) is one single contiguous extent with
> >> + * no mixed mappings.
> >> + *
> >> + * We first use the m_flags passed to us by our caller (ext4_iomap_alloc()).
> >> + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying
> >> + * physical extent for the requested range does not have a single
> >> + * contiguous mapping type i.e. (Hole, Mapped, or Unwritten) throughout.
> >> + * In that case we will loop over the requested range to allocate and zero
> >> + * out the unwritten / holes in between, to get a single mapped extent from
> >> + * [m_lblk, m_lblk + m_len). Note that this is only possible because we
> >> + * know this can be called only on a bigalloc-enabled filesystem where the
> >> + * underlying cluster is already allocated. This avoids allocating
> >> + * discontiguous extents in the slow path due to multiple calls to
> >> + * ext4_map_blocks().
> >> + * The slow path is mostly a non-performance-critical path, so it should
> >> + * be ok to loop using ext4_map_blocks() with appropriate flags to
> >> + * allocate & zero the underlying short holes/unwritten extents within the
> >> + * requested range.
> >> + */
> >> +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode,
> >> +			struct ext4_map_blocks *map, int m_flags,
> >> +			bool *force_commit)
> >> +{
> >> +	ext4_lblk_t m_lblk = map->m_lblk;
> >> +	unsigned int m_len = map->m_len;
> >> +	int ret = 0;
> >> +
> >> +	WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb));
> >> +
> >> +	ret = ext4_map_blocks(handle, inode, map, m_flags);
> >> +	if (ret < 0 || ret == m_len)
> >> +		goto out;
> >> +	/*
> >> +	 * This is a mixed mapping case where we were not able to allocate
> >> +	 * a single contiguous extent. In that case let's reset the requested
> >> +	 * mapping and call the slow path.
> >> +	 */
> >> +	map->m_lblk = m_lblk;
> >> +	map->m_len = m_len;
> >> +	map->m_flags = 0;
> >> +
> >> +	/*
> >> +	 * The slow path means we have a mixed mapping, which means we will
> >> +	 * need to force a txn commit.
> >> +	 */
> >> +	*force_commit = true;
> >> +	return ext4_map_blocks_atomic_write_slow(handle, inode, map);
> >> +out:
> >> +	return ret;
> >> +}
> >> +
> >>  static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  			    unsigned int flags)
> >>  {
> >>  	handle_t *handle;
> >>  	u8 blkbits = inode->i_blkbits;
> >>  	int ret, dio_credits, m_flags = 0, retries = 0;
> >> +	bool force_commit = false;
> >>  
> >>  	/*
> >>  	 * Trim the mapping request to the maximum value that we can map at
> >> @@ -3425,7 +3562,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  	 */
> >>  	if (map->m_len > DIO_MAX_BLOCKS)
> >>  		map->m_len = DIO_MAX_BLOCKS;
> >> -	dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> >> +
> >> +	/*
> >> +	 * Journal credits estimation for atomic writes. We call
> >> +	 * ext4_map_blocks() to find if there could be a mixed mapping. If
> >> +	 * yes, then let's assume the no. of pextents required can be m_len,
> >> +	 * i.e. every alternate block can be unwritten and hole.
> >> +	 */
> >> +	if (flags & IOMAP_ATOMIC) {
> >> +		unsigned int orig_mlen = map->m_len;
> >> +
> >> +		ret = ext4_map_blocks(NULL, inode, map, 0);
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		if (map->m_len < orig_mlen) {
> >> +			map->m_len = orig_mlen;
> >> +			dio_credits = ext4_meta_trans_blocks(inode, orig_mlen,
> >> +							     map->m_len);
> >> +		} else {
> >> +			dio_credits = ext4_chunk_trans_blocks(inode,
> >> +							      map->m_len);
> >> +		}
> >> +	} else {
> >> +		dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> >> +	}
> >>  
> >>  retry:
> >>  	/*
> >> @@ -3456,7 +3616,11 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> >>  
> >> -	ret = ext4_map_blocks(handle, inode, map, m_flags);
> >> +	if (flags & IOMAP_ATOMIC)
> >> +		ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags,
> >> +						   &force_commit);
> >> +	else
> >> +		ret = ext4_map_blocks(handle, inode, map, m_flags);
> >>  
> >>  	/*
> >>  	 * We cannot fill holes in indirect tree based inodes as that could
> >> @@ -3470,6 +3634,14 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> >>  	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >>  		goto retry;
> >>  
> >> +	if (ret > 0 && force_commit) {
> >> +		int ret2;
> >> +
> >> +		ret2 = ext4_force_commit(inode->i_sb);
> >> +		if (ret2)
> >> +			ret = ret2;
> > 
> > Nit: This could return ret2 directly instead of assigning it to ret and
> > letting it fall out.
> 
> Ok. Will return ret2 directly here.
> 
> > But my bigger complaint is that you ought to leave a comment here along
> > the lines of:
> 
> Yes, I should have added a comment.
> 
> > /*
> >  * Someone forced us to commit the journal ahead of an
> >  * IO operation so that the ondisk mapping state is
> >  * consistent with the contents of the file data blocks.
> >  * The commit failed, so we abort the whole IO.
> >  */
> > 
> > so it's obvious why we got a mapping but are erroring out anyway.
> 
> /*
>  * Force commit the current transaction if the allocation spans a mixed
>  * mapping range. This ensures any pending metadata updates (like
>  * unwritten to written extents conversion) in this range are in a
>  * consistent state with the file data blocks, before performing the
>  * actual write I/O. If the commit fails, the whole I/O must be aborted
>  * to prevent any possible torn writes.
>  */
> 
> I am thinking I will add the above ^^^ and something similar to the
> section in Documentation where we talk about how mixed mappings are
> handled.
> 
> > If the answers to my questions are all 'yes' and the extra comment
> > gets added, then
> > 
> > Acked-by: "Darrick J. Wong" <djwong@...nel.org>
> > 
> > --D
> 
> Thanks for the review! Since the above answers are all yes, and I will
> add the comment I just mentioned above, I will take your Acked-by too
> and soon send a v5.
> 
> Please let me know otherwise.

That sounds ok to me.  I'm glad this is finally getting done. :)

--D

> Thanks a lot for reviewing!
> 
> -ritesh
> 
> >> +	}
> >> +
> >>  	return ret;
> >>  }
> >>  
> >> @@ -3480,6 +3652,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  	int ret;
> >>  	struct ext4_map_blocks map;
> >>  	u8 blkbits = inode->i_blkbits;
> >> +	unsigned int orig_mlen;
> >>  
> >>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> >>  		return -EINVAL;
> >> @@ -3493,6 +3666,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  	map.m_lblk = offset >> blkbits;
> >>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> >> +	orig_mlen = map.m_len;
> >>  
> >>  	if (flags & IOMAP_WRITE) {
> >>  		/*
> >> @@ -3503,8 +3677,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  		 */
> >>  		if (offset + length <= i_size_read(inode)) {
> >>  			ret = ext4_map_blocks(NULL, inode, &map, 0);
> >> -			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
> >> -				goto out;
> >> +			/*
> >> +			 * For atomic writes the entire requested length
> >> +			 * should be mapped.
> >> +			 */
> >> +			if (map.m_flags & EXT4_MAP_MAPPED) {
> >> +				if ((!(flags & IOMAP_ATOMIC) && ret > 0) ||
> >> +				    (flags & IOMAP_ATOMIC && ret >= orig_mlen))
> >> +					goto out;
> >> +			}
> >> +			map.m_len = orig_mlen;
> >>  		}
> >>  		ret = ext4_iomap_alloc(inode, &map, flags);
> >>  	} else {
> >> @@ -3525,6 +3707,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> >>  	 */
> >>  	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
> >>  
> >> +	/*
> >> +	 * Before returning to iomap, let's ensure the allocated mapping
> >> +	 * covers the entire requested length for atomic writes.
> >> +	 */
> >> +	if (flags & IOMAP_ATOMIC) {
> >> +		if (map.m_len < (length >> blkbits)) {
> >> +			WARN_ON(1);
> >> +			return -EINVAL;
> >> +		}
> >> +	}
> >>  	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >>  
> >>  	return 0;
> >> -- 
> >> 2.49.0
> >> 
> >> 