Message-ID: <f59ef632-0d11-4ae7-bdad-d552fe1f1d78@amd.com>
Date: Thu, 26 Jun 2025 16:59:36 +0530
From: "D, Suneeth" <Suneeth.D@....com>
To: Zhang Yi <yi.zhang@...weicloud.com>, <linux-ext4@...r.kernel.org>
CC: <linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<willy@...radead.org>, <tytso@....edu>, <adilger.kernel@...ger.ca>,
	<jack@...e.cz>, <yi.zhang@...wei.com>, <libaokun1@...wei.com>,
	<yukuai3@...wei.com>, <yangerkun@...wei.com>
Subject: Re: [PATCH v2 8/8] ext4: enable large folio for regular file


Hello Zhang Yi,

On 5/12/2025 12:03 PM, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@...wei.com>
> 
> Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports
> large folios for regular files. Enable this feature by default. However,
> since we cannot change the folio order limitation of mappings on active
> inodes, setting the journal=data mode via ioctl on an active inode will
> not take immediate effect in non-delalloc mode.
> 

We run lmbench3 as part of our weekly CI for kernel performance
regression testing between a stable kernel and an rc kernel. We noticed
a regression of 8-12% on kernels from 6.16-rc1 through 6.16-rc3.
Bisecting between 6.15 and 6.16-rc1 pointed to
7ac67301e82f02b77a5c8e7377a1f414ef108b84 as the first bad commit. The
machine configuration and test parameters were as follows:

Model name:           AMD EPYC 9754 128-Core Processor [Bergamo]
Thread(s) per core:   2
Core(s) per socket:   128
Socket(s):            1
Total online memory:  258G

micro-benchmark_variant: "lmbench3-development-1-0-MMAP-50%", which has
the following parameters:

-> nr_thread:   1
-> memory_size: 50%
-> mode:        development
-> test:        MMAP

The bisection results are as follows (the KPI is
lmbench3.MMAP.read.latency.us):

v6.15                                                  97.3K
v6.16-rc1                                              107.5K
v6.16-rc3                                              107.4K
6.15.0-rc4badcommit                                    103.5K
6.15.0-rc4badcommit_m1 (one commit before bad-commit)  94.2K

I also ran the micro-benchmark under perf record (built from
tools/perf). The following is the perf diff output between the bad
commit and the commit just before it.

# ./perf diff perf.data.old  perf.data
No kallsyms or vmlinux with build-id da8042fb274c5e3524318e5e3afbeeef5df2055e was found
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object            Symbol
# ........  .........  .......................  ..............................
#
                +4.34%  [kernel.kallsyms]        [k] __lruvec_stat_mod_folio
                +3.41%  [kernel.kallsyms]        [k] unmap_page_range
                +3.33%  [kernel.kallsyms]        [k] __mod_memcg_lruvec_state
                +2.04%  [kernel.kallsyms]        [k] srso_alias_return_thunk
                +2.02%  [kernel.kallsyms]        [k] srso_alias_safe_ret
     22.22%     -1.78%  bw_mmap_rd               [.] bread
                +1.76%  [kernel.kallsyms]        [k] __handle_mm_fault
                +1.70%  [kernel.kallsyms]        [k] filemap_map_pages
                +1.58%  [kernel.kallsyms]        [k] set_pte_range
                +1.58%  [kernel.kallsyms]        [k] next_uptodate_folio
                +1.33%  [kernel.kallsyms]        [k] do_anonymous_page
                +1.01%  [kernel.kallsyms]        [k] get_page_from_freelist
                +0.98%  [kernel.kallsyms]        [k] __mem_cgroup_charge
                +0.85%  [kernel.kallsyms]        [k] asm_exc_page_fault
                +0.82%  [kernel.kallsyms]        [k] native_irq_return_iret
                +0.82%  [kernel.kallsyms]        [k] do_user_addr_fault
                +0.77%  [kernel.kallsyms]        [k] clear_page_erms
                +0.75%  [kernel.kallsyms]        [k] handle_mm_fault
                +0.73%  [kernel.kallsyms]        [k] set_ptes.isra.0
                +0.70%  [kernel.kallsyms]        [k] lru_add
                +0.69%  [kernel.kallsyms]        [k] folio_add_file_rmap_ptes
                +0.68%  [kernel.kallsyms]        [k] folio_remove_rmap_ptes
     12.45%     -0.65%  line                     [.] mem_benchmark_0
                +0.64%  [kernel.kallsyms]        [k] __alloc_frozen_pages_noprof
                +0.63%  [kernel.kallsyms]        [k] vm_normal_page
                +0.63%  [kernel.kallsyms]        [k] free_pages_and_swap_cache
                +0.63%  [kernel.kallsyms]        [k] lock_vma_under_rcu
                +0.60%  [kernel.kallsyms]        [k] __rcu_read_unlock
                +0.59%  [kernel.kallsyms]        [k] cgroup_rstat_updated
                +0.57%  [kernel.kallsyms]        [k] get_mem_cgroup_from_mm
                +0.52%  [kernel.kallsyms]        [k] __mod_lruvec_state
                +0.51%  [kernel.kallsyms]        [k] exc_page_fault

> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> ---
>   fs/ext4/ext4.h      |  1 +
>   fs/ext4/ext4_jbd2.c |  3 ++-
>   fs/ext4/ialloc.c    |  3 +++
>   fs/ext4/inode.c     | 20 ++++++++++++++++++++
>   4 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 5a20e9cd7184..2fad90c30493 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2993,6 +2993,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>   				     struct buffer_head *bh));
>   int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>   				struct buffer_head *bh);
> +bool ext4_should_enable_large_folio(struct inode *inode);
>   #define FALL_BACK_TO_NONDELALLOC 1
>   #define CONVERT_INLINE_DATA	 2
>   
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 135e278c832e..b3e9b7bd7978 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -16,7 +16,8 @@ int ext4_inode_journal_mode(struct inode *inode)
>   	    ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
>   	    test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
>   	    (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
> -	    !test_opt(inode->i_sb, DELALLOC))) {
> +	    !test_opt(inode->i_sb, DELALLOC) &&
> +	    !mapping_large_folio_support(inode->i_mapping))) {
>   		/* We do not support data journalling for encrypted data */
>   		if (S_ISREG(inode->i_mode) && IS_ENCRYPTED(inode))
>   			return EXT4_INODE_ORDERED_DATA_MODE;  /* ordered */
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index e7ecc7c8a729..4938e78cbadc 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1336,6 +1336,9 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
>   		}
>   	}
>   
> +	if (ext4_should_enable_large_folio(inode))
> +		mapping_set_large_folios(inode->i_mapping);
> +
>   	ext4_update_inode_fsync_trans(handle, inode, 1);
>   
>   	err = ext4_mark_inode_dirty(handle, inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 29eccdf8315a..7fd3921cfe46 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4774,6 +4774,23 @@ static int check_igot_inode(struct inode *inode, ext4_iget_flags flags,
>   	return -EFSCORRUPTED;
>   }
>   
> +bool ext4_should_enable_large_folio(struct inode *inode)
> +{
> +	struct super_block *sb = inode->i_sb;
> +
> +	if (!S_ISREG(inode->i_mode))
> +		return false;
> +	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
> +	    ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> +		return false;
> +	if (ext4_has_feature_verity(sb))
> +		return false;
> +	if (ext4_has_feature_encrypt(sb))
> +		return false;
> +
> +	return true;
> +}
> +
>   struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>   			  ext4_iget_flags flags, const char *function,
>   			  unsigned int line)
> @@ -5096,6 +5113,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>   		ret = -EFSCORRUPTED;
>   		goto bad_inode;
>   	}
> +	if (ext4_should_enable_large_folio(inode))
> +		mapping_set_large_folios(inode->i_mapping);
> +
>   	ret = check_igot_inode(inode, flags, function, line);
>   	/*
>   	 * -ESTALE here means there is nothing inherently wrong with the inode,

---
Thanks and Regards,
Suneeth D
View attachment "lmbench_steps.txt" of type "text/plain" (1078 bytes)
