Message-ID: <94de227e-23c1-4089-b99c-e8fc0beae5da@huaweicloud.com>
Date: Thu, 26 Jun 2025 21:26:41 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: "D, Suneeth" <Suneeth.D@....com>, linux-ext4@...r.kernel.org
Cc: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 willy@...radead.org, tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz,
 yi.zhang@...wei.com, libaokun1@...wei.com, yukuai3@...wei.com,
 yangerkun@...wei.com
Subject: Re: [PATCH v2 8/8] ext4: enable large folio for regular file

Hello Suneeth D!

On 2025/6/26 19:29, D, Suneeth wrote:
> 
> Hello Zhang Yi,
> 
> On 5/12/2025 12:03 PM, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports
>> large folios for regular files. Enable this feature by default. However,
>> since we cannot change the folio order limitation of mappings on active
>> inodes, setting the data=journal mode via ioctl on an active inode will
>> not take immediate effect in non-delalloc mode.
>>
> 
> We run lmbench3 as part of our weekly CI to test for kernel performance regressions between a stable kernel and an rc kernel. We noticed a regression of 8-12% on kernels from 6.16-rc1 through 6.16-rc3. Bisecting between 6.15 and 6.16-rc1 identified 7ac67301e82f02b77a5c8e7377a1f414ef108b84 as the first bad commit. The machine configuration and test parameters were as follows:
> 
> Model name:           AMD EPYC 9754 128-Core Processor [Bergamo]
> Thread(s) per core:   2
> Core(s) per socket:   128
> Socket(s):            1
> Total online memory:  258G
> 
> micro-benchmark_variant: "lmbench3-development-1-0-MMAP-50%" which has the following parameters,
> 
> -> nr_thread:     1
> -> memory_size: 50%
> -> mode:     development
> -> test:        MMAP
> 
> The following are the stats after bisection:-
> 
> (the KPI used here is lmbench3.MMAP.read.latency.us)
> 
> Kernel                                                   lmbench3.MMAP.read.latency.us
> v6.15                                                    97.3K
> v6.16-rc1                                                107.5K
> v6.16-rc3                                                107.4K
> 6.15.0-rc4badcommit                                      103.5K
> 6.15.0-rc4badcommit_m1 (one commit before bad-commit)    94.2K
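
As a quick sanity check on the figures quoted above, the relative slowdown can
be computed directly from the reported latencies. This is only a small sketch:
the helper names pct_change() and report() are made up for illustration, and
the inputs are the reported values in thousands of microseconds.

```c
#include <stdio.h>

/*
 * Relative change from baseline to candidate, in percent.
 * For a latency metric, a positive result means the candidate is slower.
 */
double pct_change(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* Print one comparison line; the label is a free-form string. */
void report(const char *label, double baseline, double candidate)
{
    printf("%-32s %+.1f%%\n", label, pct_change(baseline, candidate));
}
```

Calling report("v6.15 -> v6.16-rc1", 97.3, 107.5) gives about +10.5%, and
report("badcommit_m1 -> badcommit", 94.2, 103.5) gives about +9.9%, both
within the 8-12% range mentioned in the report.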

Thanks for the report. I will try to reproduce this performance regression on
my machine and find out what caused it.

Thanks,
Yi.

> 
> I also ran the micro-benchmark under tools/testing/perf record. The following is the output of tools/testing/perf diff between the bad commit and the commit just before it.
> 
> # ./perf diff perf.data.old  perf.data
> No kallsyms or vmlinux with build-id da8042fb274c5e3524318e5e3afbeeef5df2055e was found
> # Event 'cycles:P'
> #
> # Baseline  Delta Abs  Shared Object            Symbol
> # ........  .........  .......................  ...............................
> #
>                +4.34%  [kernel.kallsyms]        [k] __lruvec_stat_mod_folio
>                +3.41%  [kernel.kallsyms]        [k] unmap_page_range
>                +3.33%  [kernel.kallsyms]        [k] __mod_memcg_lruvec_state
>                +2.04%  [kernel.kallsyms]        [k] srso_alias_return_thunk
>                +2.02%  [kernel.kallsyms]        [k] srso_alias_safe_ret
>     22.22%     -1.78%  bw_mmap_rd               [.] bread
>                +1.76%  [kernel.kallsyms]        [k] __handle_mm_fault
>                +1.70%  [kernel.kallsyms]        [k] filemap_map_pages
>                +1.58%  [kernel.kallsyms]        [k] set_pte_range
>                +1.58%  [kernel.kallsyms]        [k] next_uptodate_folio
>                +1.33%  [kernel.kallsyms]        [k] do_anonymous_page
>                +1.01%  [kernel.kallsyms]        [k] get_page_from_freelist
>                +0.98%  [kernel.kallsyms]        [k] __mem_cgroup_charge
>                +0.85%  [kernel.kallsyms]        [k] asm_exc_page_fault
>                +0.82%  [kernel.kallsyms]        [k] native_irq_return_iret
>                +0.82%  [kernel.kallsyms]        [k] do_user_addr_fault
>                +0.77%  [kernel.kallsyms]        [k] clear_page_erms
>                +0.75%  [kernel.kallsyms]        [k] handle_mm_fault
>                +0.73%  [kernel.kallsyms]        [k] set_ptes.isra.0
>                +0.70%  [kernel.kallsyms]        [k] lru_add
>                +0.69%  [kernel.kallsyms]        [k] folio_add_file_rmap_ptes
>                +0.68%  [kernel.kallsyms]        [k] folio_remove_rmap_ptes
>     12.45%     -0.65%  line                     [.] mem_benchmark_0
>                +0.64%  [kernel.kallsyms]        [k] __alloc_frozen_pages_noprof
>                +0.63%  [kernel.kallsyms]        [k] vm_normal_page
>                +0.63%  [kernel.kallsyms]        [k] free_pages_and_swap_cache
>                +0.63%  [kernel.kallsyms]        [k] lock_vma_under_rcu
>                +0.60%  [kernel.kallsyms]        [k] __rcu_read_unlock
>                +0.59%  [kernel.kallsyms]        [k] cgroup_rstat_updated
>                +0.57%  [kernel.kallsyms]        [k] get_mem_cgroup_from_mm
>                +0.52%  [kernel.kallsyms]        [k] __mod_lruvec_state
>                +0.51%  [kernel.kallsyms]        [k] exc_page_fault
> 
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>> ---
>>   fs/ext4/ext4.h      |  1 +
>>   fs/ext4/ext4_jbd2.c |  3 ++-
>>   fs/ext4/ialloc.c    |  3 +++
>>   fs/ext4/inode.c     | 20 ++++++++++++++++++++
>>   4 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 5a20e9cd7184..2fad90c30493 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -2993,6 +2993,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>>                        struct buffer_head *bh));
>>   int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>>                   struct buffer_head *bh);
>> +bool ext4_should_enable_large_folio(struct inode *inode);
>>   #define FALL_BACK_TO_NONDELALLOC 1
>>   #define CONVERT_INLINE_DATA     2
>>
>> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
>> index 135e278c832e..b3e9b7bd7978 100644
>> --- a/fs/ext4/ext4_jbd2.c
>> +++ b/fs/ext4/ext4_jbd2.c
>> @@ -16,7 +16,8 @@ int ext4_inode_journal_mode(struct inode *inode)
>>           ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
>>           test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
>>           (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
>> -        !test_opt(inode->i_sb, DELALLOC))) {
>> +        !test_opt(inode->i_sb, DELALLOC) &&
>> +        !mapping_large_folio_support(inode->i_mapping))) {
>>           /* We do not support data journalling for encrypted data */
>>           if (S_ISREG(inode->i_mode) && IS_ENCRYPTED(inode))
>>               return EXT4_INODE_ORDERED_DATA_MODE;  /* ordered */
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index e7ecc7c8a729..4938e78cbadc 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -1336,6 +1336,9 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
>>           }
>>       }
>>
>> +    if (ext4_should_enable_large_folio(inode))
>> +        mapping_set_large_folios(inode->i_mapping);
>> +
>>       ext4_update_inode_fsync_trans(handle, inode, 1);
>>
>>       err = ext4_mark_inode_dirty(handle, inode);
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 29eccdf8315a..7fd3921cfe46 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4774,6 +4774,23 @@ static int check_igot_inode(struct inode *inode, ext4_iget_flags flags,
>>       return -EFSCORRUPTED;
>>   }
>>
>> +bool ext4_should_enable_large_folio(struct inode *inode)
>> +{
>> +    struct super_block *sb = inode->i_sb;
>> +
>> +    if (!S_ISREG(inode->i_mode))
>> +        return false;
>> +    if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
>> +        ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
>> +        return false;
>> +    if (ext4_has_feature_verity(sb))
>> +        return false;
>> +    if (ext4_has_feature_encrypt(sb))
>> +        return false;
>> +
>> +    return true;
>> +}
>> +
>>   struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>>                 ext4_iget_flags flags, const char *function,
>>                 unsigned int line)
>> @@ -5096,6 +5113,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>>           ret = -EFSCORRUPTED;
>>           goto bad_inode;
>>       }
>> +    if (ext4_should_enable_large_folio(inode))
>> +        mapping_set_large_folios(inode->i_mapping);
>> +
>>       ret = check_igot_inode(inode, flags, function, line);
>>       /*
>>        * -ESTALE here means there is nothing inherently wrong with the inode,
> 
> ---
> Thanks and Regards,
> Suneeth D
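
For readers following the quoted patch, the eligibility rule and the journal
mode interaction described in the commit message can be modelled in userspace
roughly as below. This is a sketch under stated assumptions, not kernel code:
struct inode_model and both function names are hypothetical, each field stands
in for the kernel check named in its comment, and the second function covers
only the term of the ext4_inode_journal_mode() condition that the patch
touches.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the inode and mount state the patch inspects. */
struct inode_model {
    bool is_regular;           /* S_ISREG(inode->i_mode) */
    bool mount_data_journal;   /* DATA_FLAGS == EXT4_MOUNT_JOURNAL_DATA */
    bool inode_journal_data;   /* EXT4_INODE_JOURNAL_DATA inode flag */
    bool fs_has_verity;        /* ext4_has_feature_verity(sb) */
    bool fs_has_encrypt;       /* ext4_has_feature_encrypt(sb) */
    bool delalloc;             /* DELALLOC mount option */
    bool large_folios_enabled; /* mapping_large_folio_support(mapping) */
};

/* Mirrors the checks in the quoted ext4_should_enable_large_folio(). */
bool model_should_enable_large_folio(const struct inode_model *m)
{
    if (!m->is_regular)
        return false;
    if (m->mount_data_journal || m->inode_journal_data)
        return false;
    if (m->fs_has_verity)
        return false;
    if (m->fs_has_encrypt)
        return false;
    return true;
}

/*
 * Models only the per-inode flag term from the ext4_inode_journal_mode()
 * hunk: the flag selects data journalling only without delalloc and only
 * if the mapping does not already support large folios.
 */
bool model_inode_flag_selects_data_journal(const struct inode_model *m)
{
    return m->inode_journal_data && !m->delalloc &&
           !m->large_folios_enabled;
}
```

In this model, an inode that was loaded as an ordinary regular file (so
large_folios_enabled is true) and later gets the journal data flag set still
returns false from model_inode_flag_selects_data_journal(), which matches the
commit message's note that the change does not take immediate effect on an
active inode in non-delalloc mode.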

