Message-ID: <a6225180-9983-4a0a-8898-435b014b8ebe@huaweicloud.com>
Date: Thu, 3 Jul 2025 22:13:51 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Theodore Ts'o <tytso@....edu>, "D, Suneeth" <Suneeth.D@....com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, willy@...radead.org, adilger.kernel@...ger.ca,
 jack@...e.cz, yi.zhang@...wei.com, libaokun1@...wei.com, yukuai3@...wei.com,
 yangerkun@...wei.com
Subject: Re: [PATCH v2 8/8] ext4: enable large folio for regular file

On 2025/6/26 22:56, Theodore Ts'o wrote:
> On Thu, Jun 26, 2025 at 09:26:41PM +0800, Zhang Yi wrote:
>>
>> Thanks for the report, I will try to reproduce this performance regression on
>> my machine and find out what caused this regression.
> 
> I took a quick look at this, and I *think* it's because lmbench is
> measuring the latency of mmap read's --- I'm going to guess 4k random
> page faults, but I'm not sure.  If that's the case, this may just be a
> natural result of using large folios, and the tradeoff of optimizing
> for large reads versus small page faults.
> 
> But if you could take a closer look, that would be great, thanks!
> 

After analyzing what the lmbench mmap test actually does, I found that
the regression is related to the mmap writes, not mmap reads. In other
words, the latency increases in ext4_page_mkwrite() after we enable
large folios.

The lmbench mmap test performed the following two tests:
1. mmap a range with PROT_READ|PROT_WRITE and MAP_SHARED, and then
   write one byte every 16KB sequentially.
2. mmap a range with PROT_READ and MAP_SHARED, and then read byte
   one by one sequentially.

For the mmap read test, the average page fault latency on my machine
improves from 3,634 ns to 2,005 ns. The improvement comes from large
folio support letting us avoid looping over many small folios in
page_cache_async_ra() and over individual PTEs in filemap_map_pages().

For the mmap write test, the number of page faults does not decrease
with large folios (the maximum order is 5); each page still incurs one
page fault. However, ext4_page_mkwrite() now iterates over every
buffer_head in the (larger) folio on each fault, so the time spent per
fault grows. On my machine, the latency of ext4_page_mkwrite()
increases from 958 ns to 1,596 ns.

After looking at the comments in finish_fault() and commit
43e027e414232 ("mm: memory: extend finish_fault() to support large
folio"), the reason is clear: for file-backed (non-anon-shmem)
mappings, finish_fault() still maps only one page per fault.

vm_fault_t finish_fault(struct vm_fault *vmf)
{
	...
	nr_pages = folio_nr_pages(folio);

	/*
	 * Using per-page fault to maintain the uffd semantics, and same
	 * approach also applies to non-anonymous-shmem faults to avoid
	 * inflating the RSS of the process.
	 */
	if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
	    unlikely(needs_fallback)) {
		nr_pages = 1;
	...
	set_pte_range(vmf, folio, page, nr_pages, addr);
}

I believe this regression could be resolved if finish_fault()
supported mapping multiple pages for file-based large folios, but I'm
not sure whether we are planning to implement this.

As for ext4_page_mkwrite(), it could also be optimized by reducing the
number of iterations over the folio's buffer_heads, but that would
prevent us from using the existing generic helpers and could make the
code quite messy.

Best regards,
Yi.

