Message-ID: <86a82f42-918c-45f8-ac49-2b1f341ee0d3@arm.com>
Date: Mon, 14 Jul 2025 09:16:02 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>,
 Youling Tang <youling.tang@...ux.dev>, Matthew Wilcox <willy@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>
Cc: Jan Kara <jack@...e.cz>, linux-fsdevel@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, chizhiling@....com,
 Youling Tang <tangyouling@...inos.cn>, Chi Zhiling <chizhiling@...inos.cn>
Subject: Re: [PATCH] mm/filemap: Align last_index to folio size

On 11/07/2025 17:08, David Hildenbrand wrote:
> CCing Ryan, who recently fiddled with readahead.
> 
> 
> On 11.07.25 07:55, Youling Tang wrote:
>> From: Youling Tang <tangyouling@...inos.cn>
>>
>> On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE
>> enabled, we observed the following readahead behavior:
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=64k count=1
>>   # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
>>   foffset    offset    flags
>>   0    136d4c    __RU_l_________H______t_________________F_1
>>   1    136d4d    __RU_l__________T_____t_________________F_1
>>   2    136d4e    __RU_l__________T_____t_________________F_1
>>   3    136d4f    __RU_l__________T_____t_________________F_1
>>   ...
>>   c    136bb8    __RU_l_________H______t_________________F_1
>>   d    136bb9    __RU_l__________T_____t_________________F_1
>>   e    136bba    __RU_l__________T_____t_________________F_1
>>   f    136bbb    __RU_l__________T_____t_________________F_1   <-- first read
>>   10    13c2cc    ___U_l_________H______t______________I__F_1   <-- readahead flag
>>   11    13c2cd    ___U_l__________T_____t______________I__F_1
>>   12    13c2ce    ___U_l__________T_____t______________I__F_1
>>   13    13c2cf    ___U_l__________T_____t______________I__F_1
>>   ...
>>   1c    1405d4    ___U_l_________H______t_________________F_1
>>   1d    1405d5    ___U_l__________T_____t_________________F_1
>>   1e    1405d6    ___U_l__________T_____t_________________F_1
>>   1f    1405d7    ___U_l__________T_____t_________________F_1
>>   [ra_size = 32, req_count = 16, async_size = 16]
>>
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=60k count=1
>>   # ./page-types -r -L -f  /mnt/xfs/test
>>   foffset    offset    flags
>>   0    136048    __RU_l_________H______t_________________F_1
>>   ...
>>   c    110a40    __RU_l_________H______t_________________F_1
>>   d    110a41    __RU_l__________T_____t_________________F_1
>>   e    110a42    __RU_l__________T_____t_________________F_1   <-- first read
>>   f    110a43    ___U_l__________T_____t_________________F_1   <-- first readahead flag
>>   10    13e7a8    ___U_l_________H______t_________________F_1
>>   ...
>>   20    137a00    ___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
>>   21    137a01    ___U_l__________T_____t_______P______I__F_1
>>   ...
>>   3f    10d4af    ___U_l__________T_____t_______P_________F_1
>>   [first readahead: ra_size = 32, req_count = 15, async_size = 17]
>>
>> When reading 64k data (same for 61-63k range, where last_index is page-aligned
>> in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra()
>> and the PG_readahead flag is set on the next folio (the one containing page
>> 0x10).
>>
>> When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra().
>> However, in this case the readahead flag is set on page 0xf. Although the
>> requested read size (req_count) is 60k, the actual read will be aligned to
>> folio size (64k), which triggers the readahead flag and initiates asynchronous
>> readahead via page_cache_async_ra(). This results in two readahead operations
>> totaling 256k.
>>
>> The root cause is that when the requested size is smaller than the actual read
>> size (due to folio alignment), the extra pages read trigger asynchronous readahead. By changing
>> last_index alignment from page size to folio size, we ensure the requested size
>> matches the actual read size, preventing the case where a single read operation
>> triggers two readahead operations.

I recently fiddled with the mmap readahead paths, doing similar-ish things. I
haven't looked at the non-mmap paths, so I don't consider myself an expert here. But
what you are saying makes sense and superficially the solution looks good to me, so:

Reviewed-by: Ryan Roberts <ryan.roberts@....com>

with one nit below...

>>
>> After applying the patch:
>>   # echo 3 > /proc/sys/vm/drop_caches
>>   # dd if=test of=/dev/null bs=60k count=1
>>   # ./page-types -r -L -f  /mnt/xfs/test
>>   foffset    offset    flags
>>   0    136d4c    __RU_l_________H______t_________________F_1
>>   1    136d4d    __RU_l__________T_____t_________________F_1
>>   2    136d4e    __RU_l__________T_____t_________________F_1
>>   3    136d4f    __RU_l__________T_____t_________________F_1
>>   ...
>>   c    136bb8    __RU_l_________H______t_________________F_1
>>   d    136bb9    __RU_l__________T_____t_________________F_1
>>   e    136bba    __RU_l__________T_____t_________________F_1   <-- first read
>>   f    136bbb    __RU_l__________T_____t_________________F_1
>>   10    13c2cc    ___U_l_________H______t______________I__F_1   <-- readahead
>> flag
>>   11    13c2cd    ___U_l__________T_____t______________I__F_1
>>   12    13c2ce    ___U_l__________T_____t______________I__F_1
>>   13    13c2cf    ___U_l__________T_____t______________I__F_1
>>   ...
>>   1c    1405d4    ___U_l_________H______t_________________F_1
>>   1d    1405d5    ___U_l__________T_____t_________________F_1
>>   1e    1405d6    ___U_l__________T_____t_________________F_1
>>   1f    1405d7    ___U_l__________T_____t_________________F_1
>>   [ra_size = 32, req_count = 16, async_size = 16]
>>
>> The same phenomenon occurs for any read size from 49k to 64k; with the patch,
>> the readahead flag is set on the next folio.
>>
>> Because the minimum folio order of the address_space corresponds to the block
>> size (at least in xfs and bcachefs, which already support bs > ps), aligning
>> request_count to the block size will not cause overread.
>>
>> Co-developed-by: Chi Zhiling <chizhiling@...inos.cn>
>> Signed-off-by: Chi Zhiling <chizhiling@...inos.cn>
>> Signed-off-by: Youling Tang <tangyouling@...inos.cn>
>> ---
>>   include/linux/pagemap.h | 6 ++++++
>>   mm/filemap.c            | 5 +++--
>>   2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e63fbfbd5b0f..447bb264fd94 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -480,6 +480,12 @@ mapping_min_folio_nrpages(struct address_space *mapping)
>>       return 1UL << mapping_min_folio_order(mapping);
>>   }
>>
>> +static inline unsigned long
>> +mapping_min_folio_nrbytes(struct address_space *mapping)
>> +{
>> +    return mapping_min_folio_nrpages(mapping) << PAGE_SHIFT;
>> +}
>> +
>>   /**
>>    * mapping_align_index() - Align index for this mapping.
>>    * @mapping: The address_space.
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 765dc5ef6d5a..56a8656b6f86 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2584,8 +2584,9 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>>       unsigned int flags;
>>       int err = 0;
>>
>> -    /* "last_index" is the index of the page beyond the end of the read */
>> -    last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
>> +    /* "last_index" is the index of the folio beyond the end of the read */

pedantic nit: I think you actually mean "the index of the first page within the
first minimum-sized folio beyond the end of the read"?

Thanks,
Ryan

>> +    last_index = round_up(iocb->ki_pos + count,
>> +                          mapping_min_folio_nrbytes(mapping));
>> +    last_index >>= PAGE_SHIFT;
>>   retry:
>>       if (fatal_signal_pending(current))
>>           return -EINTR;
> 
> 

