Message-ID: <ec023298-c26d-437b-a023-b49509f83c5a@h-partners.com>
Date: Mon, 12 Jan 2026 11:30:38 +0300
From: Gladyshev Ilya <gladyshev.ilya1@...artners.com>
To: <gladyshev.ilya1@...artners.com>
CC: <guohanjun@...wei.com>, <wangkefeng.wang@...wei.com>,
	<weiyongjun1@...wei.com>, <yusongping@...wei.com>, <leijitang@...wei.com>,
	<artem.kuzin@...wei.com>, <stepanov.anatoly@...wei.com>,
	<alexander.grubnikov@...wei.com>, <gorbunov.ivan@...artners.com>,
	<akpm@...ux-foundation.org>, <david@...nel.org>,
	<lorenzo.stoakes@...cle.com>, <Liam.Howlett@...cle.com>, <vbabka@...e.cz>,
	<rppt@...nel.org>, <surenb@...gle.com>, <mhocko@...e.com>, <ziy@...dia.com>,
	<harry.yoo@...cle.com>, <willy@...radead.org>, <yuzhao@...gle.com>,
	<baolin.wang@...ux.alibaba.com>, <muchun.song@...ux.dev>,
	<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 0/2] mm: improve folio refcount scalability

Gentle ping on this proposal

> Intro
> =====
> This patchset improves small-file read performance and overall folio refcount
> scalability by refactoring page_ref_add_unless() (the core of folio_try_get()).
> It is an alternative to previous attempts to fix small-read performance by
> avoiding the refcount bumps entirely [1][2].
> 
> Overview
> ========
> The current refcount implementation uses the value zero as the locked
> (dead/frozen) state, which requires a CAS loop in the try_get functions so
> that an increment can never transiently unlock a frozen counter. These CAS
> loops become a serialization point for an otherwise scalable and fast read
> side.
> 
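For illustration, the zero-as-locked scheme described above can be sketched with
userspace C11 atomics (a simplified model, not the kernel's actual page_ref
code; barriers and helper names are elided):

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Simplified userspace sketch of the current scheme: the value zero marks a
 * frozen/dead folio, so try_get may increment only a non-zero counter. That
 * forces a CAS loop (analogous to atomic_add_unless()), which serializes
 * concurrent readers hitting the same cache line.
 */
static bool try_get_cas(atomic_int *ref)
{
	int old = atomic_load(ref);

	do {
		if (old == 0)	/* frozen: must not resurrect */
			return false;
	} while (!atomic_compare_exchange_weak(ref, &old, old + 1));

	return true;
}
```

Under contention every failed compare-exchange forces a reload and retry, which
is exactly the serialization point the cover letter describes.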
> The proposed implementation separates the "locked" logic from the counting,
> allowing the use of an optimistic fetch_add() instead of a CAS. For more
> details, please refer to the commit message of the patch itself.
> 
> The proposed logic maintains the same public API as before, including all
> existing memory-barrier guarantees.
> 
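A minimal sketch of the proposed fast path in the same userspace model: the
frozen state lives in a dedicated bit (REF_FROZEN below is an illustrative
name, not the patch's actual identifier), so the common case becomes a single
wait-free fetch_add:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative lock bit; the real patch chooses its own bit layout. */
#define REF_FROZEN (1 << 30)

/*
 * With "locked" separated from the count, try_get no longer needs a CAS
 * loop: optimistically bump the counter, then check whether the frozen bit
 * was already set. On failure the stray increment is left behind (see the
 * Drawbacks section for why that is tolerable).
 */
static bool try_get_opt(atomic_int *ref)
{
	int old = atomic_fetch_add(ref, 1);

	return !(old & REF_FROZEN);
}
```

Unlike the CAS loop, this always completes in one atomic RMW regardless of how
many readers race on the counter.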
> Drawbacks
> =========
> In theory, an optimistic fetch_add() can overflow the atomic_t and clear the
> locked state. Currently, this is mitigated by a single CAS operation after a
> "failed" fetch_add, which tries to reset the counter to a locked zero. While
> this best-effort approach offers no strong guarantee, it is unrealistic for a
> locked folio to see 2^31 highly contended try_get calls with the CAS failing
> in every single one of them.
> 
> If this guarantee isn't sufficient, it can be strengthened by falling back to
> a full CAS loop when the counter approaches overflow.
> 
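The mitigation above can be sketched as a single best-effort CAS after the
failed increment (same illustrative userspace model and REF_FROZEN name as the
cover letter's scheme suggests; not the patch's actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define REF_FROZEN (1 << 30)	/* illustrative lock bit */

/*
 * After a failed optimistic increment on a frozen counter, one CAS tries to
 * put the counter back to the locked "zero" (just REF_FROZEN). There is no
 * retry loop, so a concurrent racer can make the CAS fail, but every
 * successful reset discards all accumulated stray increments at once, which
 * keeps the counter far from overflow in practice.
 */
static bool try_get_with_reset(atomic_int *ref)
{
	int old = atomic_fetch_add(ref, 1);

	if (old & REF_FROZEN) {
		int expected = old + 1;	/* the value our fetch_add produced */

		atomic_compare_exchange_strong(ref, &expected, REF_FROZEN);
		return false;
	}
	return true;
}
```
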
> Performance
> ===========
> Performance was measured using a simple custom benchmark based on
> will-it-scale[3]. This benchmark spawns N pinned threads/processes that
> execute the following loop:
> ``
> char buf[64];
> fd = open(/* same file in tmpfs */);
> 
> while (true) {
>         pread(fd, buf, /* read size = */ 64, /* offset = */ 0);
> }
> ``
> While this is a synthetic load, it does highlight an existing issue and does
> not differ much from the benchmarking done in [2].
> 
> This benchmark measures operations per second in the inner loop and aggregates
> the results across all workers. Performance was tested on top of the v6.15
> kernel[4] on two platforms. Since threads and processes showed similar
> performance on both systems, only the thread results are given below. The
> performance improvement scales roughly linearly between the CPU counts shown.
> 
> Platform 1: 2 x E5-2690 v3, 12C/12T each [disabled SMT]
> 
> #threads | vanilla | patched | boost (%)
>         1 | 1343381 | 1344401 |  +0.1
>         2 | 2186160 | 2455837 | +12.3
>         5 | 5277092 | 6108030 | +15.7
>        10 | 5858123 | 7506328 | +28.1
>        12 | 6484445 | 8137706 | +25.5
>           /* Cross socket NUMA */
>        14 | 3145860 | 4247391 | +35.0
>        16 | 2350840 | 4262707 | +81.3
>        18 | 2378825 | 4121415 | +73.2
>        20 | 2438475 | 4683548 | +92.1
>        24 | 2325998 | 4529737 | +94.7
> 
> Platform 2: 2 x AMD EPYC 9654, 96C/192T each [enabled SMT]
> 
> #threads | vanilla | patched | boost (%)
>         1 | 1077276 | 1081653 |  +0.4
>         5 | 4286838 | 4682513 |  +9.2
>        10 | 1698095 | 1902753 | +12.1
>        20 | 1662266 | 1921603 | +15.6
>        49 | 1486745 | 1828926 | +23.0
>        97 | 1617365 | 2052635 | +26.9
>           /* Cross socket NUMA */
>       105 | 1368319 | 1798862 | +31.5
>       136 | 1008071 | 1393055 | +38.2
>       168 |  879332 | 1245210 | +41.6
>                 /* SMT */
>       193 |  905432 | 1294833 | +43.0
>       289 |  851988 | 1313110 | +54.1
>       353 |  771288 | 1347165 | +74.7
> 
> [1] https://lore.kernel.org/linux-mm/CAHk-=wj00-nGmXEkxY=-=Z_qP6kiGUziSFvxHJ9N-cLWry5zpA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/20251017141536.577466-1-kirill@shutemov.name/
> [3] https://github.com/antonblanchard/will-it-scale
> [4] There were no changes to page_ref.h between v6.15 and v6.18 or any
>      significant performance changes on the read side in mm/filemap.c
> 
> Gladyshev Ilya (2):
>    mm: make ref_unless functions unless_zero only
>    mm: implement page refcount locking via dedicated bit
> 
>   include/linux/mm.h         |  2 +-
>   include/linux/page-flags.h |  9 ++++++---
>   include/linux/page_ref.h   | 35 ++++++++++++++++++++++++++---------
>   3 files changed, 33 insertions(+), 13 deletions(-)
> 

