linux-kernel - Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)

Message-ID: <CAMgjq7AC9D6nOcU46ceWcLxCcPp=dezeOeaoMwsdHdSsLp85Ew@mail.gmail.com>
Date: Fri, 31 Oct 2025 14:58:58 +0800
From: Kairui Song <ryncsn@...il.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, Chris Li <chrisl@...nel.org>, 
	Nhat Pham <nphamcs@...il.com>, Johannes Weiner <hannes@...xchg.org>, 
	David Hildenbrand <david@...hat.com>, Youngjun Park <youngjun.park@....com>, 
	Hugh Dickins <hughd@...gle.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, Kemeng Shi <shikemeng@...weicloud.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags
 (swap table phase II)

On Fri, Oct 31, 2025 at 7:05 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:26PM +0800, Kairui Song wrote:
> > This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
> > special swap bits including SWAP_HAS_CACHE, along with many historical
> > issues. The performance is about 20% better for some workloads, like
> > Redis with persistence. This also cleans up the code to prepare for
> > later phases. Some patches are from a previously posted series.
> >
> > Swap cache bypassing and swap synchronization in general had many
> > issues. Some were addressed with workarounds, and some are still there
> > [1]. To resolve them in a clean way, one good solution is to always use
> > the swap cache as the synchronization layer [2]. So we have to remove
> > the swap cache bypass swap-in path first. That was not practical before
> > due to performance issues, but now, combined with the swap table,
> > removing the swap cache bypass path actually improves performance, so
> > there is no reason to keep it.
> >
> > Now we can rework the swap entry and cache synchronization following
> > the new design. Swap cache synchronization relied heavily on
> > SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> > of special swap map bits and related workarounds, we get a cleaner code
> > base and prepare for merging the swap count into the swap table in the
> > next step.
> >
> > Test results:
> >
> > Redis / Valkey bench:
> > =====================
> >
> > Testing on an ARM64 VM with 1.5G memory:
> > Server: valkey-server --maxmemory 2560M
> > Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> >
> >         no persistence              with BGSAVE
> > Before: 460475.84 RPS               311591.19 RPS
> > After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)
> >
> > Testing on an x86_64 VM with 4G memory (system components take about 2G):
> > Server:
> > Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> >
> >         no persistence              with BGSAVE
> > Before: 306044.38 RPS               102745.88 RPS
> > After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)
> >
> > The performance is a lot better when persistence is applied. This should
> > apply to many other workloads that involve sharing memory and COW. A
> > slight performance drop was observed for the ARM64 Redis test: We are
> > still using swap_map to track the swap count, which is causing redundant
> > cache and CPU overhead and is not very performance-friendly for some
> > arches. This will be improved once we merge the swap map into the swap
> > table (as already demonstrated previously [3]).
> >
> > vm-scalability
> > ==============
> > usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> > simulated PMEM as swap), average of 6 test runs:
> >
> >                            Before:         After:
> > System time:               282.22s         283.47s
> > Sum Throughput:            5677.35 MB/s    5688.78 MB/s
> > Single process Throughput: 176.41 MB/s     176.23 MB/s
> > Free latency:              518477.96 us    521488.06 us
> >
> > Which is almost identical.
> >
> > Build kernel test:
> > ==================
> > Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM
> > with 4G RAM, under global pressure, average of 32 test runs:
> >
> >                 Before:           After:
> > System time:    1379.91s          1364.22s (-1.14%)
> >
> > Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM
> > with 4G RAM, under global pressure, average of 32 test runs:
> >
> >                 Before:           After:
> > System time:    1822.52s          1803.33s (-1.05%)
> >
> > Which is almost identical.
> >
> > MySQL:
> > ======
> > sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> > --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> > 512M memory cgroup, buffer pool set to 3G, 3 test runs and 180s warm-up).
> >
> > Before: 318162.18 qps
> > After:  318512.01 qps (+0.11%)
> >
> > In conclusion, the results are better or identical in most cases, and
> > especially better for workloads with swap count > 1 on SYNC_IO
> > devices, with a gain of about 20% in the tests above. The next phases
> > will start to merge the swap count into the swap table and reduce
> > memory usage.
> >
> > One more gain here is that we now have better support for THP swapin.
> > Previously, THP swapin was bound to swap cache bypassing, which only
> > works for single-mapped folios. Removing the bypass path also enables
> > THP swapin for all folios. It's still limited to SYNC_IO devices,
> > though; this limitation will be removed later. This may cause more
> > serious thrashing for certain workloads, but that's not an issue
> > caused by this series. It's a common THP issue that we should resolve
> > separately.
> >
> > Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> > Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> > Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
> >
> > Suggested-by: Chris Li <chrisl@...nel.org>
> > Signed-off-by: Kairui Song <kasong@...cent.com>
>
> Unfortunately I don't have time to go through the series and review it,
> but I wanted to just say awesome work here. The special cases in the
> swap code to avoid using the swapcache have always been a pain.
>
> In fact, there's one more special case that we can probably remove in
> zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
> fix data loss on SWP_SYNCHRONOUS_IO devices").

Thanks! Oh, now I remember that one; it can indeed be removed. There
are several more cleanups and optimizations that can be done after this
series, but it's getting too long already, so I didn't include everything.

But removing the special case from 25cd241408a2 is easy to do and easy to
review, so I can include it in the next update.
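
For readers who want to re-derive the relative changes in the benchmark
tables of the quoted cover letter, here is a minimal sketch in Python. It is
not part of the series. The helper name pct_change and the dictionary labels
are made up for illustration; the before/after figures are copied from the
tables above.

```python
def pct_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the quoted benchmark tables.
# The labels are illustrative only.
results = {
    "ARM64 valkey, no persistence (RPS)": (460475.84, 451943.34),
    "ARM64 valkey, with BGSAVE (RPS)": (311591.19, 371379.06),
    "x86_64 redis, no persistence (RPS)": (306044.38, 309645.44),
    "x86_64 redis, with BGSAVE (RPS)": (102745.88, 125313.28),
    "kernel build on ZRAM, system time (s)": (1379.91, 1364.22),
    "kernel build on ZSWAP+NVME, system time (s)": (1822.52, 1803.33),
    "MySQL oltp_read_only (qps)": (318162.18, 318512.01),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_change(before, after):+.2f}%")
```

For throughput metrics (RPS, qps) a positive change is an improvement; for
system time a negative change is an improvement.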