linux-kernel - [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
Date: Fri, 24 May 2024 10:17:17 -0700
From: Chris Li <chrisl@...nel.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Kairui Song <kasong@...cent.com>, Ryan Roberts <ryan.roberts@....com>, 
 "Huang, Ying" <ying.huang@...el.com>, linux-kernel@...r.kernel.org, 
 linux-mm@...ck.org, Chris Li <chrisl@...nel.org>, 
 Barry Song <baohua@...nel.org>
Subject: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster
 order

This is the short term solutiolns "swap cluster order" listed
in my "Swap Abstraction" discussion slice 8 in the recent
LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" is introduced, it only allocates the mTHP swap entries
from new empty cluster list. That works well for PMD size THP,
but it has a serius fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The mTHP allocation failure rate raises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty cluster has been exhausted while
there are planty of free swap entries to in the cluster that is
not 100% free.

Address this by remember the swap allocation order in the cluster.
Keep track of the per order non full cluster list for later allocation.

This greatly improve the sucess rate of the mTHP swap allocation.
While I am still waiting for Barry's test result. I paste Kairui's test
result here:

I'm able to reproduce such an issue with a simple script (enabling all order of mthp):

modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0

rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &

/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
        -P memcache_binary -n allkeys --key-minimum=1 \
        --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
        --ratio 1:0 --pipeline 8 -d 1024

Before:
Totals      48805.63         0.00         0.00         5.26045         1.19100        38.91100        59.64700     51063.98
After:
Totals      71098.84         0.00         0.00         3.60585         0.71100        26.36700        39.16700     74388.74

And the fallback ratio dropped by a lot:
Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596

(Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)

After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52

Reported-by: Barry Song <21cnbao@...il.com>
Tested-by: Kairui Song <kasong@...cent.com>
Signed-off-by: Chris Li <chrisl@...nel.org>
---
Chris Li (2):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list

 include/linux/swap.h |  18 ++--
 mm/swapfile.c        | 252 +++++++++++++++++----------------------------------
 2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
-- 
Chris Li <chrisl@...nel.org>