[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkYzcxNbZLRL4DezWC27RqdEcfqr0Eafr2h0bUbacuRSjw@mail.gmail.com>
Date: Wed, 23 Oct 2024 10:59:42 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Kairui Song <kasong@...cent.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Chris Li <chrisl@...nel.org>, Barry Song <v-songbaohua@...o.com>,
Ryan Roberts <ryan.roberts@....com>, Hugh Dickins <hughd@...gle.com>,
"Huang, Ying" <ying.huang@...el.com>, Tim Chen <tim.c.chen@...ux.intel.com>,
Nhat Pham <nphamcs@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/13] mm, swap: rework of swap allocator locks
On Tue, Oct 22, 2024 at 12:29 PM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> This series improved the swap allocator performance greatly by reworking
> the locking design and simplify a lot of code path.
>
> This is follow up of previous swap cluster allocator series:
> https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> And this series is based on an follow up fix of the swap cluster
> allocator:
> https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/
>
> This is part of the new swap allocator work item discussed in
> Chris's "Swap Abstraction" discussion at LSF/MM 2024, and
> "mTHP and swap allocator" discussion at LPC 2024.
>
> Previous series introduced a fully cluster based allocation algorithm,
> this series completely get rid of the old allocation path and makes the
> allocator avoid grabbing the si->lock unless needed. This bring huge
> performance gain and get rid of slot cache on freeing path.
>
> Currently, swap locking is mainly composed of two locks, cluster lock
> (ci->lock) and device lock (si->lock). The device lock is widely used
> to protect many things, causing it to be the main bottleneck for SWAP.
>
> Cluster lock is much more fine-grained, so it will be best to use
> ci->lock instead of si->lock as much as possible.
>
> `perf lock` indicates this issue clearly. Doing linux kernel build
> using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
> pages), result of "perf lock contention -ab sleep 3":
>
> contended total wait max wait avg wait type caller
>
> 34948 53.63 s 7.11 ms 1.53 ms spinlock free_swap_and_cache_nr+0x350
> 16569 40.05 s 6.45 ms 2.42 ms spinlock get_swap_pages+0x231
> 11191 28.41 s 7.03 ms 2.54 ms spinlock swapcache_free_entries+0x59
> 4147 22.78 s 122.66 ms 5.49 ms spinlock page_vma_mapped_walk+0x6f3
> 4595 7.17 s 6.79 ms 1.56 ms spinlock swapcache_free_entries+0x59
> 406027 2.74 s 2.59 ms 6.74 us spinlock list_lru_add+0x39
> ...snip...
>
> The top 5 caller are all users of si->lock, total wait time up sums to
> several minutes in the 3 seconds time window.
>
> Following the new allocator design, many operation doesn't need to touch
> si->lock at all. We only need to take si->lock when doing operations
> across multiple clusters (eg. changing the cluster list), other
> operations only need to take ci->lock. So ideally allocator should
> always take ci->lock first, then, if needed, take si->lock. But due
> to historical reasons, ci->lock is used inside si->lock by design,
> causing lock inversion if we simply try to acquire si->lock after
> acquiring ci->lock.
>
> This series audited all si->lock usage, simplify legacy codes, eliminate
> usage of si->lock as much as possible by introducing new designs based
> on the new cluster allocator.
>
> Old HDD allocation codes are removed, cluster allocator is adapted
> with small changes for HDD usage, test is looking OK.
>
> And this also removed slot cache for freeing path. The performance is
> better without it, and this enables other clean up and optimizations
> as discussed before:
> https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/
>
> After this series, lock contention on si->lock is nearly unobservable
> with `perf lock` with the same test above :
>
> contended total wait max wait avg wait type caller
> ... snip ...
> 91 204.62 us 4.51 us 2.25 us spinlock cluster_move+0x2e
> ... snip ...
> 47 125.62 us 4.47 us 2.67 us spinlock cluster_move+0x2e
> ... snip ...
> 23 63.15 us 3.95 us 2.74 us spinlock cluster_move+0x2e
> ... snip ...
> 17 41.26 us 4.58 us 2.43 us spinlock cluster_isolate_lock+0x1d
> ... snip ...
>
> cluster_move and cluster_isolate_lock are basically the only users
> of si->lock now, performance gain is huge with reduced LOC.
>
> Tests
> ===
>
> Build kernel with defconfig on tmpfs with ZRAM as swap:
> ---
>
> Running a test matrix which is scaled up progressive for a intuitive result.
> The test are ran on top of tmpfs, using memory cgroup for memory limitation,
> on a 48c96t system.
>
> 12 test run for each case, it can be seen clearly that as concurrent job
> number goes higher the performance gain is higher, the performance is
> higher even with low concurrency.
>
> make -j<NR> | System Time (seconds) | Total Time (seconds)
> (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
> With 4k pages only:
> 6 / 192M / 3G | 5258 / 5235 / -0.3% | 1420 / 1414 / -0.3%
> 12 / 256M / 4G | 5518 / 5337 / -3.3% | 758 / 742 / -2.1%
> 24 / 384M / 5G | 7091 / 5766 / -18.7% | 476 / 422 / -11.3%
> 48 / 768M / 7G | 11139 / 5831 / -47.7% | 330 / 221 / -33.0%
> 96 / 1.5G / 10G | 21303 / 11353 / -46.7% | 283 / 180 / -36.4%
> With 64k mTHP:
> 24 / 512M / 5G | 5104 / 4641 / -18.7% | 376 / 358 / -4.8%
> 48 / 1G / 7G | 8693 / 4662 / -18.7% | 257 / 176 / -31.5%
> 96 / 2G / 10G | 17056 / 10263 / -39.8% | 234 / 169 / -27.8%
>
> With more aggressive setup, it shows clearly both the performance and
> fragmentation are better:
>
> tiem make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test run)
> Before:
> Sys time: 73578.30, Real time: 864.05
> tiem make -j96 / 1G memcg, 4K pages, 10G ZRAM:
> After: (-54.7% sys time, -49.3% real time)
> Sys time: 33314.76, Real time: 437.67
>
> time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
> (avg of 4 test run)
> Before:
> Sys time: 74044.85, Real time: 846.51
> hugepages-64kB/stats/swpout: 1735216
> hugepages-64kB/stats/swpout_fallback: 430333
> After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
> Sys time: 35958.87, Real time: 442.69
> hugepages-64kB/stats/swpout: 1866267
> hugepages-64kB/stats/swpout_fallback: 158330
>
> There is a up to 54.7% improvement for build kernel test, and lower
> fragmentation rate. Performance improvement should be even larger for
> micro benchmarks
>
> Build kernel with tinyconfig on tmpfs with HDD as swap:
> ---
>
> This test is similar to above, but HDD test is very noisy and slow, the
> deviation is huge, so just use tinyconfig instead and take the median test
> result of 3 test run, which looks OK:
>
> Before this series:
> 114.44user 29.11system 39:42.90elapsed 6%CPU
> 2901232inputs+0outputs (238877major+4227640minor)pagefaults
>
> After this commit:
> 113.90user 23.81system 38:11.77elapsed 6%CPU
> 2548728inputs+0outputs (235471major+4238110minor)pagefaults
>
> Single thread SWAP:
> ---
>
> Sequential SWAP should also be slightly faster as we removed a lot of
> unnecessary parts. Test using micro benchmark for swapout/in 4G
> zero memory using ZRAM, 10 test runs:
>
> Swapout Before (avg. 3359304):
> 3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776
>
> Swapin Before (avg. 1928698):
> 1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155
>
> Swapout After (avg. 3347511, -0.4%):
> 3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359
>
> Swapin After (avg. 1922290, -0.3%):
> 1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913
Unfortunately I don't have the time to go through this series, but I
just wanted to say that this awesome work, Kairui.
Selfishly, I especially like cleaning up the swap slot freeing path,
and having a centralized freeing path with a single call to
zswap_invalidate().
Thanks for doing this :)
Powered by blists - more mailing lists