Message-ID: <CAF8kJuNv06nCpxuvOt50rr4h1cco9Sk+g3nS_ximJFdo54f31Q@mail.gmail.com>
Date: Wed, 5 Jun 2024 00:30:38 -0700
From: Chris Li <chrisl@...nel.org>
To: Kairui Song <ryncsn@...il.com>
Cc: "Huang, Ying" <ying.huang@...el.com>, Andrew Morton <akpm@...ux-foundation.org>,
Ryan Roberts <ryan.roberts@....com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Barry Song <baohua@...nel.org>
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
On Fri, May 31, 2024 at 5:40 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@...el.com> wrote:
> >
> > For a specific configuration, I believe that we can get a reasonably
> > high success rate for high-order swap entry allocation in specific
> > use cases. For example, if we only allow a limited maximum number of
> > order-0 swap entry allocations, can we keep the high-order clusters?
>
> Doesn't limiting order-0 allocation break the bottom line that order-0
> allocation is a first-class citizen, and should not fail if there is
> space?
We need to have both high order and low order swap allocation working,
and to be able to recover from the swapfile-full case.
>
> Just my two cents...
>
> I had a try locally based on Chris's work, allowing order 0 to use
> nonfull_clusters as Ying has suggested, starting with the lowest order
> and increasing the order until nonfull_clusters[order] is not empty.
> That way higher orders are better protected, because direct scan won't
> happen unless we run out of both free_clusters and nonfull_clusters.
That does not help the Android test case Barry is running, because
Android tries to keep the swapfile full. It will hit the case where
both the empty and nonfull lists are used up. When it performs the low
memory kill, there will be a big change in the ratio of low vs high
order swap. Allocating high order swap entries should be able to
recover from that.
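
To make the proposed order concrete, here is a minimal user-space model
of that fallback. Only the free_clusters / nonfull_clusters naming
follows this series; the struct layout and the pop() helper are
simplified stand-ins, not the kernel code:

#include <stddef.h>

#define SWAP_NR_ORDERS 10       /* stand-in value for this model */

struct cluster { struct cluster *next; };

struct swap_lists {
        struct cluster *free_clusters;
        struct cluster *nonfull_clusters[SWAP_NR_ORDERS];
};

static struct cluster *pop(struct cluster **head)
{
        struct cluster *c = *head;

        if (c)
                *head = c->next;
        return c;
}

/*
 * Try a nonfull cluster of the requested order first, then a free
 * cluster (the discard list is left out here for brevity). For
 * order 0 only, escalate through the higher-order nonfull lists,
 * lowest order first, so the highest orders are polluted last.
 * NULL means the caller falls back to the direct slot scan.
 */
static struct cluster *pick_cluster(struct swap_lists *si, int order)
{
        struct cluster *c;

        c = pop(&si->nonfull_clusters[order]);
        if (c)
                return c;
        c = pop(&si->free_clusters);
        if (c)
                return c;
        if (order == 0) {
                for (int i = 1; i < SWAP_NR_ORDERS; i++) {
                        c = pop(&si->nonfull_clusters[i]);
                        if (c)
                                return c;
                }
        }
        return NULL;
}

This is only meant to make the priority discussion below concrete.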
>
> More concretely, I applied the following changes, which didn't change
> the code much:
> - In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
>   free_clusters, then discard_clusters.
I considered trying the nonfull list before the empty list. The current
allocation order tries to make HAS_CACHE-only swap entries stay on disk
for a longer time before recycling them. If the folio is still in the
swap cache and not dirty, reclaim can skip the write out and directly
reuse the swap slot. I am not sure this code path is important now; it
seems that when the swap slot is freed, the HAS_CACHE bit is removed as
well. BTW, I noticed that the cluster discard path doesn't check
whether the swap cache still has a folio pointing to it; after
discarding, it just sets the swap_map to 0. I wonder, if the swap cache
has a folio in a discarded slot, whether that would hit the
skip-writeback logic. If that is triggerable, it would be a corruption
bug.
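
To make the worry concrete, a toy model of the invariant that seems to
be missing. SWAP_HAS_CACHE and the one-byte swap_map encoding match the
kernel; the cluster size and everything else here are simplified
stand-ins:

#include <assert.h>

#define SWAPFILE_CLUSTER 512    /* stand-in; the kernel uses HPAGE_PMD_NR or 256 */
#define SWAP_HAS_CACHE   0x40   /* swap_map flag: slot has a swap cache folio */

/* One cluster's worth of the per-slot swap_map byte array. */
static unsigned char swap_map[SWAPFILE_CLUSTER];

static void discard_cluster(void)
{
        for (int i = 0; i < SWAPFILE_CLUSTER; i++) {
                /*
                 * If this fires, a folio in the swap cache may still
                 * reference the slot, and the "clean folio, skip
                 * writeback, reuse the slot" path could then see stale
                 * disk contents after the discard.
                 */
                assert(!(swap_map[i] & SWAP_HAS_CACHE));
                swap_map[i] = 0;
        }
}

In the kernel the lookup would of course go through the swap address
space rather than an assert; this just shows where the invariant would
sit.
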
The current SSD allocation also has a comment saying that old SSDs can
benefit from not writing to the same block too many times, to help
wear leveling. I don't think that is a big deal now; even cheap SD
cards have wear leveling nowadays.
> - If it's order 0, also check nonfull_clusters[i] for (int i = 0;
>   i < SWAP_NR_ORDERS; ++i) before scan_swap_map_try_ssd_cluster
>   returns false.
Ideally we would have some option to reserve some high order swap
space, so order 0 can't pollute the high order clusters.
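
Something along these lines, purely as a sketch; the knob name and the
accounting here are made up for illustration:

#include <stdbool.h>

/* Hypothetical tunable: clusters held back from order-0 allocation. */
static unsigned int hi_order_reserve = 16;

/* Clusters currently free or on the high-order nonfull lists. */
static unsigned int nr_hi_order_avail;

/*
 * Gate for the order-0 fallback: only let order 0 dip into high-order
 * clusters while more than the reserve would remain afterwards.
 */
static bool order0_may_steal(void)
{
        return nr_hi_order_avail > hi_order_reserve;
}
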
Chris
>
> A quick test, still using the memtier test, but with the swap device
> size decreased from 10G to 8G for higher pressure.
>
> Before:
> hugepages-32kB/stats/swpout:34013
> hugepages-32kB/stats/swpout_fallback:266
> hugepages-512kB/stats/swpout:0
> hugepages-512kB/stats/swpout_fallback:77
> hugepages-2048kB/stats/swpout:0
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35088
> hugepages-64kB/stats/swpout_fallback:66
> hugepages-16kB/stats/swpout:31848
> hugepages-16kB/stats/swpout_fallback:402
> hugepages-256kB/stats/swpout:390
> hugepages-256kB/stats/swpout_fallback:7244
> hugepages-128kB/stats/swpout:28573
> hugepages-128kB/stats/swpout_fallback:474
>
> After:
> hugepages-32kB/stats/swpout:31448
> hugepages-32kB/stats/swpout_fallback:3354
> hugepages-512kB/stats/swpout:30
> hugepages-512kB/stats/swpout_fallback:33
> hugepages-2048kB/stats/swpout:2
> hugepages-2048kB/stats/swpout_fallback:0
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:31255
> hugepages-64kB/stats/swpout_fallback:3112
> hugepages-16kB/stats/swpout:29931
> hugepages-16kB/stats/swpout_fallback:3397
> hugepages-256kB/stats/swpout:5223
> hugepages-256kB/stats/swpout_fallback:2351
> hugepages-128kB/stats/swpout:25600
> hugepages-128kB/stats/swpout_fallback:2194
>
> The high order (256k) swapout rate is significantly higher, and 512k
> is now possible, which indicates that high orders are better
> protected. Lower orders are sacrificed, but it seems worth it.
>