lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 31 May 2024 20:40:11 +0800
From: Kairui Song <ryncsn@...il.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: Chris Li <chrisl@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	Ryan Roberts <ryan.roberts@....com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Barry Song <baohua@...nel.org>
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@...el.com> wrote:
>
> Chris Li <chrisl@...nel.org> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@...el.com> wrote:
> >>
> >> Chris Li <chrisl@...nel.org> writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@...el.com> wrote:
> >> >>
> >> >> Chris Li <chrisl@...nel.org> writes:
> >> >>
> >> >> > I am spinning a new version for this series to address two issues
> >> >> > found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> > +               ci = si->cluster_info + tmp;
> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> >> > That is a serious bug but trivial to fix.
> >> >> >
> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> >> > the cluster->order.
> >> >>
> >> >> IIUC, now, we only scan swap_map[] only if
> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> >> That is, if you doesn't run low swap free space, you will not do that.
> >> >
> >> > You can still swap space in order 0 clusters while order 4 runs out of
> >> > free_cluster
> >> > or nonfull_clusters[order]. For Android that is a common case.
> >>
> >> When we fail to allocate order 4, we will fallback to order 0.  Still
> >> don't need to scan swap_map[].  But after looking at your below reply, I
> >> realized that the swap space is almost full at most times in your cases.
> >> Then, it's possible that we run into scanning swap_map[].
> >> list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> >> many clusters in si->percpu_cluster.  So, if we want to avoid to scan
> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> >> space runs low.  And maybe take clusters out of si->percpu_cluster
> >> sometimes.
> >
> > One idea after reading your reply. If we run out of the
> > nonfull_cluster[order], we should be able to use other cpu's
> > si->percpu_cluster[] as well. That is a very small win for Android,
>
> This will be useful in general.  The number CPU may be large, and
> multiple orders may be used.
>
> > because android does not have too many cpu. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollut higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you to do more test and tracing to find out the reason.  I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2].  In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation.  This can avoid to scan swap_map[] too.  This may be not
> >> perfect, but it is the simplest first step implementation.  You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be some there in between.  If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order cluster? If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reserves. The current patch series lets the
> > actual usage determine the percentage of the cluster for each order.
> > However that seems not enough for the test case Barry has. When the
> > app gets OOM kill that is where a large swing of order 0 swap will
> > show up and not enough higher order usage for the brief moment. The
> > order 0 swap entry will pollute the high order cluster. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed and
> > order 0 swap entry can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized.  And, OOM with free swap space isn't
> acceptable.  Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order cluster polluted by
> > lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally.  But we are
> too far from that.
>
> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Isn't limiting order-0 allocation breaks the bottom line that order-0
allocation is the first class citizen, and should not fail if there is
space?

Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying has suggested, and starting with low order
and increase the order until nonfull_cluster[order] is not empty, that
way higher order is just better protected, because unless we ran out
of free_cluster and nonfull_cluster, direct scan won't happen.

More concretely, I applied the following changes, which didn't change
the code much:
- In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
free_clusters, then discard_cluster.
- If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
returns false.

A quick test still using the memtier test, but decreased the swap
device size from 10G to 8g for higher pressure.

Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

High order (256k) swapout rate are significantly higher, 512k is now
possible, which indicate high orders are better protected, lower
orders are sacrificed but seems worth it.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ