Message-ID: <CAGsJ_4wGK6pu+KNhYjpWgydp6DyjH5tE=9+mje3UyrXdFJOuNw@mail.gmail.com>
Date: Tue, 27 Aug 2024 07:46:19 +1200
From: Barry Song <21cnbao@...il.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: hanchuanhua@...o.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
baolin.wang@...ux.alibaba.com, chrisl@...nel.org, david@...hat.com,
hannes@...xchg.org, hughd@...gle.com, kaleshsingh@...gle.com,
kasong@...cent.com, linux-kernel@...r.kernel.org, mhocko@...e.com,
minchan@...nel.org, nphamcs@...il.com, ryan.roberts@....com,
senozhatsky@...omium.org, shy828301@...il.com, surenb@...gle.com,
v-songbaohua@...o.com, willy@...radead.org, xiang@...nel.org,
ying.huang@...el.com, yosryahmed@...gle.com, hch@...radead.org,
ryncsn@...il.com, Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH v7 2/2] mm: support large folios swap-in for sync io devices
On Sat, Aug 24, 2024 at 5:56 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> Hi Barry,
>
> On Thu, Aug 22, 2024 at 05:13:06AM GMT, Barry Song wrote:
> > On Thu, Aug 22, 2024 at 1:31 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > > On Wed, Aug 21, 2024 at 03:45:40PM GMT, hanchuanhua@...o.com wrote:
> > > > From: Chuanhua Han <hanchuanhua@...o.com>
> > > >
> > > >
> > > > 3. With both mTHP swap-out and swap-in supported, we offer the option to enable
> > > > zsmalloc compression/decompression with larger granularity[2]. The upcoming
> > > > optimization in zsmalloc will significantly increase swap speed and improve
> > > > compression efficiency. Tested by running 100 iterations of swapping 100MiB
> > > > of anon memory, the swap speed improved dramatically:
> > > >              time consumption of swapin(ms)   time consumption of swapout(ms)
> > > > lz4 4k                    45274                            90540
> > > > lz4 64k                   22942                            55667
> > > > zstdn 4k                  85035                           186585
> > > > zstdn 64k                 46558                           118533
> > >
> > > Are the above number with the patch series at [2] or without? Also can
> > > you explain your experiment setup or how can someone reproduce these?
> >
> > Hi Shakeel,
> >
> > The data was recorded after applying both this patch (swap-in mTHP) and
> > patch [2] (compressing/decompressing mTHP instead of page). However,
> > without the swap-in series, patch [2] becomes useless because:
> >
> > If we have a large object, such as 16 pages in zsmalloc:
> > do_swap_page will happen 16 times:
> > 1. decompress the whole large object and copy one page;
> > 2. decompress the whole large object and copy one page;
> > 3. decompress the whole large object and copy one page;
> > ....
> > 16. decompress the whole large object and copy one page;
> >
> > So, patchset [2] will actually degrade performance rather than
> > enhance it if we don't have this swap-in series. This swap-in
> > series is a prerequisite for the zsmalloc/zram series.
>
> Thanks for the explanation.
>
> >
> > We reproduced the data through the following simple steps:
> > 1. Collected anonymous pages from a running phone and saved them to a file.
> > 2. Used a small program to open and read the file into a mapped anonymous
> > memory.
> > 3. Did the following in the small program:
> > swapout_start_time
> > madv_pageout()
> > swapout_end_time
> >
> > swapin_start_time
> > read_data()
> > swapin_end_time
> >
> > We calculate the throughput of swapout and swapin using the difference between
> > end_time and start_time. Additionally, we record the memory usage of zram after
> > the swapout is complete.
> >
>
> Please correct me if I am wrong but you are saying in your experiment,
> 100 MiB took 90540 ms to compress/swapout and 45274 ms to
> decompress/swapin if backed by 4k pages but took 55667 ms and 22942 ms
> if backed by 64k pages. Basically the table shows total time to compress
> or decompress 100 MiB of memory, right?
Hi Shakeel,

Tangquan (CC'd) collected the data and has double-checked it to confirm
the answer to your question.

We have three cases:
1. no mTHP swap-in, no zsmalloc/zram multi-page compression/decompression
2. mTHP swap-in, but no zsmalloc/zram multi-page compression/decompression
3. mTHP swap-in plus zsmalloc/zram multi-page compression/decompression

The earlier table compared case 1 against case 3.

To provide more precise data covering each change, Tangquan re-ran
1 vs. 2 and 2 vs. 3 yesterday with LZ4 at my request (the hardware
might differ from the previous test, but the data shows the same trend):
1. no mTHP swap-in, no zsmalloc/zram patch
   swapin:  30336 ms
   swapout: 65651 ms
2. mTHP swap-in, no zsmalloc/zram patch
   swapin:  27161 ms
   swapout: 61135 ms
3. mTHP swap-in and zsmalloc/zram patch
   swapin:  13683 ms
   swapout: 43305 ms
The test pseudocode is as follows:

	addr = mmap(100M);
	read_anon_data_from_file_to_addr();
	for (i = 0; i < 100; i++) {
		swapout_start_time;
		madv_pageout();
		swapout_end_time;

		swapin_start_time;
		read_addr_to_swapin();
		swapin_end_time;
	}
So, while we saw some improvement from 1 to 2, the significant gains
come from using large blocks for compression and decompression.
This mTHP swap-in series ensures that mTHPs aren't lost after the first
swap-in, so the remaining 99 iterations continue to exercise mTHP
swap-out and mTHP swap-in.

The improvement from 1 to 2 is due to this mTHP swap-in series, while
the improvement from 2 to 3 comes from the zsmalloc/zram patchset [2]
you mentioned.
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
Thanks
Barry