Message-ID: <SJ0PR11MB5678BC6BBF8A4D7694EDDFDAC9692@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Wed, 25 Sep 2024 18:39:25 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: "Huang, Ying" <ying.huang@...el.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"nphamcs@...il.com" <nphamcs@...il.com>, "chengming.zhou@...ux.dev"
<chengming.zhou@...ux.dev>, "usamaarif642@...il.com"
<usamaarif642@...il.com>, "shakeel.butt@...ux.dev" <shakeel.butt@...ux.dev>,
"ryan.roberts@....com" <ryan.roberts@....com>, "21cnbao@...il.com"
<21cnbao@...il.com>, "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"Zou, Nanhai" <nanhai.zou@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> -----Original Message-----
> From: Huang, Ying <ying.huang@...el.com>
> Sent: Tuesday, September 24, 2024 11:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; nphamcs@...il.com;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> shakeel.butt@...ux.dev; ryan.roberts@....com; 21cnbao@...il.com;
> akpm@...ux-foundation.org; Zou, Nanhai <nanhai.zou@...el.com>; Feghali,
> Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> Kanchana P Sridhar <kanchana.p.sridhar@...el.com> writes:
>
> [snip]
>
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> > in the 64K/2M (m)THP being split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, with this patch-series applied, which
> > results in the 64K/2M (m)THP not being split, and being processed by zswap.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024    zswap-mTHP               Change wrt
> >                     CONFIG_THP_SWAP=N        CONFIG_THP_SWAP=Y        Baseline
> >                     Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor    zstd      deflate-       zstd       deflate-      zstd  deflate-
> >                               iaa                       iaa                 iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)   143,323   125,485        153,550    129,609        7%    3%
> > elapsed time (sec)    24.97     25.42          23.90      25.19         4%    1%
> > sys time (sec)       822.72    750.96         757.70     731.13         8%    3%
> > memcg_high          132,743   169,825        148,075    192,744
> > memcg_swap_fail     639,067   841,553          2,204      2,215
> > pswpin                    0         0              0          0
> > pswpout                   0         0              0          0
> > zswpin                  795       873            760        902
> > zswpout          10,011,266 13,195,137     10,010,017 13,193,554
> > thp_swpout                0         0              0          0
> > thp_swpout_               0         0              0          0
> >  fallback
> > 64kB-mthp_          639,065   841,553          2,204      2,215
> >  swpout_fallback
> > pgmajfault            2,861     2,924          3,054      3,259
> > ZSWPOUT-64kB            n/a       n/a        623,451    822,268
> > SWPOUT-64kB               0         0              0          0
> > -------------------------------------------------------------------------------
> >
>
> IIUC, the throughput is the sum of throughput of all usemem processes?
>
> One possible issue with the usemem test case is the "imbalance" issue. That
> is, some usemem processes may swap-out/swap-in less, so their score is
> very high; while some other processes may swap-out/swap-in more, so their
> score is very low. Sometimes the total score decreases, but the scores
> of the usemem processes are more balanced, so the performance should be
> considered better. And, in general, we should make the usemem scores
> balanced among processes, say via a longer test time. Can you check this
> in your test results?

Actually, the throughput data listed in the cover letter is the average over
all the usemem processes. Your observation about the "imbalance" issue is
right: some processes see higher throughput than others, and I have noticed
that the throughputs progressively decrease as the individual processes exit
and print their stats.

Listed below are the stats from two runs of usemem70 (70 usemem processes),
one with "sleep 10" and one with "sleep 30". Both are run with a cgroup memory
limit of 40G. The data is with v7, with 64K folios enabled and zswap using zstd.
-----------------------------------------------
  sleep 10              sleep 30
  Throughput (KB/s)     Throughput (KB/s)
-----------------------------------------------
181,540 191,686
179,651 191,459
179,068 188,834
177,244 187,568
177,215 186,703
176,565 185,584
176,546 185,370
176,470 185,021
176,214 184,303
176,128 184,040
175,279 183,932
174,745 180,831
173,935 179,418
161,546 168,014
160,332 167,540
160,122 167,364
159,613 167,020
159,546 166,590
159,021 166,483
158,845 166,418
158,426 166,264
158,396 166,066
158,371 165,944
158,298 165,866
158,250 165,884
158,057 165,533
158,011 165,532
157,899 165,457
157,894 165,424
157,839 165,410
157,731 165,407
157,629 165,273
157,626 164,867
157,581 164,636
157,471 164,266
157,430 164,225
157,287 163,290
156,289 153,597
153,970 147,494
148,244 147,102
142,907 146,111
142,811 145,789
139,171 141,168
136,314 140,714
133,616 140,111
132,881 139,636
132,729 136,943
132,680 136,844
132,248 135,726
132,027 135,384
131,929 135,270
131,766 134,748
131,667 134,733
131,576 134,582
131,396 134,302
131,351 134,160
131,135 134,102
130,885 134,097
130,854 134,058
130,767 134,006
130,666 133,960
130,647 133,894
130,152 133,837
130,006 133,747
129,921 133,679
129,856 133,666
129,377 133,564
128,366 133,331
127,988 132,938
126,903 132,746
-----------------------------------------------
sum                  10,526,916   10,919,561
average                 150,385      155,994
stddev                   17,551       19,633
-----------------------------------------------
elapsed time (sec)        24.40        43.66
sys time (sec)           806.25       766.05
zswpout              10,008,713   10,008,407
64K folio swpout        623,463      623,629
-----------------------------------------------
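
For reference, here is a minimal sketch (not how the numbers above were
produced) of how the summary rows above can be recomputed from the
per-process throughputs, with a coefficient of variation added as a simple
imbalance metric. The list contents are placeholders to be filled in with
the 70 per-process values:

# Recompute sum/average/stddev of per-process usemem throughput (KB/s),
# plus coefficient of variation as an indicator of imbalance across processes.
import statistics

throughput_kbs = [181_540, 179_651, 179_068]  # placeholder; paste all 70 values

total = sum(throughput_kbs)
avg = statistics.mean(throughput_kbs)
sd = statistics.stdev(throughput_kbs)   # sample standard deviation
cov = sd / avg                          # higher => less balanced processes

print(f"sum      {total:,}")
print(f"average  {avg:,.0f}")
print(f"stddev   {sd:,.0f}")
print(f"cov      {cov:.1%}")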

As we increase the time for which the allocations are maintained, there seems
to be a slight improvement in throughput, but the variance increases as well.
The processes with lower throughput could be the ones that respond to the
memcg going over its limit by doing reclaim, possibly before they can allocate.

Interestingly, the longer test time does seem to reduce the total amount of
reclaim (hence the lower sys time), yet slightly more 64K large folios get
reclaimed. Could this mean that with the longer test time (sleep 30), more
cold memory residing in large folios is being reclaimed, as opposed to memory
just relinquished by the exiting processes?
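
Just as a sketch of how this could be checked (not something run for the
numbers above): snapshot the relevant counters before and after each run and
compare the deltas. The per-size 64K sysfs stats path used below is an
assumption and may differ depending on the kernel version and patches applied:

# Snapshot swap/zswap counters before and after a usemem run and print deltas,
# to see how much reclaim happened and how much of it came from 64K folios.
from pathlib import Path

VMSTAT_KEYS = ("pswpin", "pswpout", "zswpin", "zswpout",
               "thp_swpout", "thp_swpout_fallback", "pgmajfault")
# Assumed per-size mTHP stats location; may vary with kernel/patches.
MTHP_64K_STATS = Path("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats")

def snapshot():
    counters = {}
    for line in Path("/proc/vmstat").read_text().splitlines():
        key, value = line.split()
        if key in VMSTAT_KEYS:
            counters[key] = int(value)
    for name in ("swpout", "swpout_fallback"):
        f = MTHP_64K_STATS / name
        if f.exists():
            counters["64kB_" + name] = int(f.read_text())
    return counters

before = snapshot()
# ... run the usemem workload here ...
after = snapshot()
for key, value in after.items():
    print(f"{key:>22}: {value - before.get(key, 0):,}")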
Thanks,
Kanchana
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying