Message-ID: <SJ0PR11MB56784DD616B595FC707E5133C98D2@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Tue, 20 Aug 2024 03:00:52 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: "Huang, Ying" <ying.huang@...el.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "yosryahmed@...gle.com" <yosryahmed@...gle.com>,
"nphamcs@...il.com" <nphamcs@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "Zou, Nanhai"
<nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios

Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@...el.com>
> Sent: Sunday, August 18, 2024 10:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; yosryahmed@...gle.com; nphamcs@...il.com;
> ryan.roberts@....com; 21cnbao@...il.com; akpm@...ux-foundation.org;
> Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
>
> "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com> writes:
>
> > Hi Ying,
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@...el.com>
> >> Sent: Sunday, August 18, 2024 8:17 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> >> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> >> hannes@...xchg.org; yosryahmed@...gle.com; nphamcs@...il.com;
> >> ryan.roberts@....com; 21cnbao@...il.com; akpm@...ux-
> foundation.org;
> >> Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> >> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> >> Subject: Re: [PATCH v4 0/4] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@...el.com> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Performance Testing:
> >> > ====================
> >> > Testing of this patch-series was done with the v6.11-rc3 mainline,
> >> > without and with this patch-series, on an Intel Sapphire Rapids server,
> >> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >> >
> >> > The system has 503 GiB RAM, with a 4G SSD as the backing swap device for
> >> > ZSWAP. Core frequency was fixed at 2500MHz.
> >> >
> >> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> >> > was fixed. Following a similar methodology as in Ryan Roberts'
> >> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> >> > run, each allocating and writing 1G of memory:
> >> >
> >> > usemem --init-time -w -O -n 70 1g
> >> >
> >> > Since I was constrained to get the 70 usemem processes to generate
> >> > swapout activity with the 4G SSD, I ended up using different cgroup
> >> > memory.high fixed limits for the experiments with 64K mTHP and 2M THP:
> >> >
> >> > 64K mTHP experiments: cgroup memory fixed at 60G
> >> > 2M THP experiments : cgroup memory fixed at 55G
> >> >
> >> > The vm/sysfs stats included after the performance data provide details
> >> > on the swapout activity to SSD/ZSWAP.
> >> >
> >> > Other kernel configuration parameters:
> >> >
> >> > ZSWAP Compressor : LZ4, DEFLATE-IAA
> >> > ZSWAP Allocator : ZSMALLOC
> >> > SWAP page-cluster : 2
> >> >
> >> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> >> > IAA "compression verification" is enabled. Hence each IAA compression
> >> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> >> > returned by the hardware will be compared and errors reported in case of
> >> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> >> > compared to the software compressors.
> >> >
> >> > Throughput reported by usemem and perf sys time for running the test
> >> > are as follows, averaged across 3 runs:
> >> >
> >> > 64KB mTHP (cgroup memory.high set to 60G):
> >> > ==========================================
> >> > ------------------------------------------------------------------
> >> > | | | | |
> >> > |Kernel | mTHP SWAP-OUT | Throughput | Improvement|
> >> > | | | KB/s | |
> >> > |--------------------|-------------------|------------|------------|
> >> > |v6.11-rc3 mainline | SSD | 335,346 | Baseline |
> >> > |zswap-mTHP-Store | ZSWAP lz4 | 271,558 | -19% |
> >>
> >> zswap throughput is worse than ssd swap? This doesn't look right.
> >
> > I realize it might look that way; however, this is not an apples-to-apples
> > comparison, as explained in the latter part of my analysis (after the 2M
> > THP data tables). The primary reason is that the test is run under a fixed
> > cgroup memory limit.
> >
> > In the "Before" scenario, mTHP get swapped out to SSD. However, the disk
> swap
> > usage is not accounted towards checking if the cgroup's memory limit has
> been
> > exceeded. Hence there are relatively fewer swap-outs, resulting mainly
> from the
> > 1G allocations from each of the 70 usemem processes working with a 60G
> memory
> > limit on the parent cgroup.
> >
> > However, the picture changes in the "After" scenario. mTHPs will now get
> > stored in zswap, which is accounted for in the cgroup's memory.current and
> > counts towards the fixed memory limit in effect for the parent cgroup. As a
> > result, when mTHPs get stored in zswap, the compressed data in the zswap
> > zpool now counts towards the cgroup's active memory and memory limit. This
> > is in addition to the 1G allocations from each of the 70 processes.
> >
> > As you can see, this creates more memory pressure on the cgroup, resulting
> > in more swap-outs. With lz4 as the zswap compressor, this results in lower
> > throughput than "Before".
> >
> > However, with IAA as the zswap compressor, the throughput with zswap mTHP
> > is better than "Before" because of better hardware compression latencies,
> > which handle the higher swap-out activity without compromising throughput.
> >
> >>
> >> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 388,154 | 16% |
> >> > |------------------------------------------------------------------|
> >> > | | | | |
> >> > |Kernel | mTHP SWAP-OUT | Sys time | Improvement|
> >> > | | | sec | |
> >> > |--------------------|-------------------|------------|------------|
> >> > |v6.11-rc3 mainline | SSD | 91.37 | Baseline |
> >> > |zswap-mTHP-Store | ZSWAP lz4 | 265.43 | -191% |
> >> > |zswap-mTHP-Store | ZSWAP deflate-iaa | 235.60 | -158% |
> >> > ------------------------------------------------------------------
> >> >
> >> > -----------------------------------------------------------------------
> >> > | VMSTATS, mTHP ZSWAP/SSD stats| v6.11-rc3 | zswap-mTHP | zswap-mTHP |
> >> > | | mainline | Store | Store |
> >> > | | | lz4 | deflate-iaa |
> >> > |-----------------------------------------------------------------------|
> >> > | pswpin | 0 | 0 | 0 |
> >> > | pswpout | 174,432 | 0 | 0 |
> >> > | zswpin | 703 | 534 | 721 |
> >> > | zswpout | 1,501 | 1,491,654 | 1,398,805 |
> >>
> >> It appears that the number of swapped pages for zswap is much larger
> >> than that of SSD swap. Why? I guess this is why zswap throughput is
> >> worse.
> >
> > Your observation is correct. I hope the above explanation helps as to the
> > reasoning behind this.
>
> Before:
> (174432 + 1501) * 4 / 1024 = 687.2 MB
>
> After:
> 1491654 * 4.0 / 1024 = 5826.8 MB
>
> From your previous words, 10GB memory should be swapped out.
>
> Even if the average compression ratio is 0, the swap-out count of zswap
> should be about 100% more than that of SSD. However, the ratio here
> appears unreasonable.

Excellent point! In order to understand this better myself, I ran usemem with
1 process that tries to allocate 58G:

cgroup memory.high = 60,000,000,000
usemem --init-time -w -O -n 1 58g

usemem -n 1 58g        Before         After
----------------------------------------------
pswpout                586,352        0
zswpout                1,005          1,042,963
----------------------------------------------
Total swapout          587,357        1,042,963
----------------------------------------------
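
Doing the same arithmetic as above for this 1-process run:

Before:
(586,352 + 1,005) * 4 / 1024 = 2,294.4 MB

After:
1,042,963 * 4.0 / 1024 = 4,074.1 MB

i.e., with a single process the zswap swap-out volume is roughly 1.8x the SSD
swap-out volume, much closer to the "about 100% more" you estimated.
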
In the case where the cgroup has only 1 process, your rationale above applies
(more or less). The following shows the stats collected every 100 microseconds
from the critical section of the workload, right before the memory limit is
reached (Before and After):
===========================================================================
BEFORE zswap_store mTHP:
===========================================================================
 cgroup_memory     cgroup_memory     zswap_pool       zram_compr
                   w/o zswap         _total_size      _data_size
---------------------------------------------------------------------------
59,999,600,640 59,999,600,640 0 74
59,999,911,936 59,999,911,936 0 14,139,441
60,000,083,968 59,997,634,560 2,449,408 53,448,205
59,999,952,896 59,997,503,488 2,449,408 93,477,490
60,000,083,968 59,997,634,560 2,449,408 133,152,754
60,000,083,968 59,997,634,560 2,449,408 172,628,328
59,999,952,896 59,997,503,488 2,449,408 212,760,840
60,000,083,968 59,997,634,560 2,449,408 251,999,675
60,000,083,968 59,997,634,560 2,449,408 291,058,130
60,000,083,968 59,997,634,560 2,449,408 329,655,206
59,999,793,152 59,997,343,744 2,449,408 368,938,904
59,999,924,224 59,997,474,816 2,449,408 408,652,723
59,999,924,224 59,997,474,816 2,449,408 447,830,071
60,000,055,296 59,997,605,888 2,449,408 487,776,082
59,999,924,224 59,997,474,816 2,449,408 526,826,360
60,000,055,296 59,997,605,888 2,449,408 566,193,520
60,000,055,296 59,997,605,888 2,449,408 604,625,879
60,000,055,296 59,997,605,888 2,449,408 642,545,706
59,999,924,224 59,997,474,816 2,449,408 681,958,173
59,999,924,224 59,997,474,816 2,449,408 721,908,162
59,999,924,224 59,997,474,816 2,449,408 761,935,307
59,999,924,224 59,997,474,816 2,449,408 802,014,594
59,999,924,224 59,997,474,816 2,449,408 842,087,656
59,999,924,224 59,997,474,816 2,449,408 883,889,588
59,999,924,224 59,997,474,816 2,449,408 804,458,184
59,999,793,152 59,997,343,744 2,449,408 94,150,548
54,938,513,408 54,936,064,000 2,449,408 172,644
29,492,523,008 29,490,073,600 2,449,408 172,644
3,465,621,504 3,463,172,096 2,449,408 131,457
---------------------------------------------------------------------------
===========================================================================
AFTER zswap_store mTHP:
===========================================================================
 cgroup_memory     cgroup_memory     zswap_pool
                   w/o zswap         _total_size
---------------------------------------------------------------------------
55,578,234,880 55,578,234,880 0
56,104,095,744 56,104,095,744 0
56,644,898,816 56,644,898,816 0
57,184,653,312 57,184,653,312 0
57,706,057,728 57,706,057,728 0
58,226,937,856 58,226,937,856 0
58,747,293,696 58,747,293,696 0
59,275,776,000 59,275,776,000 0
59,793,772,544 59,793,772,544 0
60,000,141,312 60,000,141,312 0
59,999,956,992 59,999,956,992 0
60,000,169,984 60,000,169,984 0
59,999,907,840 59,951,226,880 48,680,960
60,000,169,984 59,900,010,496 100,159,488
60,000,169,984 59,848,007,680 152,162,304
60,000,169,984 59,795,513,344 204,656,640
59,999,907,840 59,743,477,760 256,430,080
60,000,038,912 59,692,097,536 307,941,376
60,000,169,984 59,641,208,832 358,961,152
60,000,038,912 59,589,992,448 410,046,464
60,000,169,984 59,539,005,440 461,164,544
60,000,169,984 59,487,657,984 512,512,000
60,000,038,912 59,434,868,736 565,170,176
60,000,038,912 59,383,259,136 616,779,776
60,000,169,984 59,331,518,464 668,651,520
60,000,169,984 59,279,843,328 720,326,656
60,000,169,984 59,228,626,944 771,543,040
59,999,907,840 59,176,984,576 822,923,264
60,000,038,912 59,124,326,400 875,712,512
60,000,169,984 59,072,454,656 927,715,328
60,000,169,984 59,020,156,928 980,013,056
60,000,038,912 58,966,974,464 1,033,064,448
60,000,038,912 58,913,628,160 1,086,410,752
60,000,038,912 58,858,840,064 1,141,198,848
60,000,169,984 58,804,314,112 1,195,855,872
59,999,907,840 58,748,936,192 1,250,971,648
60,000,169,984 58,695,131,136 1,305,038,848
60,000,169,984 58,642,800,640 1,357,369,344
60,000,169,984 58,589,782,016 1,410,387,968
60,000,038,912 58,535,124,992 1,464,913,920
60,000,169,984 58,482,925,568 1,517,244,416
60,000,169,984 58,429,775,872 1,570,394,112
60,000,038,912 58,376,658,944 1,623,379,968
60,000,169,984 58,323,247,104 1,676,922,880
60,000,038,912 58,271,113,216 1,728,925,696
60,000,038,912 58,216,292,352 1,783,746,560
60,000,038,912 58,164,289,536 1,835,749,376
60,000,038,912 58,112,090,112 1,887,948,800
60,000,038,912 58,058,350,592 1,941,688,320
59,999,907,840 58,004,971,520 1,994,936,320
60,000,169,984 57,953,165,312 2,047,004,672
59,999,907,840 57,900,277,760 2,099,630,080
60,000,038,912 57,847,586,816 2,152,452,096
60,000,169,984 57,793,421,312 2,206,748,672
59,999,907,840 57,741,582,336 2,258,325,504
60,012,826,624 57,734,840,320 2,277,986,304
60,098,793,472 57,820,348,416 2,278,445,056
60,176,334,848 57,897,889,792 2,278,445,056
60,269,826,048 57,991,380,992 2,278,445,056
59,687,481,344 57,851,977,728 1,835,503,616
59,049,836,544 57,888,108,544 1,161,728,000
58,406,068,224 57,929,551,872 476,516,352
43,837,923,328 43,837,919,232 4,096
18,124,546,048 18,124,541,952 4,096
2,846,720 2,842,624 4,096
---------------------------------------------------------------------------

I have also attached plots of the memory pressure reported by PSI. Both these
sets of data should give a sense of the added memory pressure on the cgroup
because of zswap mTHP stores. The data shows that the cgroup is over the limit
much more frequently in the "After" case than in the "Before" case. However,
the rationale that you suggested is more applicable, and more apparent, in the
1-process case.

However, with 70 processes each trying to allocate 1G, things get more
complicated. These are the functions that should provide more clarity:
[1] mm/memcontrol.c: mem_cgroup_handle_over_high().
[2] mm/memcontrol.c: try_charge_memcg().
[3] include/linux/resume_user_mode.h: resume_user_mode_work().

At a high level, when the zswap mTHP compressed pool usage starts counting
towards the cgroup's memory.current, there are two inter-related effects that
ultimately cause more reclaim to happen:

1) When a process reclaims a folio and zswap_store() writes out each page in
   the folio, it charges the compressed size to the memcg via
   "obj_cgroup_charge_zswap(objcg, entry->length);". This calls [2] and sets
   current->memcg_nr_pages_over_high if the limit is exceeded. The comments
   towards the end of [2] are relevant.

2) When each process returns from a page-fault, it checks in [3] whether the
   cgroup memory usage is over the limit, and if so, it triggers reclaim.

I confirmed that in the case of usemem, all calls to [1] occur from the code
path in [3]. My takeaway is that the more reclaim results in zswap_store()
calls (e.g., from mTHP folios), the higher the likelihood of an overage being
recorded per-process in current->memcg_nr_pages_over_high. This could cause
each process to reclaim memory, even though the swapout from just a few of the
70 processes might already have brought the parent cgroup under the limit.
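
To make the accounting effect more concrete, here is a small userspace toy
model. This is clearly not kernel code: the file name toy_accounting.c, the
~3:1 compression ratio, the GiB-based limit and the one-page-at-a-time reclaim
loop are all simplifying assumptions on my part. It only illustrates the
direction of the effect, namely that more pages end up being reclaimed when the
compressed copy of each reclaimed page stays charged to the same cgroup; it
does not attempt to reproduce the measured numbers.

/*
 * toy_accounting.c - userspace toy model (NOT kernel code) of the
 * accounting effect discussed above: in the "After" case the compressed
 * copy of each reclaimed page stays charged to the cgroup (analogous to
 * zswap_store() charging entry->length via obj_cgroup_charge_zswap()),
 * so the cgroup has less headroom below memory.high and ends up
 * reclaiming more pages overall.  All constants are illustrative.
 *
 * Build/run:  cc -O2 toy_accounting.c -o toy_accounting && ./toy_accounting
 */
#include <stdio.h>

#define PAGE_SIZE   4096ULL
#define LIMIT       (60ULL << 30)   /* memory.high, ~60G              */
#define NPROC       70              /* 70 usemem processes            */
#define ALLOC       (1ULL << 30)    /* each process allocates 1G      */
#define RATIO       3               /* assumed ~3:1 compression ratio */

int main(void)
{
    /* pass 0 models "Before": swapped-out pages leave the cgroup (SSD).
     * pass 1 models "After":  the compressed copy stays charged.       */
    for (int charge_zswap = 0; charge_zswap <= 1; charge_zswap++) {
        unsigned long long usage = 0, zswap_pool = 0, reclaimed = 0;

        for (int p = 0; p < NPROC; p++) {
            for (unsigned long long a = 0; a < ALLOC; a += PAGE_SIZE) {
                usage += PAGE_SIZE;          /* process p faults in a page */
                while (usage + zswap_pool > LIMIT) {
                    usage -= PAGE_SIZE;      /* reclaim (swap out) one page */
                    if (charge_zswap)
                        zswap_pool += PAGE_SIZE / RATIO;
                    reclaimed++;
                }
            }
        }
        printf("%s pages reclaimed = %llu\n",
               charge_zswap ? "After  (zswap charged):" : "Before (SSD swap):     ",
               reclaimed);
    }
    return 0;
}

Under these assumptions the "After" pass reclaims roughly 50% more pages than
the "Before" pass; the real gap in the 70-process runs is larger, presumably in
part because of the per-process over-high handling described above.
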
Please do let me know if you have any other questions. Appreciate your feedback
and comments.
Thanks,
Kanchana
>
> --
> Best Regards,
> Huang, Ying
>
> > Thanks,
> > Kanchana
> >
> >>
> >> > |-----------------------------------------------------------------------|
> >> > | thp_swpout | 0 | 0 | 0 |
> >> > | thp_swpout_fallback | 0 | 0 | 0 |
> >> > | pgmajfault | 3,364 | 3,650 | 3,431 |
> >> > |-----------------------------------------------------------------------|
> >> > | hugepages-64kB/stats/zswpout | | 63,200 | 63,244 |
> >> > |-----------------------------------------------------------------------|
> >> > | hugepages-64kB/stats/swpout | 10,902 | 0 | 0 |
> >> > -----------------------------------------------------------------------
> >> >
> >>
> >> [snip]
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
[Attachments: usemem_n1_Before.jpg, usemem_n1_After.jpg (PSI memory pressure plots)]